Recommendation engines are here to stay! These tools powered by machine learning are slowly touching every part of our lives. They serve as book experts when we don’t know what to read, they are our trusted stylists always recommending us fashionable garments, and they’re even movie critics and travel agents. With recommendation engines so prevalent in our daily lives, I believe we all ought to know their basic operating principles. Otherwise, how are we to surrender some of our choices to these black boxes, regardless of how much they appear to know about us?

Recommendation engines come in many shapes and forms [1]. For simplicity (which is nothing more than technical lingo for “I’m not crazy enough to attempt to sum up thousands and thousands of pages in a blog post”), I will discuss one type of recommendation engine: collaborative filtering (referred to as CF).

Collaborative filtering is one of the most widely used forms of recommendation engine. The main building blocks of these engines are a collection of people (users) who rated some objects (items) based on their personal preference. Note that the users can be anything, not just people; they can easily be substituted with any objects, real or abstract. The only requirement is that these objects have something in common, for example, a set of meta-properties.

Collaborative filters are nothing more than a group of algorithms for predicting the unknown ratings of a user for the items she hasn’t seen yet, using information solely from other users with similar preferences (see figure above). Interestingly, collaborative filters don’t place any constraints on the type of items that they recommend. All you need is a group of items and a group of users that have recorded their preference for some of those items (see Chapter 2 in [1] for a mathematical treatment of these ideas).

Assume that you have a doughnut shop where you sell four types of doughnuts. You have been on the market for some time now, and you are thinking of taking your service to the next level by implementing a collaborative filter for your online customers. To formulate this as a recommendation engine, you have a set of users whose preferences you know (users in the blue rectangle in the figure above). Then you have Jimmy, the current customer, who likes chocolate doughnuts but doesn’t know what to purchase next. This is the perfect scenario for a collaborative filter. Using the information from the previous users together with Jimmy’s preferences, the recommendation engine predicts that Jimmy would like a “Pink Glazed Doughnut” with a confidence of, say, 90%. Imagine how powerful this tool is. You have not only shown Jimmy an alternative product that he can enjoy as much as, or even more than, the products he usually purchases, but you have also decreased the time it takes for Jimmy to convert from a browsing, undecided customer to one that made a purchase. And all you had to do was to build a collaborative filter. How cool is that!?
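To make the Jimmy scenario a bit more concrete, here is a minimal sketch of a user-based collaborative filter in plain Python. Everything in it is an assumption made for illustration: the user names, the 1 to 5 rating scale, the ratings themselves, and the choice of cosine similarity over co-rated items as the way to weight other users. It predicts a rating rather than the confidence figure mentioned above, and it is only one of many ways such a filter could be built.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale; names and numbers are made up.
ratings = {
    "ann":   {"chocolate": 5, "pink_glazed": 5, "plain": 2},
    "bob":   {"chocolate": 4, "pink_glazed": 4, "jam": 3},
    "cara":  {"chocolate": 1, "pink_glazed": 2, "plain": 5},
    "jimmy": {"chocolate": 5, "plain": 1},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when nobody comparable has rated the item.
    """
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        weight = cosine(ratings[user], theirs)
        num += weight * theirs[item]
        den += abs(weight)
    return num / den if den else None

print(predict("jimmy", "pink_glazed", ratings))
```

With this toy data the prediction leans towards the ratings of Ann and Bob, whose tastes overlap most with Jimmy's, and away from Cara's.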

As you can imagine, collaborative filtering algorithms are far from being perfect. In the remainder of this article, I will briefly cover some of the important points to consider when designing and implementing a collaborative filter.

1. Cold start problem

Any collaborative filter relies on a database of users and their preferences. If you are new on the market, or you have just recently started to collect data about your users, then you might have a bit of a problem. The filter won’t have any data to select similar users, nor will it have any data about their preferences, so the standard collaborative filter algorithms will not work. Fortunately, there are a few alternatives to help you out with the cold start problem. The most common approach is to build a hybrid recommendation engine that uses a knowledge engine in the first days and then shifts to a collaborative filter once you gather sufficient data. Another option is to start with a content-based recommendation engine, whereby you look for items similar to the ones that the target user was browsing or has previously purchased. As with the hybrid between knowledge and collaborative filters, you can abandon the content-based recommendation engine once you have sufficient data, or you can continue to use it alongside the CF.
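As a rough illustration of the content-based fallback, the sketch below ranks items by how many attributes they share with an item the user has already looked at. The item names, the attribute tags and the use of set overlap as the similarity measure are all hypothetical choices made for this example; a real catalogue would have its own item metadata.

```python
# Hypothetical item attributes; a real shop would use its own metadata.
items = {
    "chocolate":   {"glazed", "cocoa"},
    "pink_glazed": {"glazed", "sprinkles"},
    "plain":       {"unglazed"},
    "double_choc": {"glazed", "cocoa", "filled"},
}

def overlap(a, b):
    """Share of attributes two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def content_based(seed, items, top_n=2):
    """Return the `top_n` items whose attributes are closest to `seed`."""
    scores = {
        name: overlap(items[seed], attrs)
        for name, attrs in items.items()
        if name != seed
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(content_based("chocolate", items))
```

Because it needs no ratings at all, a ranking like this can run from day one and be replaced by, or blended with, the collaborative filter later.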

2. Similarity function

The predicted ratings greatly depend on what similarity function we choose. The similarity function is responsible for selecting the other users’ ratings that are then combined into the final prediction. Standard approaches for numerical data are the cosine similarity or the Pearson correlation coefficient. When dealing with binary data, it is common to use the Jaccard similarity metric. It is important to note that the choice of a similarity function is one of the most important decisions you have to make when implementing a CF.
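For reference, the three measures mentioned above can be sketched in a few lines of plain Python. The function names are my own, and the sketch assumes that the two rating vectors are non-empty, have the same length and are aligned on the same items.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def pearson(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])

def jaccard(a, b):
    """Jaccard similarity for binary data, given as sets of liked items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [3, 2, 1]))
print(jaccard({"chocolate", "jam"}, {"jam", "plain"}))
```

Note how the first two differ: cosine similarity only looks at the direction of the raw ratings, while Pearson first removes each user's average, which makes it less sensitive to users who rate everything high or low.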

3. Bias

Imagine that in our example there are a few users who are more pessimistic about their food choices while others are super optimistic. This bias in user ratings will result in some users consistently having lower ratings while others will have higher ones. In this situation, if the similarity function doesn’t account for this type of overly pessimistic or overly optimistic ratings, the predicted value will be lower or higher than it would otherwise be, respectively. To account for this bias, it is customary to transform the ratings before running the collaborative filter. The most common transformation is to normalise each user’s ratings such that they are mean centred.
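Mean centring is simple enough to show directly. In the sketch below, the two users and their ratings are invented for illustration: one rates everything low, the other rates everything high, yet they rank the doughnuts in exactly the same order.

```python
def mean_center(user_ratings):
    """Subtract the user's average rating from each of their ratings.

    Returns the centred ratings and the mean, so that the mean can be
    added back to a prediction later on.
    """
    mean = sum(user_ratings.values()) / len(user_ratings)
    centred = {item: r - mean for item, r in user_ratings.items()}
    return centred, mean

# Hypothetical users with the same taste but different rating habits.
pessimist = {"chocolate": 2, "plain": 1, "jam": 3}
optimist = {"chocolate": 4, "plain": 3, "jam": 5}

print(mean_center(pessimist))
print(mean_center(optimist))
```

After centring, both users end up with identical rating profiles, so a similarity function sees them as the like-minded customers they are. A prediction made on centred ratings is then shifted back by adding the target user's own mean.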

4. Low execution speed

To predict one user’s rating, we need to run through a series of intensive operations of computing similarity scores and then predicting the final rating. Now imagine we are faced with predicting ratings for 10k users using the ratings of 100k other users. That task is virtually impossible without some clever hacks in how some of the computations are conducted. We are currently in the early stages of trying to speed up some of these algorithms using various distributed computing frameworks, such as Hadoop and/or Spark, but more on that in a later post (when we actually make some real progress on this idea).

5. Spamming

What do you think will happen if half of the people used for making a recommendation are spammers (people who did not rate any item honestly)? The predictions would be far from accurate because these people will undoubtedly appear very similar to some of the genuine users, and their ratings would be included in the prediction. Over the years, there have been numerous developments on how to better select the people who are relevant for computing these ratings. One variation that I find interesting is the trust-aware recommendation engine, where in addition to the group of similar users, we also identify which users are trustworthy for predicting the ratings.

To sum up, recommendation engines based on a collaborative filtering algorithm predict a user’s ratings using information solely from other similar users. In other words, they cleverly use other people’s preferences to derive what a user likes. Therefore, collaborative filters are a good starting point when building any recommendation engine where the properties of the items being recommended are partially or totally unknown.



[1] Charu C. Aggarwal, Recommender Systems, Springer International Publishing, 2016