Pizza Or Salad? Build A Recommendations Algorithm In Apache Mahout.

Whether you’re recommending products, finding similar users or just identifying correlations, Apache Mahout is a useful library of machine learning algorithms for Hadoop.

If you’re not already familiar with machine learning, it focuses on learning from a given dataset to make predictions on unseen data without explicit programming. In other words, you don’t need to tell your program directly how to get to a result – it draws conclusions from signals in the data instead.

Amazon’s product suggestions are a familiar example of a recommendation algorithm in action.

The algorithm used by Amazon is called collaborative filtering. In this tutorial, I am going to cover content based filtering and collaborative filtering – both implemented in Apache Mahout.

Content based filtering

Content based filtering is an unsupervised mechanism that relies on the attributes of the items and on the preferences and profile of the user.

“Unsupervised” means learning algorithms that try to find correlations without any external inputs other than the raw data. They do not try to find any predefined logic in it – they cluster the data to determine groups. The data is grouped based on similarity, but not labelled.

For example, if a user views a movie with a certain set of attributes such as genre, actors, and awards, the system recommends items with similar attributes. The preferences of the user are mapped to the attributes or features of the recommended item.

E.g. I watch Star Wars I and II. The recommendation algorithm will probably propose Star Wars III, as well as other sci-fi movies sharing the same genre or actors.
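To make this concrete, here is a toy sketch (plain Java, not Mahout code) of content based similarity as overlap between attribute sets, using the Jaccard coefficient. The attribute sets are made up for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class ContentSimilarity {
    // Jaccard similarity between two attribute sets:
    // |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical).
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Hypothetical attribute sets – not from any real catalogue.
        Set<String> starWarsI = Set.of("sci-fi", "space", "lucas");
        Set<String> starWarsIII = Set.of("sci-fi", "space", "lucas", "sequel");
        Set<String> romCom = Set.of("romance", "comedy");
        System.out.println(jaccard(starWarsI, starWarsIII)); // high overlap
        System.out.println(jaccard(starWarsI, romCom));      // no overlap
    }
}
```

A real content based recommender would rank candidate items by such a score and return the top ones the user has not seen yet.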

Collaborative filtering

Collaborative filtering approaches consider the notion of similarity between items and users. Unlike content based filtering, no product features or user properties are considered here.

It is a supervised learning approach – the algorithm is fed with labelled data in order to make decisions and find patterns in it. Because all of your data is labelled, you know exactly what type of data you give to the algorithm, and it can be categorised.

The collaborative filtering approach uses historical data on user behaviours such as clicks, views, and purchases to provide better recommendations.

The algorithm learns from the users, to better understand their needs. Amazon product recommendations are a typical example of this.

In collaborative filtering, for each item or user, a neighbourhood is formed with similar related items or users. Once you act on an item – view it, rate it, buy it – the recommendations are drawn from that neighbourhood.

Collaborative filtering can be achieved using the following techniques:

  • User based recommendation
  • Item based recommendation

Collaborative filtering 1: User based recommendations

In user based recommendations, similar users from a given neighbourhood are identified and the item recommendations are given based on what similar users already bought or viewed.

Imagine a dude, D1, who likes pizzas and salads.

Dude D2 also likes pizzas and salads – but also beers.

Both of them are pretty similar; they have the same tastes.

So, the algorithm will suggest beers to dude D1, based on his tastes and his similarity with dude D2.
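The pizza/salad example can be sketched in a few lines of plain Java (a hand-rolled illustration of the idea, not the Mahout implementation), treating each dude’s likes as a binary set and using Tanimoto (Jaccard) similarity:

```java
import java.util.HashSet;
import java.util.Set;

public class UserBasedToy {
    // Tanimoto (Jaccard) similarity between two users' liked-item sets.
    public static double similarity(Set<String> u1, Set<String> u2) {
        Set<String> inter = new HashSet<>(u1);
        inter.retainAll(u2);
        Set<String> union = new HashSet<>(u1);
        union.addAll(u2);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Recommend what the similar neighbour likes but the target user
    // has not tried yet.
    public static Set<String> recommend(Set<String> target, Set<String> neighbour) {
        Set<String> rec = new HashSet<>(neighbour);
        rec.removeAll(target);
        return rec;
    }

    public static void main(String[] args) {
        Set<String> d1 = Set.of("pizza", "salad");
        Set<String> d2 = Set.of("pizza", "salad", "beer");
        System.out.println(similarity(d1, d2)); // 2 shared items out of 3
        System.out.println(recommend(d1, d2));  // [beer]
    }
}
```

A real user based recommender does the same thing at scale: find the most similar users, then surface the items they liked that you have not seen.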

Another example, with movies: 5 users rate several movies from 1 to 10, with 10 being the best.

The ratings are stored in a .csv file:
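The original file is not reproduced here, but FileDataModel expects one preference per line in the form userID,itemID,rating. A file in that format might look like this (illustrative rows only – not the actual dataset that produced the results below):

```csv
1,1,5.0
1,4,3.0
2,1,4.0
...
```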

How could we implement this in Java?

DataModel model = new FileDataModel(new File("movie.csv"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Neighbourhood of the 2 most similar users
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// Top 2 recommendations for user 3
List<RecommendedItem> recommendations = recommender.recommend(3, 2);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}

The result is:

RecommendedItem[item:6, value:8.0]
RecommendedItem[item:3, value:5.181073]

The value for item 6 is higher than that of item 3 because both user 4 and user 5 have given item 6 a higher rating.

Even though user 1 and user 2 share some item interests (item 1, item 4) with user 3, they are not considered, due to the low ratings they gave to those co-occurring items.

More explanation of the code is available on the Mahout website.

In this example, what is a data model?

A data model represents how we read data from different data sources. In our code example, we used FileDataModel, which takes CSV input.

In addition, Apache Mahout supports the following data models.

  • JDBCDataModel – reads data from a database via JDBC
  • GenericDataModel – held in memory and populated through Java calls; suitable for small experiments
  • GenericBooleanPrefDataModel – like GenericDataModel, but stores only user–item associations, without preference values

Collaborative filtering 2: Item based recommendations

Item based recommendation measures the similarity between different items and picks the top k closest (most similar) items to a given item, to arrive at a rating prediction or recommendation for a given user and item.

For example, if I buy a CD of a rock band, the algorithm will propose other CDs similar to the one I bought. It is based only on the product.
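To make the idea concrete, here is a hand-rolled sketch (plain Java, not the Mahout implementation) of Euclidean-distance-based item similarity over made-up rating vectors. One common way to turn a distance into a similarity is 1/(1+d); Mahout’s EuclideanDistanceSimilarity uses a slightly different formula, but the intuition is the same:

```java
public class ItemSimilarityToy {
    // Euclidean distance between two items' rating vectors
    // (one entry per user who rated both items).
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Map a distance into a similarity in (0, 1]: identical vectors give 1.
    public static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] rockCd = {8, 7, 9};    // made-up ratings from three users
        double[] similarCd = {8, 6, 9};
        double[] classicalCd = {2, 3, 1};
        System.out.println(similarity(rockCd, similarCd));   // close items
        System.out.println(similarity(rockCd, classicalCd)); // distant items
    }
}
```

Items rated similarly by the same users end up close together, so the rock CD’s nearest neighbours are other rock CDs.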

How could we implement this algorithm in Java?

Same example as before, with 5 users who rate several movies.

DataModel model = new FileDataModel(new File("movie.csv"));
ItemSimilarity itemSimilarity = new EuclideanDistanceSimilarity(model);
Recommender itemRecommender = new GenericItemBasedRecommender(model, itemSimilarity);
// Top 2 recommendations for user 3
List<RecommendedItem> itemRecommendations = itemRecommender.recommend(3, 2);
for (RecommendedItem itemRecommendation : itemRecommendations) {
    System.out.println("Item: " + itemRecommendation);
}

Here is the result:

Item: RecommendedItem[item:2, value:7.7220707]
Item: RecommendedItem[item:3, value:7.5602336]

When the algorithm is based only on the items, the recommendations are totally different from the ones we have seen before. Here, the algorithm proposes items 2 and 3 instead of items 6 and 3 (as with user based recommendation).

Again, more explanation of the code is available on the Mahout website.

Assessing similarity

In the examples, I have used PearsonCorrelationSimilarity to find the similarity between two users.

It implements UserSimilarity, so it is well suited to computing user similarities.

We could also use it to compute item similarity, but in practice this is generally too slow to be useful.

Some other available similarity measures are listed below.

  • EuclideanDistanceSimilarity treats users or items as points in a space with one dimension per preference, the preference values being the coordinates along those dimensions, and measures the Euclidean distance between them. It will not work if you have not supplied preference values.
  • TanimotoCoefficientSimilarity is applicable when preference values consist of binary responses. It is the number of items two users have in common, divided by the total number of items they bought.
  • LogLikelihoodSimilarity is a measure based on likelihood ratios.
  • SpearmanCorrelationSimilarity compares the relative ranking of preference values instead of the preference values themselves.
  • UncenteredCosineSimilarity is an implementation of cosine similarity.
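As a rough sketch of what PearsonCorrelationSimilarity computes (plain Java, ignoring Mahout’s centering and weighting options), Pearson correlation is the covariance of two rating vectors divided by the product of their standard deviations; the rating vectors below are made up:

```java
public class PearsonToy {
    // Pearson correlation between two equally sized rating vectors:
    // +1 means identical taste, -1 means opposite taste.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
        double mx = sx / n, my = sy / n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] u1 = {9, 8, 2};  // made-up ratings over three shared movies
        double[] u2 = {8, 9, 1};  // similar taste
        double[] u3 = {1, 2, 9};  // opposite taste
        System.out.println(pearson(u1, u2)); // close to +1
        System.out.println(pearson(u1, u3)); // close to -1
    }
}
```

Note that the means are subtracted out, which is why Pearson correlation ignores how generous a rater is overall and looks only at how ratings move together.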

When choosing a similarity measure, keep in mind that not all datasets work with all measures: you need to consider the nature of your dataset.

To determine the optimal similarity measure for your scenario, get to know your data set well, and try out several similarity measures against your training data.


The optimal recommendation algorithm depends on the nature of the data and the scenario at hand.

As a rule of thumb, if you have fewer users than items, user based recommendation tends to perform better; if you have fewer items than users, item based recommendation gives better performance.
