MusicMagal

Written by alexing | Published 2018/02/14
Tech Story Tags: machine-learning | recommended-music | deep-learning | group-recommendation | music-recommendations

It was about time somebody group-recommended music.

Fan of music? Fan of recommender systems? If you’re like me and my friend Daniel Franch, you love music passionately and make extensive use of services like Spotify and Last.Fm to find new music (and track your old musical habits).

These two, along with others like Shazam, Apple Music and Pandora (just to name a few), use machine learning, signal processing, powerful hardware and HUGE music- and user-related databases to quietly and elegantly deliver fresh albums, tracks, playlists and artists that you’ll most probably like.

In Spotify’s case, ‘Discover Weekly’ hands me a playlist of 30 new tracks tailored to my taste each week. A couple of friends share my taste in music (at least to some degree) and follow my version of this playlist, since I’ve made it public. Still, this doesn’t address one little issue.

What if we wanted music recommended for ALL of us AT THE SAME TIME?

Daniel and I started thinking about applications for a hypothetical product. What if you’re pregaming with friends before going to a party and the playlist selected needs to please everyone? Or a study session? Or a gym? Why not even a night club? Imagine that. A night club where everybody logs their preferences at the entrance and the musical enjoyment is optimized evenly on a group level.

Why isn’t Spotify tackling this? Why aren’t the big players on this field offering a product like the one described? We decided to research the question, and what we found led to the creation of MusicMagal.

If you prefer the code version of this story, you can check the public repository.

So, let’s dive deeper into the subject.

Recommendation systems

Spotify, Last.Fm, Shazam, Pandora and Apple Music. Just to name a few on the music market.

There is an extensive class of web applications that involve predicting user responses to options. Such a facility is called a recommendation system. Three good examples of recommendation systems are:

  • Offering news articles to on-line newspaper readers, based on a prediction of reader interests.
  • Offering customers of an on-line retailer suggestions about what they might like to buy, based on their past history of purchases and/or product searches.
  • Offering users new music based on their past listening habits (the case on which our project is based).

Recommendation systems use a number of different technologies. We can classify these systems into two broad groups, each with its own advantages and disadvantages.

  • Content-based systems examine properties of the items recommended. For instance, if a Spotify user has listened to many metal songs, the system recommends songs tagged in the database with the “metal” genre. This allows for rich artist variety: many pieces, including ones nobody has rated, can be recommended. To achieve it, user preferences must be tied to the music content itself, working from a real-world database in which most users mark only a few pieces as favorites; reliable methods for doing this have not been established, since most research attention has gone to music retrieval systems where users explicitly write queries expressing their preferences.
  • Collaborative filtering systems recommend items based on similarity measures between users and/or items: the items recommended to a user are those preferred by similar users. If Alex has listened to ‘Teenage Dream’ and ‘Can’t Stop the Feeling’ and Daniel has listened to ‘Teenage Dream’, ‘Can’t Stop the Feeling’ and ‘Despacito’, then Alex will most probably like ‘Despacito’ (a minimal sketch of this follows the list). One major disadvantage lies in items that have never been rated: newly released albums and obscure songs cannot be recommended, so there are few chances of encountering unexpected favorites. A second problem is that the artists of the recommended pieces tend to repeat and are often already well known to the target user, making such recommendations unsatisfying or even meaningless.
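As a minimal illustration of the collaborative-filtering intuition (the play counts below are made up for the example, not taken from our dataset):

```python
import numpy as np

# Hypothetical play counts for the three songs in the example above:
# ['Teenage Dream', "Can't Stop the Feeling", 'Despacito'].
# A zero means "never listened", not "disliked".
alex = np.array([3, 5, 0])
daniel = np.array([4, 5, 2])

# Cosine similarity between the two listening vectors (~0.95 here),
# so Daniel's 'Despacito' plays make it a candidate for Alex.
similarity = alex @ daniel / (np.linalg.norm(alex) * np.linalg.norm(daniel))
```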

Group recommendation

Group recommendation is a newer task in which an item is recommended to a group of users who will consume it together. The challenge in such a task is modeling and matching the taste of the entire group even though only individual preferences are known. The academic literature proposes various alternatives for combining individual preferences into a group taste. After researching and reading a plethora of interesting papers, we decided to give our own custom approach a shot.

Tools

Python 3 was our main ally during the project, as everything was developed in this language, sometimes as a pure Python script but mostly in Jupyter Notebooks using some Python library or framework. Pandas was our main data analysis and exploration tool, and we used Keras and TensorFlow for the deep learning part (yes! deep learning! ❤). Also, in the first part of our pipeline, we used Implicit, a fast Python library for collaborative filtering on implicit datasets.

Dataset

At first our main idea was to build our own dataset using Spotify’s Web API, or rather its Python wrapper, Spotipy. However, to get a sample large enough to be representative, we would have needed lots of users entering their credentials and lots of queries to fetch as much of their listening history as we could (Spotify’s API does not let you access your whole history at once). This, evidently, was going to be painful, slow, and not worth the suffering.

Instead, we found an already-scraped dataset online, sourced from Last.Fm. The key feature of this dataset is that it’s implicit: we don’t have a direct, or explicit, measure of quality (e.g. how many stars out of five), but an indirect one, namely how many times each song was listened to. Thus, implicit. Doing some exploration, we found some interesting info.

This is what the dataframe looked like. Basically, each row counts as one play of track i by user u. In the first row sampled, you can see user ‘user_000517’ played track ‘871f38c3-357a-467c-91ac-f0f3aa9db1f7’, which is ‘Push It’ by Garbage. Each row also contains a timestamp.
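For reference, loading it looks roughly like this (a sketch; the file name and column order follow the Last.Fm 1K dump we used, and the local path is a placeholder):

```python
import pandas as pd

# Tab-separated, no header row; the occasional malformed line is skipped.
cols = ['user_id', 'timestamp', 'artist_id', 'artist_name', 'track_id', 'track_name']
df = pd.read_csv('userid-timestamp-artid-artname-traid-traname.tsv',
                 sep='\t', names=cols, on_bad_lines='skip')
df.sample(5)  # each row is one play of track_id by user_id at a timestamp
```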

From the description we also got interesting insight:

  • Last.FM users tend to like Radiohead
  • The most common song title is Intro (which makes sense)
  • We got data from ~1K users
  • There are ~960K different tracks, which makes it pretty diverse for a recommendation system
  • ~107K different artists
  • The ‘unique’ counts for artist_id and artist_name don’t match… what does this mean? More on this later.

So, following the Radiohead clue, we came across the following:

As the most-listened-to track is not ‘Despacito’ or ‘Gangnam Style’, we suspected the dataset was biased toward indie rock.

Aside from the funny typo in Ian Curtis’ band’s name, this confirms it: the dataset is indeed biased toward indie rock. And, yay! Radiohead is amazing. So, what about the artist id and artist name counts not coinciding?

One reason, apparently, is that some artist_ids are NaN. The other has to do with the fact that some artists’ names are spelled in different ways. For example:

With all this in consideration, we decided to still use the dataset, mostly as a stepping stone toward our Minimum Viable Product.

Utility matrix

In order to start working on the data and create a model, we went from that dataframe to a utility matrix.

In a recommendation-system application there are two classes of entities, which in our case we shall refer to as users and tracks. Users have preferences for certain tracks, and these preferences must be teased out of the data. When the data is represented as a utility matrix, each user–track pair holds a value representing what is known about that user’s degree of preference for that track. In our case, the only measurement of preference we have is the number of times the user listened to the track. We assume the matrix is sparse, meaning most entries are “unknown”: an unknown rating implies we have no explicit information about the user’s preference for the track.

A (super) small example from our utility matrix. Users are the columns and tracks are the rows.

In our case, not only was the matrix sparse… it was HYPER sparse.

Yes. 7.75 GB in memory is way more than my MacBook Pro can handle.

Scipy to the rescue! csr_matrix let us work comfortably with a 56 MB structure.
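A sketch of the construction, assuming the dataframe from before (variable names are illustrative):

```python
import scipy.sparse as sparse

# Aggregate raw plays into listen counts per (track, user) pair.
plays = df.groupby(['track_id', 'user_id']).size().reset_index(name='listens')

# Turn ids into integer indices and build a tracks x users CSR matrix;
# only the non-zero (track, user) cells are actually stored.
track_idx = plays['track_id'].astype('category').cat.codes
user_idx = plays['user_id'].astype('category').cat.codes
R = sparse.csr_matrix((plays['listens'], (track_idx, user_idx)))
```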

That’s roughly a 138x reduction in size (7.75 GB down to 56 MB)!

The goal of a recommendation system is to predict the blanks in the utility matrix. For example, would user Alex like ‘Despacito’? There is little evidence from the tiny matrix. We should also be aware of a slightly different goal that makes sense in many applications: it is not necessary to predict every blank entry, only to discover some entries in each row that are likely to be high. In most applications the recommendation system does not score every item for every user, but rather suggests a few that the user should value highly. It may not even be necessary to find all tracks with the highest expected listen counts, only a large subset of them.

Enough chit-chat about the dataset; let’s dive deep into our solution.

Solution

We implemented a multi-step approach to the problem. To guide you through the pipeline, imagine we have 3 users who want to listen to music together and want a playlist made up of 10 songs.

ALS

The first step in our pipeline uses Alternating Least Squares (ALS), an algorithm especially suited to implicit recommendation systems, to come up with one (the best) song recommendation for each user. The algorithm assumes the negative items are implicitly defined: any non-zero entry in the item–users matrix means the user liked that item, and the negatives are simply left unset in the sparse matrix. For this step we’ll use Implicit to calculate each recommendation.
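A rough sketch of this step (hyperparameters are illustrative, and the API shown is the one Implicit exposed around the time of writing, which fits on an items x users matrix):

```python
from implicit.als import AlternatingLeastSquares

# R is the tracks x users CSR matrix of listen counts from before.
model = AlternatingLeastSquares(factors=64, regularization=0.01, iterations=15)
model.fit(R)

# recommend() wants a users x items matrix; N=1 gives each user's single best pick.
user_items = R.T.tocsr()
group = [12, 57, 803]  # hypothetical user indices for our 3-person group
seeds = [model.recommend(u, user_items, N=1)[0][0] for u in group]
# (Mapping these matrix indices back to track ids is elided here.)
```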

By the end of this step, each user has the single song best suited to them: 3 users, so 3 songs.

Item2Vec

Item2vec is a deep learning technique based on word2vec. Word2vec is a Natural Language Processing technique that embeds words into a vector space that captures the semantics of the words.

It starts with the dummy task of training a neural network to predict the context of a word. That is, given the sentence “the cat sat on the mat”, if we fed this model the word “cat”, we would expect it to return the context “the … sat on the mat”. But that’s not the interesting part yet.

If we ignore the output of this model and take the results from its hidden layer, we see that a vector space for words was created, where similar words lie close to each other. It has very interesting algebraic features: taking the vector for the word KING, subtracting the vector for MAN and adding WOMAN results in the vector for QUEEN. Another neat thing: the distance from LONDON to ENGLAND is similar to the distance from PARIS to FRANCE, or from BERLIN to GERMANY. This embedding space captures some of the semantics of these words.

What does this have to do with song recommendations? We adapt the problem so that the sequence of songs a user listened to is a sentence in which each song is a word. We train the neural network to predict the context of a song, creating an embedding space in which each song in our vocabulary is represented by a vector and similar songs sit close together.
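Our model was built with Keras/TensorFlow; as a stand-in illustration, gensim’s skip-gram Word2Vec trained on listening “sentences” captures the same idea:

```python
from gensim.models import Word2Vec

# One "sentence" per user: their tracks in chronological play order.
sessions = (df.sort_values('timestamp')
              .groupby('user_id')['track_id']
              .apply(list)
              .tolist())

# sg=1 selects skip-gram, the word2vec variant item2vec is based on;
# parameter names follow gensim >= 4.
i2v = Word2Vec(sessions, vector_size=100, window=5, sg=1, min_count=5, epochs=5)
song_vec = i2v.wv[some_track_id]  # embedding lookup for a track id (hypothetical variable)
```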

At first we got some pretty interesting results, with the model placing songs from the same album really close together. On reflection, though, this is unsurprising, as people tend to listen to whole albums in sequence. We then retrained the model with fewer iterations so it would underfit a little and give some more surprising recommendations.

In our running example, this means we take the 3 songs from the previous step and get the 3 respective song vectors from the embedding space.

Calculating the median

Nothing more than basic algebra combined with basic statistics: the median is the value separating the higher half of a data sample, population, or probability distribution from the lower half.

We compute the median vector (i.e. the median of each component) of the recommended song vectors. So out of the 3 song vectors we obtain a single one.
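In code, that’s nearly a one-liner (seed_track_ids stands for the ALS picks from the first step):

```python
import numpy as np

# Stack the 3 seed-song embeddings and take the component-wise median.
seed_vecs = np.stack([i2v.wv[t] for t in seed_track_ids])  # shape (3, 100)
group_vec = np.median(seed_vecs, axis=0)                   # shape (100,)
```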

kNN

kNN is one of the most popular machine learning algorithms, a non-parametric method used for classification and regression. In this case we’ll use it only to retrieve nearest neighbors, without worrying about the classification itself.

Basically, and put simply, we find the k songs most similar to the output of our last step (k being the number of songs requested for the final playlist, so k=10). We are left with 10 vectors. And what about the songs, you may ask?
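A sketch with scikit-learn’s NearestNeighbors, fit over the whole embedding matrix (variable names carry over from the sketches above):

```python
from sklearn.neighbors import NearestNeighbors

# One row per song in the item2vec vocabulary.
embeddings = i2v.wv.vectors

# Find the 10 songs whose vectors sit closest to the group's median vector.
knn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(embeddings)
_, neighbor_idx = knn.kneighbors(group_vec.reshape(1, -1))  # indices, nearest first
```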

Reverse dictionary

The vectors go through a reverse dictionary where vectors are keys and songs are values.
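In the gensim stand-in, the model’s index-to-token list plays that role, mapping each row of the embedding matrix back to its track id:

```python
# Row index in the embedding matrix -> track id, i.e. from vectors back to songs.
playlist = [i2v.wv.index_to_key[i] for i in neighbor_idx[0]]
```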

Thus, from 10 vectors we get our 10 songs and voilà!

Results

Using Spotipy, we are able to translate the 0s and 1s of our model into 0s and 1s in your Spotify account.
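A hedged sketch of that last mile (credentials come from the environment; the username and the (artist, title) pairs are placeholders):

```python
import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope='playlist-modify-public'))

# Look each recommended song up on Spotify and collect its URI.
uris = []
for artist, title in recommended:  # hypothetical (artist, title) pairs
    hits = sp.search(q=f'artist:{artist} track:{title}', type='track', limit=1)
    items = hits['tracks']['items']
    if items:
        uris.append(items[0]['uri'])

# Create a playlist in the user's account and fill it with the matches.
playlist = sp.user_playlist_create('my_username', 'MusicMagal group mix')
sp.user_playlist_add_tracks('my_username', playlist['id'], uris)
```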

Pretty cool, huh?

Evaluation

We arrive at a tricky part of the project. If we are unable to evaluate our model, we can’t be sure it works, and the work would be in vain. It is important to realize that we have no reliable feedback about which tracks are unloved, as not listening to a song can stem from many different reasons. In addition, we are currently unable to track user reactions to our recommendations. Thus precision-based metrics are not very appropriate, since they require knowing which tracks a user does not want. However, listening to a song is an indication of liking it, which makes recall-oriented measures applicable.

We evaluate a scenario where we generate, for each user, a list of songs ordered from the one predicted to be most preferred down to the least preferred, and then present a prefix of that list to the user as the recommendations. Let’s take a look at the rank metric we used.

For all users u and tracks i,

  • r_ui is the number of times user u listened to song i, i.e. R[u,i] if R is the utility matrix
  • rank_ui is the percentile ranking of song i within the ordered list of all songs prepared for user u.
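Combining the two (following Hu et al., referenced below), the overall score is the listen-weighted average percentile rank:

```latex
\overline{\mathrm{rank}} = \frac{\sum_{u,i} r_{ui}\,\mathrm{rank}_{ui}}{\sum_{u,i} r_{ui}}
```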

Lower values of rank are more desirable, as they indicate that the songs a user actually listened to were ranked closer to the top of their recommendation list. Notice that for random predictions the expected value of rank_ui is 50% (placing i in the middle of the sorted list), so a rank of 50% indicates an algorithm no better than random.

To get a sense of the algorithm’s performance, we make recommendations for 100 groups of one person, 100 groups of two people, and so on. Because of hardware limitations, we were only able to test our model on groups of up to 5 people. We evaluate the scores for each of these groups and, for each group size, take the average score and plot it.

We also compute the average cosine similarity for each of these groups. The cosine similarity of two users is 1 minus the cosine distance between their columns in the utility matrix. We compute the similarity for each pair of users in the group and then average them. Groups with similarity close to 1 are quite homogeneous, and it should be easier to make recommendations for them; similarity close to 0 marks groups that are very dissimilar and probably shouldn’t even be hanging around together. We also average the similarities for each group size and plot them.
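A sketch of that computation, assuming a densified utility matrix with one column per user:

```python
import numpy as np
from itertools import combinations

def group_similarity(R_dense, users):
    """Average pairwise cosine similarity of a group's utility-matrix columns."""
    def cos(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pairs = list(combinations(users, 2))
    if not pairs:
        return 1.0  # a one-person group is trivially similar to itself
    return float(np.mean([cos(R_dense[:, a], R_dense[:, b]) for a, b in pairs]))
```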

Here are the results.

As expected, the similarity for single-user groups is 1 (if you’re not similar to yourself, you’re probably in deep trouble, buddy), and it falls off steeply as group size grows, which also makes sense, since recommending for larger groups should be a harder task.

Our results show that our recommendations are considerably better than random, and that they get worse as the similarity diminishes. We have a bit of an outlier for groups of size 2, but that’s probably due to the small number of groups we were taking into account.

That being said, the scores are still in the 20–30% range, indicating there is a lot that can be done to improve these results. Which is a good thing: we can still have some more fun with this project!

Future lines of work

The model we created, though fun and somewhat effective, is not scalable, since it is tied to the ~1K users and ~1M tracks in the dataset. It would be amazing to work with a big company that owns a more comprehensive database, or the means to quickly add a column and more rows and retrain the algorithm; in the meantime, such a solution could be designed.

Also, we could try to upload our model to cloud computing services like AWS or Google Cloud and run tests up to 100 users to see how it behaves.

The item2vec approach is pretty novel for group recommendations, and there are many variations that could be tested, such as making K recommendations for each user and then taking the median vector of each set, or clustering song vectors together and taking a sample from each cluster.

Serving the model online and coding a nice front-end user interface are also in our future plans.

Again, if you’d like the code version of this story, you can check out the public repository.

References

Mezei, Zsolt, and Carsten Eickhoff. “Evaluating Music Recommender Systems for Groups.” arXiv preprint arXiv:1707.09790 (2017).

Yoshii, Kazuyoshi, et al. “Hybrid Collaborative and Content-based Music Recommendation Using Probabilistic Model with Latent User Preferences.” ISMIR. Vol. 6. 2006.

Parra, Denis, et al. “Implicit feedback recommendation via implicit-to-explicit ordinal logistic regression mapping.” Proceedings of CARS-2011 (2011).

Hu, Yifan, Yehuda Koren, and Chris Volinsky. “Collaborative filtering for implicit feedback datasets.” Data Mining, 2008. ICDM ’08. Eighth IEEE International Conference on. IEEE, 2008.

Barkan, Oren, and Noam Koenigstein. “Item2vec: neural item embedding for collaborative filtering.” Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on. IEEE, 2016.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge university press, 2014.

