YouTube's Recommendation Engine: Explained

Every successful tech product is, almost by definition, the result of impressive technology working together with an impeccable user experience to solve a key problem for its users. One such marvel is YouTube’s recommendation engine.

In this post, I am going to shed some light on YouTube’s very popular recommendation engine and why it stands out.

Have you ever wondered how you knock on YouTube’s door for just one video on some topic and end up killing hours of your day watching the recommended videos, or hop onto YouTube for your favourite music track and end up with a longer favourites list? Many content-driven enterprises are funnelling loads of capital into making their recommender systems as robust as YouTube’s, or more so, to make their websites more engaging for us.

These recommender systems are among the most familiar applications of Machine Learning that users encounter, used by almost every website that has some content to offer, be it YouTube, Spotify, Facebook, Instagram, etc. The algorithm responsible for YouTube’s recommendations seems to be Sherlock Holmes himself, who knows exactly which video will make the website more riveting for you. This article is about the gears that turn behind YouTube’s much-admired recommender system.

YouTube’s recommender system was built on Google Brain, which Google later open-sourced as TensorFlow. This made it easy for the entire world, Google included, to train, test and deploy deep neural networks in a distributed fashion. Shortly after that, at the 10th ACM Conference on Recommender Systems, Google’s delegates presented the deep neural network approach to recommender systems. Once Deep Learning and, more importantly, the potential concealed behind it had been unveiled, many problems that were traditionally handled with techniques like matrix factorization started being tackled with deep learning.

As explained multiple times by YouTube and its team, the sole purpose of YouTube’s recommendation system is to put fresh and relevant content in front of its users. The videos a user views, the channels they subscribe to: everything matters. All of this is explained at length in a research paper titled Deep Neural Networks for YouTube Recommendations.

The paper, presented by Google’s analysts and researchers at the ACM conference, gives insights into how to treat the recommender problem as a ranking problem that is affected by myriad factors. The most emphasized ingredient of all is user feedback. Figuratively, Henry Ford also has a contribution here: one of the most important techniques in the bag of tricks is Multi-gate Mixture-of-Experts, a multi-task learning technique that shares expert submodels across all tasks to manage multiple objectives, while a gating network is trained for each task to decide how much each expert contributes, much like an assembly line.

If you really ponder over it, there are just two impediments on the way to solving any recommender problem. People often consider recommender systems only from the user’s shoes, but the videos that have to be ranked are an equally integral part of these systems. YouTube ranks its videos on multiple features like video content, thumbnail, audio, title, description, etc. The primary requirement, according to the researchers, was to bridge the semantic gap between these low-level video features so that they end up on a single scale, and then to make the model capable of distributing the items sparsely over the feature space.

Once this sparsity was attained, you would assume that it was all rainbows and sunshine. But not for the world’s most popular video platform. With popularity came users, and with users came scalability issues: for example, some features required by the model were only available online and could not be fetched beforehand. Also, due to the sparse distribution of the items, it was difficult for matrix factorization approaches to scale across the entire feature space.

To solve this scalability issue, the video corpus containing billions of videos is passed through two networks, and far fewer videos come out of each stage than went in. In this way, a sparse distribution of items is ensured without the scalability issues that come with it.
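To make the funnel concrete, here is a minimal Python sketch of the two-stage idea. It is only an illustration under my own assumptions: the function name `recommend`, the toy corpus, and the two stand-in stages are hypothetical and are not taken from YouTube’s system.

```python
from typing import Callable, List


def recommend(
    corpus: List[str],
    generate_candidates: Callable[[List[str]], List[str]],
    rank: Callable[[List[str]], List[str]],
    top_n: int = 3,
) -> List[str]:
    """Two-stage funnel: narrow the corpus first, then order what is left."""
    candidates = generate_candidates(corpus)  # cheap, coarse pass over everything
    ranked = rank(candidates)                 # costlier scoring on the short list only
    return ranked[:top_n]


# Toy stand-ins (hypothetical): keep videos tagged "ml", then sort by title length.
corpus = [
    "ml:intro",
    "cooking:pasta",
    "ml:transformers-explained",
    "ml:gd",
    "travel:rome",
]
funnel_result = recommend(
    corpus,
    generate_candidates=lambda vids: [v for v in vids if v.startswith("ml:")],
    rank=lambda vids: sorted(vids, key=len),
    top_n=2,
)
print(funnel_result)
```

The point of the design is that the expensive ranking step never sees the full corpus, only the short list that survives the first stage.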

The Candidate Generation Network:

The first network that the video corpus passes through is the candidate generation stage. The algorithms behind this stage are meant to spot patterns between search queries and candidate videos. The candidate generation network takes the huge corpus containing billions of videos as input, takes the user’s activity log into account, and outputs a few hundred videos that are considered for recommendation to the user. The network aims for accuracy and relevance, even if it means filtering out videos that have more views but prove to be irrelevant.
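One common way to picture candidate generation is as a nearest-neighbour lookup: represent the user and every video as vectors, then keep the k videos whose vectors score highest against the user’s. The sketch below assumes exactly that and nothing more. The embeddings are made up, and averaging the watch history is my simplification; a production system would learn the user vector with a deep network.

```python
import heapq
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def generate_candidates(
    watch_history: List[str],
    video_embeddings: Dict[str, Vector],
    k: int,
) -> List[str]:
    """Return the k unseen videos whose embeddings best match the user."""
    # Simplifying assumption: the user vector is the mean of watched-video vectors.
    user_vector = average([video_embeddings[v] for v in watch_history])
    unseen = (v for v in video_embeddings if v not in watch_history)
    return heapq.nlargest(
        k, unseen, key=lambda v: dot(user_vector, video_embeddings[v])
    )


# Hypothetical 2-d embeddings: first axis is "music", second is "cars".
video_embeddings = {
    "piano_tutorial": [0.9, 0.1],
    "guitar_lesson": [0.8, 0.2],
    "jazz_concert": [0.7, 0.3],
    "car_review": [0.1, 0.9],
    "engine_teardown": [0.0, 1.0],
}
top_candidates = generate_candidates(
    ["piano_tutorial", "guitar_lesson"], video_embeddings, k=2
)
print(top_candidates)
```

Note that nothing here looks at view counts: a hugely popular car video still loses to a modest jazz concert for a user whose history is all music.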

The Ranking Network:

The second stage, which the items retrieved by the candidate generation network go through, is the ranking stage. In the ranking network, a comprehensive set of video features is taken into account and each video is given a score. It must be noted that user feedback is part of both networks, but it is a more important criterion for this one. The degree of relevance, on the other hand, carries less weight in this network than in the candidate generation network.

The user input in the ranking network basically comprises two behaviours:

- Users' engagement behaviour that is collected from clicks and watches

- Degree of satisfaction that is gathered by likes, dislikes, dismissals, etc.
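To see how these two kinds of signal could pull a ranking in different directions, here is a small sketch that blends predicted engagement and predicted satisfaction into one score. Everything in it is assumed for illustration: the `Predictions` fields, the weights, and the weighted-sum formula are hypothetical, and the actual way YouTube combines its objectives is not described in this post.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Predictions:
    p_click: float                 # engagement: chance the user clicks
    expected_watch_minutes: float  # engagement: predicted watch time
    p_like: float                  # satisfaction: chance of a like
    p_dismiss: float               # satisfaction: chance of a dismissal


# Hypothetical hand-picked weights; dismissals count against a video.
WEIGHTS: Dict[str, float] = {"click": 1.0, "watch": 0.1, "like": 2.0, "dismiss": -3.0}


def ranking_score(p: Predictions, w: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of engagement and satisfaction predictions."""
    return (
        w["click"] * p.p_click
        + w["watch"] * p.expected_watch_minutes
        + w["like"] * p.p_like
        + w["dismiss"] * p.p_dismiss
    )


candidates = {
    "clickbait_clip": Predictions(0.9, 1.0, 0.05, 0.30),
    "in_depth_tutorial": Predictions(0.5, 8.0, 0.30, 0.02),
}
ranked = sorted(candidates, key=lambda v: ranking_score(candidates[v]), reverse=True)
print(ranked)
```

With these made-up numbers the clip that gets more clicks still ranks lower, because its low watch time and high dismissal rate outweigh the click advantage.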

The ranking problem might look like a piece of cake, but with all of its objectives lined up, it becomes the epitome of confusion. The researchers chose a learning-to-rank framework to solve this problem. They modelled it as a combination of classification and regression problems with multiple objectives. Here, Multi-gate Mixture-of-Experts (MMoE) comes into play.

Neural networks generally use the Rectified Linear Unit (ReLU) as an activation function (responsible for transforming the summed weighted input of a node into the activation, or output, of that node). In the YouTube recommender system, however, the analysts and researchers substituted the Rectified Linear Unit layer with a Mixture-of-Experts layer and added a separate gating network.
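A forward pass through such a layer can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the production model: the class name `MMoELayer`, the sizes, and the random untrained weights are all mine, and each expert is assumed to be a single ReLU layer. The essential structure is that the experts are shared, while every task gets its own softmax gate that decides how to mix the expert outputs.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MMoELayer:
    """Shared experts plus one gate per task (random, untrained weights)."""

    def __init__(self, in_dim: int, expert_dim: int, n_experts: int,
                 n_tasks: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def rand_matrix(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

        self.expert_weights = [rand_matrix(expert_dim, in_dim) for _ in range(n_experts)]
        self.gate_weights = [rand_matrix(n_experts, in_dim) for _ in range(n_tasks)]

    def forward(self, x: Vector) -> Tuple[List[Vector], List[Vector]]:
        # Every expert sees the same input; each is a small ReLU layer here.
        expert_outputs = [
            [max(0.0, dot(row, x)) for row in weights]
            for weights in self.expert_weights
        ]
        task_inputs, mixtures = [], []
        for gate in self.gate_weights:
            mix = softmax([dot(row, x) for row in gate])  # one weight per expert
            combined = [
                sum(w * out[d] for w, out in zip(mix, expert_outputs))
                for d in range(len(expert_outputs[0]))
            ]
            task_inputs.append(combined)
            mixtures.append(mix)
        return task_inputs, mixtures


layer = MMoELayer(in_dim=4, expert_dim=3, n_experts=2, n_tasks=2)
task_inputs, mixtures = layer.forward([0.5, -1.0, 0.25, 2.0])
print(mixtures)
```

Each entry of `task_inputs` would then feed a task-specific head, for example one predicting clicks and another predicting likes. Those heads are left out to keep the sketch short.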

The recommender system as a whole is then trained left-to-right and tested on the collected data: the system is given the user context up to time t and is asked which videos the user would be interested in at time t+1. A true marvel of modern computing, devops engineering, and much more!
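The "context up to time t, prediction at time t+1" setup can be illustrated by turning a single watch history into training pairs. The helper below is a hypothetical sketch (the name `next_watch_examples` and the `max_context` parameter are mine); it only shows the data layout, not how YouTube actually builds its training set.

```python
from typing import List, Optional, Tuple


def next_watch_examples(
    history: List[str],
    max_context: Optional[int] = None,
) -> List[Tuple[List[str], str]]:
    """Pair everything watched before step t with the video watched at step t."""
    examples = []
    for t in range(1, len(history)):
        context = history[:t]  # only the past, never the future
        if max_context is not None:
            context = context[-max_context:]  # keep the most recent items
        examples.append((context, history[t]))
    return examples


history = ["intro_to_ml", "linear_regression", "neural_nets", "backprop"]
examples = next_watch_examples(history)
for context, label in examples:
    print(context, "->", label)
```

Because each label is always the next video in time, the model never gets to peek at future watches while it is being trained or evaluated.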

Wrapping Up

The YouTube recommendation engine has really redefined content recommendation. The research paper published by Google at the 10th ACM Conference on Recommender Systems in 2016 highlights the changes that should be made to ranking systems in order to handle multiple objectives. It should be read by every data science and machine learning student. The paper goes a long way towards addressing the issues of scalability and consistent improvement in ranking systems.

By open-sourcing TensorFlow, Google has provided a platform for innovation in recommender systems and in deep neural networks more broadly.