Enhancing Netflix's Deep Personalization: The Full Potential Of Its Current AI Recommender Systems

Ugh! To think that this article started from a rant about my K-drama search results. SMH.

I’m thrilled that it did because it led me down the rabbit hole of Algorithms, Deep Personalization, Search optimization, and…wait for it…Netflix’s Research Library! That site is to my brain what a pâtisserie is to my eyeballs and belly. I digress, but seriously, if you haven’t checked it out, you should.

That curiosity is why I decided to write this article on optimizing Netflix's ML ranker outcomes for a more efficient and accurate deep personalization of subscribers’ long- and short-term viewing preferences, one that addresses both Fetch and Explore intents. I found writing it very interesting because it takes the business and customer angles, as well as the infrastructure, into consideration.

Context

Netflix seems to focus more on home-page recommendations that enhance users’ pre-query experience and less on post-search experience. To some extent, this is great because if it were to be the other way around from the beginning, we’d all have dropped Netflix years ago. After all, the user experience (UX) would have been a hot mess!

According to Das et al. (2022) in their paper on query-facet mapping and its applications, even though most of the content discovery happens on the home page, a large amount still comes from Search. So, I can’t help but wonder, if this is the case and Netflix has solidified its UX dominance amongst its competitors, how come members still don’t have some control over their experience, especially during Search? As a Product Manager, I understand focussing on features that affect more users, but sometimes, paying attention to the seemingly insignificant 20% that don’t seem like urgent user needs could lead to 80% of our desired product outcomes.

There’s been some recent buzz about Netflix’s customer satisfaction ratings in the US/Canada region, and the data suggests noticeable churn there despite Netflix’s growing presence in other regions. That said, the region was the product’s primary market, so it is also its most mature one. It is therefore highly likely that most other markets will follow the same trajectory in the mid to long term as they mature and other market players arise.

The way I see it, as Alex Ratner stated in a tweet, “…Real value from AI always hinges on solving the hard 20%- and this almost always requires an entirely different set of approaches.” Thus, Netflix must start to look into the features that they don’t focus on much; based on the evidence, that’d be query-related recommendations and Search. What if the “different set of approaches” is simple?

Let’s be real: Netflix’s recommender systems are SUPREME (All Hail!), and the recommendations on the homepage are great a lot of the time, but it becomes more stressful when I need to find films within a genre or category that falls within Netflix’s “movie genome” (“This show is …” e.g., Swoonworthy) because the recommendations on the homepage are finite. From my research, which included an admittedly ‘leading’ and informal Twitter poll with a small sample size, interview sessions, and scouring tons of forums on the internet for reasons why people don’t like Netflix, recommendations and search relevancy seem to be recurring customer issues. These hangups don’t stop the users from clocking their “Hours spent watching…” metric because the recommendations are decent enough, but from the responses I got, it’s almost like a nagging but dull pain point that subscribers have. They know it can be better, but it works fine to accomplish their primary goal, and they’ve used Netflix for so long that it’s a pain that they have learned to deal with.

As a subscriber, what do I want? I want to be able to find more movies that meet my specific taste so that my recommendations are more accurately tailored to my preferences.

What does Netflix want? According to their Research website, they want me, the subscriber, to spend less time searching and more time watching what I like.

So, what if I could input search components once in a blue moon, with filters so detailed that the queries fine-tune the recommender systems to provide better in-session pre-query recommendations during Explore and Fetch user intent actions, such as idly scrolling/hovering through recommendation lists and typing to search, respectively?

For instance, this has happened to me more times than I can count, and apparently many others experience the same:

Scenario: Uchenna prefers to watch Korean TV shows in the romance genre that are swoonworthy, comedic, and heartfelt. That doesn’t mean she wouldn’t watch a Chinese TV show in the romance genre with genome tags for ‘Bittersweet, Heartfelt and Emotional’ if Netflix recommended it. She most likely would; otherwise, she would have to scroll for several minutes looking for something that meets her current mood and preference.

Netflix’s Problems

In my humble opinion, these are Netflix’s problems:

Firstly, some of its core metrics are extrinsically motivated, quite presumptive, and based on the premise that success is in the completion of the main goal, which is watching something on the platform in video format. In a perfect product world with business KPIs at the forefront of decision-making, yes, this is right, but in a world with flawed humans with varying moods, preferences, and intents, this is a flawed and unempathetic approach. It does not take into account that I might enjoy watching something that was recommended to me even though it isn’t what I would usually go for or what my mood calls for at the point of selection.

Because Netflix is in the growth stage of the product lifecycle, it seems the current metrics are based on customer acquisition and retention without much focus on customer satisfaction. The bottom line is that I’m generally happy, but my need at that moment hasn’t been met. In addition, when I have to find films with labels that are close to what I have in mind, it takes longer to find something, which may explain why most people are likely to watch something that a friend recommended instead of searching or always following the recommendations.

Secondly, another recurring complaint on the internet pertains to the tags. The genre tags are inconsistent and do not follow regular genres, and the genome tags and categories seem inconsistent and disorganized as well. With inconsistent and scattered labeling of video content, the recommender systems become less likely to produce pre-query and post-query recommendations that closely match the ‘previously watched’ content or to accurately predict the subscriber’s taste.

Thus, Netflix has a great recommender systems infrastructure, but the labels aren’t organized well enough to help the customer decide or to help the systems make fully optimized anticipatory decisions. For instance, last night, I started watching a TV show with genome tags ‘Swoonworthy, comedy, quirky,’ and there’s absolutely nothing comedic about this show. It’s quirky and romantic but definitely not a comedy: wrong labeling.

Lastly, e-commerce, content-sharing, and video tech companies are beginning to optimize their Search functions despite having great algorithms that predict content based on user preferences, but Netflix’s search function barely has any filters.

I found lots of comments online during my research, but one of them seems to encapsulate everything I know is wrong with Netflix.

Recommendations

Option 1: Optimization of Search Filters and Creation of ‘Saved Search’ Labels

Applying optimized search filters and member pre-selected ‘saved search’ results could help improve Netflix’s recommender systems’ outcomes for pre-query in-session and long-term recommendations. A saved search establishes a baseline for behavior and user interest. User interactions with the baseline recommendation results can then help predict user preferences within the confines of that saved search, e.g., a saved search named Quirky K-Drama can show more related results as the user interacts with the recommendation results (‘Because you saved Quirky K-Drama’). Below is a design of the Advanced Search/Saved Search feature recommendation.
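Separately from the visual design, here is a rough Python sketch of how the ranking side of a saved search might behave. Everything in it is my own assumption for illustration: the `SavedSearch` class, the tag names, the Jaccard-overlap baseline, and the simple additive boost are hypothetical and are not how Netflix's rankers actually work.

```python
from dataclasses import dataclass, field


@dataclass
class SavedSearch:
    """A member-defined filter set, e.g. 'Quirky K-Drama' (hypothetical schema)."""
    name: str
    tags: frozenset
    # Per-tag boosts learned from interactions inside this saved search.
    tag_weights: dict = field(default_factory=dict)

    def record_watch(self, title_tags, step=0.5):
        """Nudge the weights toward the tags of a title the member watched."""
        for tag in title_tags:
            self.tag_weights[tag] = self.tag_weights.get(tag, 0.0) + step

    def score(self, title_tags):
        """Baseline Jaccard overlap with the saved filters, plus learned boosts."""
        title_tags = frozenset(title_tags)
        union = self.tags | title_tags
        baseline = len(self.tags & title_tags) / len(union) if union else 0.0
        boost = sum(self.tag_weights.get(tag, 0.0) for tag in title_tags)
        return baseline + boost


def rank(saved, catalog):
    """Return title names ordered by descending score for one saved search."""
    return sorted(catalog, key=lambda name: saved.score(catalog[name]), reverse=True)


# Made-up catalog with made-up tags.
catalog = {
    "Show A": {"korean", "romance", "quirky"},
    "Show B": {"korean", "romance", "heartfelt"},
    "Show C": {"chinese", "romance", "bittersweet"},
}
saved = SavedSearch("Quirky K-Drama", frozenset({"korean", "romance", "quirky"}))

print(rank(saved, catalog))          # ['Show A', 'Show B', 'Show C']
saved.record_watch(catalog["Show C"])  # the member watches the Chinese romance
print(rank(saved, catalog))          # ['Show C', 'Show A', 'Show B']
```

The saved filters set the starting order, and each interaction shifts the ranking within that saved search, which is the ‘Because you saved Quirky K-Drama’ behavior described above.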

Option 2: Creating Expandable Pre-Query Recommendations Carousel Lists

‘Because you watched…’ and ‘Because you liked…’ rows/recommendations are finite, which puts a limit on the subscriber’s Explore intent. Currently, the recommender system provides a carousel row for each and then more carousel rows for movies with similar genome tags to what you’ve watched in the past. This creates quite a lot of clutter and diminishes the user’s experience because they have to scroll down to view films within each genome tag. Instead of creating too many lists/carousels that make the user continuously scroll downwards and sideways to find videos within mislabeled genome tags, Netflix can optimize the ‘Because you liked…’ and ‘Because you watched…’ carousels by making them expandable. This way, there’s not much of a break in the user’s exploratory actions, thus creating a better experience for them.

Option 3: Creating Genome Tags Ratings to Improve Labelling and Algorithmic Outcomes

I’m thinking out loud (on paper), but just keep up with me, and hopefully, this written soliloquy will be coherent enough for you to understand the point that I’m trying to make.

There’s no way that humans can possibly watch every video in the Netflix library to judge the genome tags and relabel them correctly. Besides, if this were possible, the labels might even be more disorganized because the perception of a video is, to some extent, subjective. Bringing that bias into the mix might make the data more inaccurate. But could there be a way for subscribers to rate a video’s genome tag accuracy? If this happens, the recommender systems can use the data to properly label the video content.

For instance, Grammarly does something similar with its text tone detector ratings. I believe these ratings feed into their algorithm that helps predict and detect the tone of text.

The drawback to this is that Netflix might have to introduce a more detailed rating system for videos, which is a feature that was discontinued many years ago. However, if there’s a way to introduce the genome tag rating system even for three months, I believe that predictions will be better. Open Beta testing provides the perfect environment and timeframe for Netflix to allow subscribers to help train the model for more accurate predictions.
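To make the tag-rating idea a bit more concrete, here is a small Python sketch of how member votes on a tag could be turned into a keep/drop decision. The vote counts, the function names, the smoothing prior, and the 0.5 threshold are all assumptions I made up for illustration; I have no knowledge of how Netflix would actually aggregate such ratings.

```python
def tag_accuracy(yes_votes, no_votes, prior_yes=2, prior_no=2):
    """Smoothed share of members who say the tag fits the title.

    The prior acts like a few imaginary votes, so a tag with only a
    handful of ratings is not kept or dropped on thin evidence.
    """
    total = yes_votes + no_votes + prior_yes + prior_no
    return (yes_votes + prior_yes) / total


def relabel(votes, keep_threshold=0.5):
    """Keep only the tags whose smoothed accuracy clears the threshold."""
    kept = {}
    for tag, (yes, no) in votes.items():
        score = tag_accuracy(yes, no)
        if score >= keep_threshold:
            kept[tag] = round(score, 2)
    return kept


# Hypothetical (yes, no) votes for the show that was tagged as a comedy.
votes = {"swoonworthy": (40, 5), "quirky": (30, 10), "comedy": (4, 36)}
print(relabel(votes))  # {'swoonworthy': 0.86, 'quirky': 0.73}
```

With these made-up numbers, the ‘comedy’ tag falls out while ‘swoonworthy’ and ‘quirky’ stay, which is the correction I wanted for the mislabeled show mentioned earlier.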

Potential Limitations in Implementing My Recommendations

Cost-Benefit Dilemma: It is vital to figure out whether the cost of implementing these ‘delightful’ feature recommendations can justify the required investment because, at the end of the day, Netflix is a business, and cost-benefit analysis is a given. There was not much Netflix-specific data to analyze; thus, the cost-benefit analysis outcome is unknown, and the feasibility of implementing these suggestions cannot be determined with any certainty.
Risk of Sampling Error: Qualitative research for this article was done using a random sampling approach on a very small cumulative sample size of about 30 people. Other feedback gathered from different online forums might also be indicative of a recurring customer pain point. However, Netflix has over 200 million subscribers; thus, even though these may seem like widespread customer complaints, they might not be representative of the broader subscriber population, as a small sample size comes with a higher probability of sampling error (Phew! What a long sentence). Anyway, only Netflix’s data can truly confirm the validity and importance of these issues to their users. Regardless, hypothesis testing and further research would be required to know if these recommendations are worth implementing. Without testing, implementing these would be a shot in the dark.
Potentially Erroneous Assumption of Priority: These suggestions presume that the features haven’t already come up on the Netflix Product team’s backlog or brainstorming board. It is entirely possible that they have, and that they either haven’t been prioritized or have been canned because they aren’t feasible.
Reality-Hypothesis Scenario Divergence: Recommendation Option 3 assumes that genome tags are determined by Netflix’s AI. However, there's a possibility that, in reality, genome tags are added manually during the uploading process (think of how labels and categories are added to products in a POS system). If that's the case, then we are back to square one: human error due to subjectivity, which might account for the discrepancies. On second thought, despite this change in scenario, I believe that the validity and viability of the genome tags ratings recommendation (Option 3) could very well still stand. Just think about it for a second. Unfortunately, unless you work at Netflix, this information is unknown, so let’s just stick with our imagination and inferences, okay? :)
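
To put a rough number on the sampling-error caveat above, here is a short Python sketch using the standard margin-of-error formula for a proportion at 95% confidence. It assumes a simple random sample, which an informal Twitter poll and a handful of interviews almost certainly are not, so treat the result as a best case. The function names are my own.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Margin of error for an observed proportion p from n responses (z=1.96 for 95%)."""
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(margin, p=0.5, z=1.96):
    """Responses needed to reach a target margin of error (worst case at p=0.5)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# About 30 respondents, worst-case split of 50/50.
print(round(margin_of_error(0.5, 30), 3))  # 0.179
print(required_sample_size(0.05))          # 385
```

In other words, with about 30 people, any percentage I report could be off by roughly 18 percentage points either way, and something like 385 randomly chosen respondents would be needed to bring that down to 5 points.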

Conclusion

Netflix arguably has the best Recommender systems in the game; its subscriber growth rate is increasing at a healthy pace due to expansion strategies, and its UX is probably the best in the competitive landscape as well. I envisage that they will continue to hold on to the majority of the market share for at least the next five years.

However, what happens if they eventually manage to capture the streaming markets in most target countries? What happens when they can't compete on price with local and international competitors? What happens when, like the US, most of these markets mature, and customers in these countries begin to look for better customer experiences with their video streaming? What happens when innovation in other streaming service companies, like Hulu, AppleTV, and HBO, outpaces Netflix's holy grail recommender systems?

Ultimately, in the words of Steven Van Belleghem, “What if customers want more than excellent service?”

I'm sure the amazingly talented people of Netflix are already thinking about these potential issues. These are long-term questions, but I believe that in the meantime, it's time to tweak Netflix's recommender systems to harness its optimum potential for this much-loved service to provide much better customer satisfaction. Customer experience (CX) is currently Netflix’s strength; however, customer satisfaction is slipping. A great product can have amazing CX, but if the customers want more than just excellent service to stay satisfied, more will need to be done. This is where these recommendations come in.

According to Former VP of Product at Netflix, Gibson Biddle, the Netflix product team is not just focused on understanding the customer’s current wants and needs but instead on ‘inventing and delivering on unanticipated future needs.’ I also came across an article in Harvard Business Review titled ‘Stop Trying to Delight Your Customers’, and one statement caught my attention, “According to conventional wisdom, customers are more loyal to firms that go above and beyond. But our research shows that exceeding their expectations during service interactions makes customers only marginally more loyal than simply meeting their needs.”

Therefore, typical metrics for customer satisfaction, such as NPS, might not capture the sentiments observed during my research. Focus groups and interviews are the best research methods for figuring out such “nagging-but-dull” pain points that customers have, as they don't cause churn per se. However, like a toothache, over time, the dull pain points eventually might be the reason for customers to migrate to a different service provider.

This is why I strongly agree with Dixon et al. (2010), whose HBR article argues that Customer Satisfaction and Customer Retention/Loyalty are more likely to be captured using Customer Effort Score (CES) than Net Promoter Score or CSAT. That’s because you learn more when asking the main question in CES: “How much effort did it take the customer to achieve their desired goal?” I believe that if Netflix asks this question often, they will find the reason why customer experience (CX) is great but customer satisfaction (CS) and loyalty (especially in the US) are disproportionately lower, and the results would help deepen personalization efforts on the platform.

In the context of Netflix, Deep personalization stands at the intersection of Customer Satisfaction, Customer Experience, Customer Loyalty, and an optimized AI system. Supercharge Deep Personalization with features based on insights from research methods such as CES, and the ML ranker becomes more efficient in determining user preferences. The hypotheses in this article address business problems and user problems and encourage revisiting the current product metrics at Netflix. It’s a win-win for all, right? Well, it’s up to you to decide.

In a nutshell, there’s still a lot that the Netflix ML/AI systems are capable of. There’s still a lot that Netflix can do better to acquire and retain subscribers if they ask the right questions and implement features that reduce the users’ effort, create a better experience, and enhance the efficiency of the recommender systems. Thus far, they have done a great job, but it can certainly be better. The question is whether the cost of enhancement is commensurate with the benefit of a more efficient platform.

What are your thoughts on this subject? Do the pain points resonate with you?
