Machine Learning in Production

Written by cwharland | Published 2018/08/29
Tech Story Tags: machine-learning | ml-in-production | ml-product | build-ml | mls


Before you embark on building a product that uses machine learning, ask yourself: are you building a product around a model, or designing an experience that happens to use a model? While this might feel like a chicken-or-egg dichotomy, the answer can significantly change the viability of your product. In my experience, building a product around a model is often fraught with complications, reductions in scope or performance, and an end result that looks and feels very different from the original concept. When you instead design a product or experience that happens to achieve its effect by using a model, the outcome is routinely better across the board.

Any model you build will attempt to faithfully learn the patterns in the data you provide. That is why it is so hard to build a product around a model: the incentive to make the best model often does not lead to choices that benefit the people who consume the model’s output. Instead, keep your product goals in mind when crafting your training and test data sets. That way you can avoid the trap of chasing the best offline model accuracy.

For Textio, one example of good model-building incentives comes from our recruiting mail product. The models that power this writing experience aim to understand the context and content people are writing, as well as how the patterns in their writing relate to the chance that someone will respond to their mail. We don’t train these models on data from all kinds of email communication; instead, we focus on the domain we expect to match the writing experience we are trying to create. In this case, we curate mails from recruiters reaching out to a passive candidate with the goal of drawing them into the hiring process. We purposely set out with a narrowly defined goal.

Narrow goals are harder for a model to pick up, but the person using the product sees much more value. A model trained on a wider array of email types might achieve better response-rate prediction accuracy on paper (after all, spam is a well-studied problem), but it’s unlikely to help you improve passive candidate sourcing. Put another way, why would you build a model that learns every animal in the zoo when all you really want it to find and understand are elephants?

The key to a successful model is ensuring you have the type of data that is most relevant to your product and the people who use it. One way to gain this insight is to monitor the data people provide within your product, and you can do this before you even build the first model. For example, if we trained our recruiting mail models on data predominantly composed of short documents (100 words or less) but monitoring reveals that word counts in production are routinely above 400 words, that is a red flag! It’s unlikely that the predictions Textio is providing are still meaningful given how different those documents might be from the ones the model has seen before.
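As a rough sketch of what that kind of check might look like, here is a minimal Python example. The word counts, threshold, and function names are made up for illustration; they are not Textio’s actual monitoring pipeline, just the shape of the idea: measure the inputs your model receives, not only its outputs.

```python
import statistics

# Word counts seen during training (in the example above, mostly short
# documents of 100 words or less). These values are illustrative only.
TRAINING_WORD_COUNTS = [80, 95, 60, 110, 72, 90]

def word_count(text: str) -> int:
    return len(text.split())

def drift_alert(production_texts, ratio_threshold=2.0) -> bool:
    """Flag when production documents look very different from training data.

    A crude check: compare the median word count in production against the
    median seen in training. Real monitoring would compare full distributions,
    but the principle is the same.
    """
    train_median = statistics.median(TRAINING_WORD_COUNTS)
    prod_median = statistics.median([word_count(t) for t in production_texts])
    if prod_median > ratio_threshold * train_median:
        print(f"Red flag: production median {prod_median} words "
              f"vs training median {train_median} words")
        return True
    return False
```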

While this sounds clear in principle, it’s often hard to achieve in practice. It takes work to curate data sets that represent the inputs you want a model to learn. Even a narrowly focused model that solves a real business problem can, at times, receive data it has quite literally never seen before. Let’s say you have instrumentation in place and can measure the difference between your training data and the data your product sees day in and day out. What do you do if the difference is meaningful? For our recruiting mail product, we built a model that checks, as someone is typing, whether the content looks like a recruiting mail. If not, we don’t send that text to the scoring model. This is another narrowly defined goal that serves the product purpose of preventing our scoring models from providing guidance and feedback on a document far different from anything they have seen before. I call this approach layering; it results in a product with many predictive capabilities without the downside of wrapping everything up in one mega model.
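To make the layering idea concrete, here is a minimal sketch with hypothetical classes standing in for the real models (which this post does not describe in detail). The gate uses a toy keyword heuristic and the scorer returns a placeholder; the point is the structure, where a cheap in-domain check decides whether the scoring model is asked for feedback at all.

```python
class RecruitingMailGate:
    """Layer 1: does this text look like a recruiting mail at all?

    Stand-in for a lightweight classifier; here just a keyword heuristic.
    """
    KEYWORDS = {"role", "opportunity", "team", "experience", "hiring"}

    def looks_like_recruiting_mail(self, text: str) -> bool:
        words = set(text.lower().split())
        return len(words & self.KEYWORDS) >= 2


class ResponseScorer:
    """Layer 2: the scoring model, which only ever sees gated documents."""

    def score(self, text: str) -> float:
        return 0.5  # placeholder; the real predictive model would live here


def get_guidance(text, gate: RecruitingMailGate, scorer: ResponseScorer):
    # If the document is out of domain, return no score rather than a
    # misleading one -- the scorer has likely never seen anything like it.
    if not gate.looks_like_recruiting_mail(text):
        return None
    return scorer.score(text)
```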

Beyond monitoring your models in production, it is important to enable your product to collect its own labels. Often these labels capture the outcomes you care about most and that provide the best value proposition to people. When you bootstrap a data set for a new model, you often do creative things to collect labels: you might have access to information that is correlated with the real outcome you care about, you might crowdsource labels, or you may even start with simple heuristics and rules.
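A heuristic bootstrap label can be as simple as a single rule. The sketch below is hypothetical (the field names and the seven-day window are assumptions, not anyone’s actual pipeline): treat a reply received within a week as a positive outcome, as a proxy for the result you really care about.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class SentMail:
    sent_at: datetime
    first_reply_at: Optional[datetime]  # None if no reply was ever received

def bootstrap_label(mail: SentMail,
                    reply_window: timedelta = timedelta(days=7)) -> int:
    """Heuristic label: did the candidate reply within the window?

    A crude proxy for the outcome we actually want, but often good enough
    to get a first model off the ground.
    """
    if mail.first_reply_at is None:
        return 0
    return int(mail.first_reply_at - mail.sent_at <= reply_window)
```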

All of these are valid ways to get into the game, but the long-term data product needs a consistent source of quality labels. Indeed, all of the monitoring you put in place is only useful if you are collecting labels that allow you to change your model. As an example, when we inform someone that they do not appear to be typing a recruiting mail, we add a button to the bottom of the dialog allowing them to ignore the warning. This button is instrumented to collect the signal that our prediction might be wrong, and that label is then used to improve the model itself.
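Here is one way such instrumentation might record that signal. The event schema, field names, and file-based storage are assumptions for the sake of a self-contained example; the real logging surely looks different, but the essence is the same: when someone dismisses the warning, keep the document and a corrective label for the next training run.

```python
import json
from datetime import datetime, timezone

def record_ignored_warning(document_id: str, text: str,
                           path: str = "feedback.jsonl") -> None:
    """Log a label candidate when the 'ignore this warning' button is clicked.

    The dismissal suggests the gate model's 'not a recruiting mail' prediction
    was wrong, so the document is kept as a positive example for retraining.
    """
    event = {
        "document_id": document_id,
        "text": text,
        "label": "recruiting_mail",   # the user overrode the model's prediction
        "source": "ignore_warning_button",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```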

When I worked at a search engine, it dawned on me one day that the primary interaction with any model I built would be a blank text box on the internet. What you learn from this is that people can and will type literally anything you can imagine into that box. At Textio, our data products power an augmented writing experience that has opened my eyes to unimaginable diversity in language. If you design a product first and foremost, using machine learning to achieve specific objectives rather than building a model and wrapping a product around it, your customers and your future self will thank you kindly.
