A Guide to Scaling Machine Learning Models in Production

Written by harkous | Published 2020/02/10
Tech Story Tags: machine-learning | coding | deep-learning | python | software-development | hackernoon-top-story | scaling-machine-learning | machine-learning-production

TLDR: Putting a machine learning model into production is not always needed. But nowadays, many researchers/engineers find themselves responsible for the complete flow, from conceiving the models to serving them to the outside world. This guide walks through that flow in the context of Python-based frameworks on Linux servers. We only consider the case of serving models over CPU, rather than GPUs. Most of the components used can be easily replaced by equivalent components with little to no change in the rest of the steps.

The workflow for building machine learning models often ends at the evaluation stage: you have achieved an acceptable accuracy, and “ta-da! Mission Accomplished.”
Beyond that, it might just be sufficient to get those nice-looking graphs for your paper or for your internal documentation.
In fact, going the extra mile to put your model into production is not always needed. And even when it is, the task is often delegated to a system administrator.
However, nowadays, many researchers/engineers find themselves responsible for handling the complete flow from conceiving the models to serving them to the outside world. Whether it is a university project or a personal experiment, demonstrating our work is generally a great way to get a wider audience interested. Few people would make the extra effort to use a system for which the value is not instantly perceived.
In this article, we’ll go together through this workflow; a process that I had to repeatedly do myself. The assumption is that you have already built a machine learning or deep learning model, using your favorite framework (scikit-learn, Keras, Tensorflow, PyTorch, etc.). Now you want to serve it to the world at scale via an API.
By “at scale”, we’re not talking about the industrial scale of a huge company. The goal is to make the best of that server with lots of CPUs and plenty of RAM sitting idly at your institution or in the cloud. This entails serving multiple requests at a time, spawning new processes as the load increases, and reducing the number of processes as the load decreases. You also want the additional guarantee that your server will be restarted after unexpected system failures.
If that is what you have in mind, let’s go through it together.
We’ll be considering the context of Python-based frameworks on Linux servers. Our setup will involve:
Anaconda: for managing package installation and creating an isolated Python 3 environment.
Keras: a high-level neural networks API, that is capable of running on top of TensorFlow, CNTK, or Theano.
Flask: a minimalistic Python framework for building RESTful APIs. Despite being easy to use, Flask’s built-in server serves only one request at a time by default; hence, it is not suitable on its own for deployment in production.
nginx: a highly stable web server, which provides benefits such as load balancing, SSL configuration, etc.
uWSGI: a highly configurable WSGI server (Web Server Gateway Interface) that allows forking multiple workers to serve multiple requests at a time.
systemd: an init system used in multiple Linux distributions to manage system processes after booting.
Nginx will be our interface to the internet, and it will be the one handling clients’ requests. Nginx has native support for the binary uWSGI protocol, and the two communicate via Unix sockets. In turn, the uWSGI server will invoke a callable object within our Flask application directly. That is how requests will be served.
A few notes at the beginning of this tutorial:
- Most components above can be easily replaced by equivalent components with little to no change in the rest of the steps. For example, Keras can be easily replaced with PyTorch, Flask can be easily replaced with Bottle, and so on.
- We will only consider the case of serving models over CPU. The typical case is having access to a server with a lot of CPU cores and trying to make use of these cores to serve the models. GPUs, on the other hand, are more expensive to get in large numbers. Moreover, depending on your application, the speed gain you get from using a GPU at prediction time might not be that significant (especially in NLP applications).
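As a side note on the CPU-only setup: a minimal sketch of how to force CPU-only execution from within Python (equivalent to exporting CUDA_VISIBLE_DEVICES=-1 in the shell, which we also do later when training) looks as follows; it assumes the environment variable is set before the framework is imported:

import os

# Hide all GPUs from the deep learning framework; this must happen
# before the framework itself is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import keras  # imported only after the environment variable is set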

Setting up the Environment

To begin with, we need to install the systemd and nginx packages:
sudo apt-get install systemd nginx
Next, we have to install Anaconda by following the instructions on the official site, which consist of downloading the executable, running it, and adding Anaconda to your system’s PATH. Below, we will assume that Anaconda is installed under the home directory.
All the code and configuration files in this article are available from the accompanying Github repository. But make sure you follow the steps below to get the full workflow.
Next, let’s create the isolated Anaconda environment from the environment.yml file. Here is what this file looks like (it already contains several of the frameworks we’ll be using):
name: production_ml_env
channels:
  - conda-forge
dependencies:
  - python=3.6
  - keras
  - flask
  - uwsgi
  - numpy
  - pip
  - pip:
      - uwsgitop
We run the following to create the environment:
conda env create --file environment.yml
When we want to activate this environment, we run:
source activate production_ml_env
By now, we have Keras installed, alongside flask, uwsgi, uwsgitop, etc. So we are ready to get started.
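As a quick (optional) sanity check that the environment resolves the right packages, you can run a tiny script like the following inside the activated environment; the file name check_env.py is just an example:

# check_env.py: verify that the main packages import correctly
import flask
import keras

print('Flask version:', flask.__version__)
print('Keras version:', keras.__version__)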

Building the Flask Web App

For the purposes of this tutorial, we will not dive deeply into how to build your ML model. Instead, we will adapt the example of topic classification using the Reuters newswire dataset bundled within Keras. This is the code for building the classifier:
'''Trains and evaluates a simple MLP
on the Reuters newswire topic classification task.
'''
from __future__ import print_function
import os
import numpy as np
import keras
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint

MODEL_DIR = './models'

max_words = 1000
batch_size = 32
epochs = 5

print('Loading data...')
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words,
                                                         test_split=0.2)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

num_classes = np.max(y_train) + 1
print(num_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Convert class vector to binary class matrix '
      '(for use with categorical_crossentropy)')
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

print('Building model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

if not os.path.exists(MODEL_DIR):
    os.makedirs(MODEL_DIR)

mcp = ModelCheckpoint(os.path.join(MODEL_DIR, 'reuters_model.hdf5'), monitor="val_acc",
                      save_best_only=True)

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_split=0.1,
                    callbacks=[mcp])

score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
To replicate the setup we use here, simply run the following to train a model without a GPU:
export CUDA_VISIBLE_DEVICES=-1
KERAS_BACKEND=theano python build_classifier.py
This will create a model file reuters_model.hdf5 in the folder models. Now we are ready to serve the model via Flask on port 4444. In the code below, we provide a single REST endpoint /predict that supports GET requests, where the text to classify is provided as a parameter. The returned JSON is of the form {"prediction": "N"}, where N is an integer representing the predicted class.
from flask import Flask
from flask import request
from keras.models import load_model
from keras.datasets import reuters
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from flask import jsonify
import os

MODEL_DIR = './models'

max_words = 1000

app = Flask(__name__)

print("Loading model")
model = load_model(os.path.join(MODEL_DIR, 'reuters_model.hdf5'))
# we need the word index to map words to indices
word_index = reuters.get_word_index()
tokenizer = Tokenizer(num_words=max_words)


def preprocess_text(text):
    word_sequence = text_to_word_sequence(text)
    indices_sequence = [[word_index[word] if word in word_index else 0 for word in word_sequence]]
    x = tokenizer.sequences_to_matrix(indices_sequence, mode='binary')
    return x


@app.route('/predict', methods=['GET'])
def predict():
    try:
        text = request.args.get('text')
        x = preprocess_text(text)
        y = model.predict(x)
        predicted_class = y[0].argmax(axis=-1)
        print(predicted_class)
        return jsonify({'prediction': str(predicted_class)})
    except Exception:
        response = jsonify({'error': 'problem predicting'})
        response.status_code = 400
        return response


if __name__ == "__main__":
    app.run(host='0.0.0.0', port=4444)
To start the Flask application server, we run:
python app.py
Voila! Now we have the simple, lightweight server running.
You can test the server with your favorite REST client (e.g., Postman) or by simply going to this URL in your web browser (replace your_server_url by your server’s URL): http://your_server_url:4444/predict?text=this is a news sample text about sports and football in specific
You should get back a response as
{
  "prediction": "11"
}
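If you prefer a scripted client, here is a minimal sketch using the requests library (not included in the environment above, so you would need to install it, e.g., with pip; the file name client.py is just an example):

# client.py: minimal client for the /predict endpoint
import requests

SERVER_URL = 'http://your_server_url:4444/predict'  # replace with your server's URL

params = {'text': 'this is a news sample text about sports and football in specific'}
response = requests.get(SERVER_URL, params=params, timeout=10)
response.raise_for_status()
print(response.json())  # e.g., {'prediction': '11'}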

Configuring the uWSGI Server

Now, we are off to scaling our simple application server. uWSGI will be the key here. It communicates with our Flask application by invoking the callable object app in the file app.py. uWSGI includes most of the parallelization features we are after. Its configuration file, uwsgi.ini, looks as follows:
[uwsgi]
# placeholders that you have to change
my_app_folder = /home/harkous/Development/production_ml
my_user = harkous

socket = %(my_app_folder)/production_ml.sock
chdir = %(my_app_folder)
file = app.py
callable = app

# environment variables
env = CUDA_VISIBLE_DEVICES=-1
env = KERAS_BACKEND=theano
env = PYTHONPATH=%(my_app_folder):$PYTHONPATH

master = true
processes = 5
# allows nginx (and all users) to read and write on this socket
chmod-socket = 666
# remove the socket when the process stops
vacuum = true

# loads your application one time per worker
# will very probably consume more memory,
# but will run in a more consistent and clean environment.
lazy-apps = true

uid = %(my_user)
gid = %(my_user)

# uWSGI will kill the process instead of reloading it
die-on-term = true
# socket file for getting stats about the workers
stats = %(my_app_folder)/stats.production_ml.sock

# Scaling the server with the Cheaper subsystem

# set cheaper algorithm to use, if not set default will be used
cheaper-algo = spare
# minimum number of workers to keep at all times
cheaper = 5
# number of workers to spawn at startup
cheaper-initial = 5
# maximum number of workers that can be spawned
workers = 50
# how many workers should be spawned at a time
cheaper-step = 3
On your side, you have to modify the option my_app_folder to be the folder of your own app directory and the option my_user to be your own username. Depending on your needs and file locations, you might need to modify/add other options too.
One important section in the uwsgi.ini is the part where we use the Cheaper subsystem in uWSGI, which allows us to run multiple workers in parallel to serve multiple concurrent requests. This is one of the cool features of uWSGI, where dynamically scaling up and down is attainable with a few parameters. With the above configuration, we will have at least 5 workers at all times. Upon load increase, Cheaper will allocate 3 additional workers at a time until all the requests find available workers. The maximum number of workers above is set to 50.
In your case, the best configuration options depend on the number of cores in the server, the total memory available, and the memory consumption of your application. Take a look at the official docs for advanced deployment options.
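As a rough, assumption-laden starting point (these are heuristics, not uWSGI rules), you can derive initial values for the scaling options from the number of cores and then adjust them after profiling memory usage and load; a small sketch:

# suggest_workers.py: rough starting values for the uwsgi.ini scaling options
import multiprocessing

cores = multiprocessing.cpu_count()

# Heuristic assumptions: keep a small always-on pool, grow in small steps,
# and cap the total near the core count since model inference is CPU-bound.
cheaper = max(2, cores // 4)        # minimum number of workers to keep at all times
cheaper_initial = cheaper           # number of workers to spawn at startup
cheaper_step = max(1, cores // 8)   # how many workers to add at a time
workers = cores                     # maximum number of workers

print('cheaper =', cheaper)
print('cheaper-initial =', cheaper_initial)
print('cheaper-step =', cheaper_step)
print('workers =', workers)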

Connecting uWSGI with nginx

We’re almost there. If we start uWSGI now (which we’ll do in a while), it will take care of invoking the app from the file app.py, and we will benefit from all the scaling features it provides. However, in order to get REST requests from the internet and to pass them to the Flask app via uWSGI, we will be configuring nginx.
Here is a barebone configuration file for nginx, with just the part we depend on for this application. Of course, nginx can be additionally used for configuring SSL or to serve static files, but that is out of the scope of this article.
server {
    listen 4444;
    # change this to your server name or IP
    server_name YOUR_SERVER_NAME_OR_IP;

    location / {
        include         uwsgi_params;
        # change this to the location of the uWSGI socket file (set in uwsgi.ini)
        uwsgi_pass      unix:/home/harkous/Development/production_ml/production_ml.sock;
    }
}
We place this file in /etc/nginx/sites-available/nginx_production_ml (you will need sudo access for that). Then, to enable this nginx configuration, we link it to the sites-enabled directory:
sudo ln -s /etc/nginx/sites-available/nginx_production_ml /etc/nginx/sites-enabled
We restart nginx:
sudo service nginx restart

Configuring the systemd Service

Finally, we will launch the uWSGI server we configured earlier. However, to ensure that our server does not stay down after system restarts or unexpected failures, we will launch it as a systemd service. Here is our service configuration file, which we place in the /etc/systemd/system directory using:
sudo vi /etc/systemd/system/production_ml.service
[Unit]
Description=uWSGI instance to serve production_ml service

[Service]
User=harkous
Group=harkous
WorkingDirectory=/home/harkous/Development/production_ml/
ExecStart=/home/harkous/anaconda3/envs/production_ml_env/bin/uwsgi --ini /home/harkous/Development/production_ml/uwsgi.ini
Restart=on-failure

[Install]
WantedBy=multi-user.target
We start the service with:
sudo systemctl start production_ml.service
To allow this service to start when the machine is rebooted:
sudo systemctl enable production_ml.service
At this stage, our service should start successfully. If we update the service later, we simply have to restart it:
sudo systemctl restart production_ml.service

Monitoring the Service

To keep an eye on the service and see the load per worker, we can use uwsgitop. In uwsgi.ini, we have already configured a stats socket within our application folder. To see the stats, execute the following command in that folder:
uwsgitop stats.production_ml.sock
Under load, uwsgitop shows the workers in action, with additional workers being spawned beyond the initial ones. To simulate such a heavy load on your side, even with simple tasks, you can artificially add a time.sleep(3) in the prediction code.
One way to send concurrent requests to your server is using curl, as in the bash script below (remember to replace YOUR_SERVER_NAME_OR_IP with your server’s URL or IP address); a Python alternative follows it.
#!/usr/bin/env bash
url="http://YOUR_SERVER_NAME_OR_IP:4444/predict?text=this%20is%20a%20news%20sample%20text%20about%20sports,%20and%20football%20in%20specific" # add more URLs here

for i in {0..10}
do
   # run the curl job in the background so we can start another job
   # and disable the progress bar (-s)
   echo "fetching $url"
   curl $url -s &
done
wait #wait for all background jobs to terminate
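If you prefer to generate the load from Python instead, here is a minimal equivalent sketch; it assumes the requests library is installed, and the file name load_test.py is just an example:

# load_test.py: send a batch of concurrent requests to the /predict endpoint
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = 'http://YOUR_SERVER_NAME_OR_IP:4444/predict'  # replace with your server
PARAMS = {'text': 'this is a news sample text about sports, and football in specific'}


def fetch(i):
    # Each call blocks in its own thread, so all requests hit the server concurrently.
    response = requests.get(BASE_URL, params=PARAMS, timeout=30)
    return i, response.status_code


with ThreadPoolExecutor(max_workers=10) as pool:
    for i, status in pool.map(fetch, range(10)):
        print('request', i, '-> HTTP', status)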
In order to monitor the log of the application itself, we can make use of journalctl:
sudo journalctl -u production_ml.service -f
The output shows the application logs streaming in real time as requests are served.

Final Notes

If you have reached this stage, and your application has run successfully, then this article has served its purpose. A few additional notes are worth mentioning at this stage:
- lazy-apps mode in uWSGI: To keep this article general enough, we have used the lazy-apps mode in uWSGI, which loads the application once per worker. According to the docs, this requires O(n) time to load (where n is the number of workers). It also likely requires more memory, but it results in a clean environment per worker. By default, uWSGI loads the application differently: it starts with one process and then forks itself multiple times to create the additional workers, which saves memory. However, this does not play well with all ML frameworks. For example, the TensorFlow backend in Keras fails without the lazy-apps mode, as several reported issues show. The best approach might be to try first without lazy-apps = true and to switch to it if you encounter similar issues.
- Parameters of the Flask app: Because uWSGI invokes app as a callable, the parameters of the app itself should not be passed via the command line. You’re better off reading such parameters from a configuration file, with the likes of configparser (see the sketch after this list).
- Scaling across multiple servers: The guide above does not discuss the case of multiple servers. Luckily, this can be achieved without a significant change in our setup. Benefiting from the load-balancing feature in nginx, you can set up multiple machines, each with the uWSGI setup we described above, and then configure nginx to route requests to the different servers. nginx comes with multiple methods for distributing the load, ranging from a simple round-robin to accounting for the number of connections or the average latency.
- Port selection: The above guide uses port 4444 for illustration purposes. Feel free to adapt to your own ports. And make sure that you open these ports in the firewall or ask your institution’s administrators to do so.
- Socket permissions: We have been permissive in the socket permissions by giving write access to all users. Feel free to also adjust these permissions to your purposes and to run the service with a different user/group. Make sure that your nginx and uWSGI can still talk to each other successfully after your changes.
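To illustrate the configuration-file approach mentioned above, here is a minimal, hypothetical sketch with configparser; the file name app_config.ini and the keys are made up for the example and should be adapted to your application:

# config.py: read application parameters from a file instead of the command line
import configparser

config = configparser.ConfigParser()
config.read('app_config.ini')  # hypothetical file placed next to app.py

MODEL_DIR = config.get('model', 'model_dir', fallback='./models')
MAX_WORDS = config.getint('model', 'max_words', fallback=1000)

# app_config.ini would then contain, for example:
# [model]
# model_dir = ./models
# max_words = 1000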
So that’s it! I hope this guide and the associated repository will be helpful for all those trying to deploy their models into production as part of a web application or as an API. If you have any feedback, feel free to drop a comment below.
And thanks for reading!
