Deploying Deep Learning Models with Model Server

Anyone familiar with Deep Learning, or any similar buzzword, is usually also familiar with Jupyter Notebooks. Unfortunately, many associate Machine/Deep Learning only with Jupyter Notebooks and nothing beyond that. In practice, however, a Jupyter Notebook is a prototyping tool for model building, training, and experimentation.

Deployment of the models is an essential part of being a Data Scientist since, irrespective of the achieved accuracies, models that cannot be deployed will hardly be of any practical use.

Model deployment comes in many forms depending on the application. For example, if you are building a mobile app, you would likely want to use TensorFlow Lite or PyTorch Mobile to convert your models to an optimized format that can be integrated into the application. On the other hand, if you are building your applications for an edge device, then depending on the hardware, you may want to use TensorRT (for Jetson devices and Nvidia GPUs), OpenVINO (for Intel CPUs), or TensorFlow Lite (for Coral Edge TPU devices).

In this writeup, we will explore the part of deployment that deals with hosting a deep learning model so that it is available over a network for inference. The software that does this is known as a model server. Various open-source communities have developed their own model servers. A few popular ones are listed below:

TensorFlow Serving – official release by TensorFlow
TorchServe – official release by PyTorch
OpenVINO Model Server – official release by Intel OpenVINO
Triton Inference Server – official release by Nvidia
KServe (Formerly KFServing) – official release by Kubeflow (Kubernetes)

Exploring all of the above is beyond the scope of this article, so we will first learn how to build our own, and then explore the Triton Inference Server (by Nvidia), which is platform-independent and supports a wide variety of model formats.

Since we are exploring a relatively advanced topic, there are a few prerequisites, listed below:

Python – Everything we will be doing is based on Python
TensorFlow – That is my preferred framework, but feel free to follow along with PyTorch
Flask – You just need the Flask “hello world” steps; pretty easy to learn
OpenCV – In this example, we will be dealing with images
Communication Protocols: REST API request-response and gRPC

Before we dive into the hands-on section, let’s talk about what all the fuss is about. Fundamentally, a model server is a web server that hosts the deep learning model and allows it to be accessed over standard network protocols. Functionally, it also behaves like a web server: you send a request and get back a response. And just like a web server, the model can be accessed from any device as long as it shares a network with the server. A high-level block diagram illustrating the same is shown below.

As shown in the diagram, the primary advantage that a model server provides is its ability to “serve” multiple client requests simultaneously. This means that if multiple applications are using the same model to run inference, then a model server is the way to go.

This leads to a second advantage: since a single server handles all client requests, the model is loaded only once instead of once per application, so it does not consume excessive CPU/GPU memory. The memory footprint remains roughly that of a single model. Further, the model server can be hosted on a remote server (e.g., AWS, Azure, or GCP) or locally on the same physical system as your client(s). The inference latency varies with the network distance between the server and the client(s) and with the network bandwidth. A large number of simultaneous requests will slow inference down significantly; in that case, multiple instances of the model server can be hosted and the hosting hardware can be scaled up. But that is beyond the scope of this article.
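To make the request/response idea concrete, here is a minimal sketch that uses only the Python standard library. The `fake_model` function, the `/infer` path, and the JSON payload format are illustrative assumptions and do not belong to any real model server; a real server would load a trained model once at startup and call it in place of `fake_model`.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_model(pixels):
    """Hypothetical stand-in for a trained model: 1 if the mean pixel value exceeds 0.5, else 0."""
    return int(sum(pixels) / len(pixels) > 0.5)


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/infer":  # "/infer" is an assumed endpoint name
            self.send_error(404)
            return
        # Read the request body and decode the JSON list of pixel values
        length = int(self.headers["Content-Length"])
        pixels = json.loads(self.rfile.read(length))
        # Run the "model" and send the prediction back as JSON
        body = json.dumps({"prediction": fake_model(pixels)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging


def start_server(port=0):
    """Start the server on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_prediction(port, pixels):
    """Client side: send the input as JSON and read back the prediction."""
    connection = HTTPConnection("127.0.0.1", port, timeout=5)
    connection.request("POST", "/infer", body=json.dumps(pixels).encode())
    prediction = json.loads(connection.getresponse().read())["prediction"]
    connection.close()
    return prediction
```

`ThreadingHTTPServer` handles each request on its own thread, so several clients can share the one model held by the server process, which is the property described above.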

Now that the basics are out of the way, let’s dive into the hands-on part, shall we? I recommend that you use Linux for this tutorial. macOS and Windows should work fine, but no promises. I am using Ubuntu 20.04.3 LTS.

Step 0 - Installations & Versions – Feel free to skip this section if you are already a pro.

Python (v3.7.4): My recommendation: Miniconda – Download | Installation
TensorFlow (v2.7.0): pip install tensorflow
Flask (v2.0.2): pip install flask
TritonClient (v2.6.0): pip install tritonclient[all] --extra-index-url https://pypi.ngc.nvidia.com
OpenCV (v4.5.4.58): pip install opencv-python
Docker (v20.10.11): Linux (recommended) | Windows | MacOS

Matching the versions is not required. Feel free to install the latest version. They should probably work fine. In case they don’t, you have the versions for reference.
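If something does misbehave, it helps to know which versions you actually have. The following is a small sketch for listing them; it assumes Python 3.8 or newer (for `importlib.metadata`) and uses the pip package names from the install commands above.

```python
from importlib import metadata

# Package names as passed to pip in the install commands above
PACKAGES = ["tensorflow", "flask", "tritonclient", "opencv-python"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```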

Flask (Custom) Model Server

Step 1 - We need a model. So, let’s train one! Feel free to use any compatible model; it should work fine as long as it’s not overly complicated.

I will keep it simple and use the official Keras MNIST example, and save the model in TensorFlow SavedModel format.

import os

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = keras.Sequential(
    [
        keras.Input(shape=input_shape, name="input_1"),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax", name="output_1"),
    ]
)

model.summary()

batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

tf.keras.models.save_model(model, "mnist_model")

Step 2 - Build an inference script and integrate it with Flask to host it as a server which will act as the “model server”.

As a next step, the model server can be hosted on a remote machine/server; the process is similar to deploying a web app. Keep in mind, though, that the server needs enough computational resources to run the deep learning framework (PyTorch or TensorFlow) smoothly, so the free tier of Heroku will likely not make the cut.

from flask import Flask, jsonify, request
import cv2
import numpy as np
import tensorflow as tf

MODEL_PATH = 'mnist_model'

app = Flask(__name__)
app.config['DEBUG'] = False

# Load the Model
model = tf.keras.models.load_model(MODEL_PATH)

@app.route('/mnist_infer', methods=['POST'])
def mnist_classifier():
    # Receive the encoded image and convert it back to a NumPy array
    encoded_image = np.frombuffer(request.data, np.uint8)
    image = cv2.imdecode(encoded_image, -1) # Decode the image unchanged (no conversion to BGR)

    image = np.expand_dims([image], axis=-1) # (H, W) -> (1, H, W, 1): add batch and channel dimensions

    # Run inference on the image
    preds = model.predict(image)

    # np.argmax returns a NumPy integer, which jsonify cannot serialize directly, so send it as a string
    return jsonify(str(np.argmax(preds)))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, threaded=True, use_reloader=False)

Step 3 - Build a client script to randomly pick images, send them to the model server for inference and process the predictions to calculate the accuracy.

This script only demonstrates how to run inference. In practice, any device can make the inference request instead. For example, you can build an Android application that captures an image with the camera and sends it to the model server hosted on the cloud (AWS, Azure, GCP, etc.) for inference. Once the predictions are received, you can display the results in the Android app or process them further.

import cv2, random, requests
import numpy as np
from glob import glob
from tqdm import tqdm

TEST_MNIST_PATHS = glob("/media/ActiveTraining/WOBOT/data/MNIST Dataset JPG format/MNIST - JPG - testing/*/*.jpg")
MODEL_ENDPOINT = "http://localhost:5000/mnist_infer" # The server binds to 0.0.0.0; clients connect via localhost or the server's IP
NUM_SAMPLES = 1000
INPUT_SHAPE = (28, 28)

# Choose random images from the test set
path_choices = random.choices(TEST_MNIST_PATHS, k=NUM_SAMPLES)


def preprocess_image(image):
    image = cv2.resize(image, INPUT_SHAPE) # This step is not strictly required for MNIST
    image = image / 255 # Normalize Image
    return image

accuracy = list()

# Loop through each image and run inference
for test_mnist_path in tqdm(path_choices):
    label = int(test_mnist_path.split("/")[-2]) # Get the label
    image = cv2.imread(test_mnist_path, -1) # Read image without converting to BGR
    image = preprocess_image(image) # Preprocess the image
    
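    # Encode the image as JPG and send it to the model server
    # Note: JPEG holds 8-bit pixels, so OpenCV casts the normalized floats to uint8 when encoding;
    # sending the raw uint8 image and normalizing on the server would preserve more detail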
    _, img_encoded = cv2.imencode('.jpg', image)
    response = requests.post(MODEL_ENDPOINT, data=img_encoded.tobytes())
    pred = int(response.json()) # Decode the response to get the predictions

    accuracy.append(pred == label)

print(f"Testing Accuracy: {np.mean(accuracy)*100:.2f}%")
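The bookkeeping in the client script can be sketched in isolation with plain Python: the label comes from the name of the image's parent directory, the server's reply is a JSON string holding the digit, and the accuracy is the fraction of matching predictions. The helper names and the example path below are hypothetical and serve only as an illustration.

```python
import json
from pathlib import PurePosixPath


def label_from_path(image_path):
    """The parent directory name is the digit label, e.g. ".../7/img.jpg" -> 7."""
    return int(PurePosixPath(image_path).parent.name)


def parse_prediction(response_text):
    """The Flask server replies with a JSON string such as "7"; turn it back into an int."""
    return int(json.loads(response_text))


def accuracy_percent(predictions, labels):
    """Percentage of predictions that match their labels."""
    matches = [pred == label for pred, label in zip(predictions, labels)]
    return 100 * sum(matches) / len(matches)


if __name__ == "__main__":
    print(label_from_path("/data/testing/7/img_001.jpg"))
    print(parse_prediction('"7"\n'))
    print(accuracy_percent([7, 2, 1, 0], [7, 2, 1, 4]))
```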

Step 4 - Keep experimenting to get a thorough understanding of the inner workings of what is happening here.

Triton Inference Server

Now that we have built our own model server and run inference on it, we can move on to the next step, where we use a pre-built and much more optimized model server provided by Nvidia: the Triton Inference Server.

Before we move on to the hands-on section, I recommend you go through the beautiful block diagram provided by Nvidia.

Now, moving on to the hands-on section, we will use gRPC and Docker. Both require their own introductory article and hence are beyond the scope of this article. For simplicity:

gRPC: Think of it as a remote procedure call (RPC) framework developed by Google. It runs over HTTP/2 and serializes messages with Protocol Buffers, which generally makes it more efficient than a JSON-based REST API call. For a further read, I would recommend: What is gRPC? by Miguel Ratia and Google’s gRPC: A Lean and Mean Communication Protocol for Microservices by Janakiram MSV.
Docker: This is much easier to grasp – think of it as an isolated environment where you can do whatever you want without affecting your host device (mostly). You can also just download an existing environment and run it without any additional setup (pretty much what we are about to do). For a detailed read: What is Docker by IBM, and Official Docker Documentation – Nothing beats this, one of the best-documented products (my opinion).

Step 1 (PyTorch) - This will require additional steps and model conversions before moving to the next step. You can convert the PyTorch model to TorchScript or ONNX and then follow the specific directions in the Triton Server Model Repository readme.

Step 1 (TensorFlow) - We skip installations (since it’s already done), and move on to a specific directory structure required by the Triton Server.

Move the SavedModel contents to <model-name>/1/model.savedmodel/<SavedModel-contents>. The detailed instructions can be found in the official readme provided by Nvidia. In summary, the new directory structure should look something like this (The .py files are the python scripts discussed in this article):

.
├── mnist_model
│   ├── 1
│   │   └── model.savedmodel
│   │       ├── assets
│   │       ├── keras_metadata.pb
│   │       ├── saved_model.pb
│   │       └── variables
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
├── flask_client.py
├── flask_server.py
├── train_mnist.py
└── triton_client.py
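The move itself can be scripted. Below is a minimal sketch, assuming the SavedModel was written to a directory such as `mnist_model` by the training script; the function name `to_triton_layout` and the temporary `_staging` suffix are hypothetical choices for this example.

```python
import shutil
from pathlib import Path


def to_triton_layout(saved_model_dir, model_name, version="1"):
    """Move a SavedModel directory to <model-name>/<version>/model.savedmodel next to where it was."""
    source = Path(saved_model_dir)
    # Rename first, so the model may keep the same name as the SavedModel directory
    staging = source.with_name(source.name + "_staging")
    source.rename(staging)
    # Create <model-name>/<version>/ and move the SavedModel contents underneath it
    target = source.parent / model_name / version / "model.savedmodel"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staging), str(target))
    return target
```

Calling `to_triton_layout("mnist_model", "mnist_model")` from the project directory should produce the tree shown above. Note that it moves files in place, so run it only once.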

Step 2 - Pull the required Triton Server Docker image and run the container using the following command:

docker run --gpus=all --rm -it -p 8000-8002:8000-8002 --name triton_server -v $PWD:/models nvcr.io/nvidia/tritonserver:21.02-py3 tritonserver --model-repository=/models --strict-model-config=false

In this command, use the --gpus=all flag only if you have a GPU and have nvidia-docker installed. Otherwise, omit it to run inference on the CPU (slower).
Use the --rm (optional) flag if you want the container to be deleted once you stop the server.
Use the -it (optional) flag to view the container logs in your terminal and stop the server with a keyboard interrupt (Ctrl+C).
Use the -p flag to map the ports inside the container to your host system. By default, Triton serves HTTP on port 8000, gRPC on port 8001, and metrics on port 8002. Without this mapping, inference will not work, since programs on the host system will not have access to the ports inside the container.
Use --name (optional) to identify the container with a chosen name. Otherwise, a random one will automatically be assigned.
Use -v to mount the volumes into the container. Here $PWD points to the current directory, assuming your model is located in the same directory.
The docker image name is: nvcr.io/nvidia/tritonserver:21.02-py3
--model-repository=/models points to the directory containing the model name and the files inside it. (Directory structure is very important)
--strict-model-config=false: This flag allows the Triton Server to auto-configure itself for the given model. Alternatively, a config.pbtxt file can be created to specify the configuration explicitly. But that is again beyond the scope of this article, and more information can be found here: Triton Model Configuration Readme.
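Since the only flag that depends on your hardware is --gpus=all, the command can also be assembled programmatically. The sketch below merely rebuilds the command from this step as a list of arguments; the function name and its parameters are hypothetical, and it does not start Docker by itself.

```python
import shlex


def triton_docker_command(model_repository, use_gpu=True, name="triton_server",
                          image="nvcr.io/nvidia/tritonserver:21.02-py3"):
    """Build the `docker run` argument list used in this step."""
    command = ["docker", "run"]
    if use_gpu:
        command.append("--gpus=all")  # Only valid with a GPU and nvidia-docker installed
    command += [
        "--rm", "-it",
        "-p", "8000-8002:8000-8002",          # Map the container ports to the host
        "--name", name,
        "-v", f"{model_repository}:/models",  # Mount the model repository into the container
        image,
        "tritonserver",
        "--model-repository=/models",
        "--strict-model-config=false",
    ]
    return command


if __name__ == "__main__":
    # Print the CPU-only variant as a copy-pasteable shell command
    print(shlex.join(triton_docker_command("/path/to/models", use_gpu=False)))
```

The resulting list can be passed to `subprocess.run` if you prefer launching the container from Python.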

Step 3 - Verify if your model is loaded properly or not.

Once the model is loaded successfully, you should see this reported in the Docker logs, and the status for the given model should be “READY”. Another way to verify the model is to query its config through the REST API. If you have followed all the instructions properly, you should find the JSON model config by visiting http://localhost:8000/v2/models/mnist_model/config in your browser. The general format for this URL is: http://<host-ip-address>:<mapped-http-port>/v2/models/<model-name>/config.
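The same check can be done from Python instead of the browser. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the URL follows the general format given above.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def model_config_url(host, http_port, model_name):
    """Build the config URL in the format described above."""
    return f"http://{host}:{http_port}/v2/models/{model_name}/config"


def fetch_model_config(host="localhost", http_port=8000, model_name="mnist_model", timeout=5):
    """Return the model config as a dict, or None if the server or model is not reachable."""
    try:
        with urlopen(model_config_url(host, http_port, model_name), timeout=timeout) as response:
            return json.load(response)
    except (URLError, OSError, ValueError):
        # Connection refused, HTTP error, timeout, or a reply that is not valid JSON
        return None
```

With the container from Step 2 running, `fetch_model_config()` should return a dictionary describing the model; a `None` result suggests the server is not up or the model name is wrong.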

Step 4 - Triton Inference Client Script with gRPC – This is again similar to the previous client script that we had created for the Flask Model Server.

import cv2, random
import numpy as np
from glob import glob
from tqdm import tqdm
import tritonclient.grpc as grpcclient

# GLOBAL VARIABLES
TEST_MNIST_PATHS = glob("/media/ActiveTraining/WOBOT/data/MNIST Dataset JPG format/MNIST - JPG - testing/*/*.jpg")
NUM_SAMPLES = 1000
INPUT_SHAPE = (28, 28)

# Triton Variables
TRITON_IP = "localhost"
TRITON_PORT = 8001
MODEL_NAME = "mnist_model"
INPUTS = []
OUTPUTS = []
INPUT_LAYER_NAME = "input_1"
OUTPUT_LAYER_NAME = "output_1"

# Choose random images from the test set
path_choices = random.choices(TEST_MNIST_PATHS, k=NUM_SAMPLES)


def preprocess_image(image):
    image = cv2.resize(image, INPUT_SHAPE) # This step is not strictly required for MNIST
    image = image / 255 # Normalize Image
    image = np.expand_dims([image], axis=-1) # Increase the dimensions to match that of the model
    return image.astype(np.float32)

def postprocess_output(preds):
    return np.argmax(np.squeeze(preds))

accuracy = list()

# Triton Initializations
INPUTS.append(grpcclient.InferInput(INPUT_LAYER_NAME, [1, INPUT_SHAPE[0], INPUT_SHAPE[1], 1], "FP32"))
OUTPUTS.append(grpcclient.InferRequestedOutput(OUTPUT_LAYER_NAME))
TRITON_CLIENT = grpcclient.InferenceServerClient(url=f"{TRITON_IP}:{TRITON_PORT}")

# Loop through each image and run inference
for test_mnist_path in tqdm(path_choices):
    label = int(test_mnist_path.split("/")[-2]) # Get the label
    image = cv2.imread(test_mnist_path, -1) # Read image without converting to BGR
    image = preprocess_image(image) # Preprocess the image

    # Run the Inference using the gRPC triton client
    INPUTS[0].set_data_from_numpy(image) # Set the Inputs
    result = TRITON_CLIENT.infer(model_name=MODEL_NAME, inputs=INPUTS, outputs=OUTPUTS, headers={}) # Run Inference
    output = np.squeeze(result.as_numpy(OUTPUT_LAYER_NAME)) # Process the Outputs
    pred = postprocess_output(output) # Postprocess (Argmax)

    # Record Accuracy
    accuracy.append(pred == label) 

print(f"Testing Accuracy: {np.mean(accuracy)*100:.2f}%")

Step 5 - Again, keep experimenting, and I do encourage trying with different models.

And we are done! Congratulations on taking the first step towards model deployment! And trust me, you still have a long way to go! This is just the tip of the iceberg 😃

Feel free to try out the other model servers that are also available. Each has its own strengths and weaknesses which you will come to learn once you start using them.