How We Implemented the Face-with-Mask Detection Web App for Chrome

Written by yantsishko | Published 2021/06/19
Tech Story Tags: javascript | tensorflowjs | face-mask-detection | computer-vision | image-detection | face-recognition | mask-recognition | programming

TLDR: In the previous article, I discussed machine learning (in particular, face and mask detection) in the browser. Here I want to give the technical details of the implementation. The application uses several neural networks to detect different events: face detection and mask detection. Each model/network runs in a separate thread (Web Worker). The neural networks are launched with TensorFlow.js, with WebAssembly or WebGL as the backend. Each worker's result is saved and can be displayed in the UI.

In the previous article, I discussed whether it is possible to use machine learning (in particular, face and mask detection) in the browser, approaches to detection, and optimization of all processes.
Today I want to give the technical details of the implementation.

Technologies

The primary language for development is TypeScript. The client application is written in React.js.
The application uses several neural networks to detect different events: face detection and mask detection. Each model/network runs in a separate thread (Web Worker). The neural networks are launched with TensorFlow.js, using WebAssembly or WebGL as a backend, which allows the code to execute at close to native speed. The choice of backend depends on the size of the model (small models tend to run faster on WebAssembly), but you should always test and pick whichever is faster for a particular model.
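As a rough illustration, here is how a backend can be selected and compared. This is only a sketch: the matrix-multiply workload stands in for a real model (in practice you would time your own model's predict() call), and the WASM backend may also need setWasmPaths() to locate its binaries.

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm'; // registers the 'wasm' backend

async function timeBackend(backendName) {
  await tf.setBackend(backendName);
  await tf.ready();

  const start = performance.now();
  // Stand-in workload; replace with model.predict() on a representative input.
  tf.tidy(() => tf.matMul(tf.randomNormal([256, 256]), tf.randomNormal([256, 256])).dataSync());
  return performance.now() - start;
}

const webglMs = await timeBackend('webgl');
const wasmMs = await timeBackend('wasm');
await tf.setBackend(wasmMs < webglMs ? 'wasm' : 'webgl');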
The video stream is received and displayed using WebRTC, and the OpenCV.js library is used to work with images.
The following approach was implemented:
The main thread only orchestrates the processes. It doesn't load the heavy OpenCV library and doesn't use TensorFlow.js; it grabs images from the video stream and sends them to the web workers for processing.
A new image is not sent to a worker until that worker informs the main thread that it is free and can process the next image. This way no queue builds up, and each time we process the most recent image.
The image is first sent for face detection; only if a face is detected is it then sent for mask detection. Each worker result is saved and can be displayed on the UI.
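Here is a simplified sketch of that orchestration. The worker file names, the message fields, and the onNewFrame hook are illustrative rather than the project's exact code; the real message shapes are shown later in the article.

// Hypothetical orchestration in the main thread: frames are sent only when the
// face worker is free, and mask detection runs only after a face is found.
const faceDetectionWorker = new Worker(new URL('./faceDetection.worker.js', import.meta.url)); // illustrative paths
const maskDetectionWorker = new Worker(new URL('./maskDetection.worker.js', import.meta.url));
let faceWorkerBusy = false;

faceDetectionWorker.onmessage = (event) => {
  if (event.data.type === 'faceResults') {
    faceWorkerBusy = false;
    if (event.data.predictions.length > 0) {
      // Forward the aligned face thumbnail for mask detection (field names are illustrative).
      maskDetectionWorker.postMessage({
        type: 'detectMask',
        prediction: event.data.predictions[0],
        imageDataToProcess: event.data.facesThumbnailsImageData[0],
        lastIndex: event.data.lastIndex,
      });
    }
  }
};

function onNewFrame(frame) {
  if (faceWorkerBusy) {
    frame.close(); // drop stale frames instead of building a queue
    return;
  }
  faceWorkerBusy = true;
  faceDetectionWorker.postMessage({ type: 'detectFace', originalImageToProcess: frame, lastIndex: 0 }, [frame]);
}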

Performance

  • Receiving an image from a stream - 31 ms
  • Face detection preprocessing - 0-1 ms
  • Face detection - 51 ms
  • Face detection post-processing - 8 ms
  • Mask detection preprocessing - 2 ms
  • Mask detection - 11 ms
  • Mask detection post-processing - 0-1 ms
Total: 
  • Face detection - 60 ms + 31 ms = 91 ms
  • Mask detection - 14 ms
So in ~105 ms we have all the information from the image.
  1. Face detection preprocessing - getting the image from the stream and sending it to the web worker.
  2. Face detection post-processing - saving the result from the face detection worker and drawing it on the canvas.
  3. Mask detection preprocessing - preparing a canvas with an aligned face image and transferring it to the web worker.
  4. Mask detection post-processing - saving the results of mask detection.
Each model (face detection and mask detection) runs in a separate web worker, which loads the necessary libraries (OpenCV.js, TensorFlow.js) and models.
We have 3 web workers:
  • Face detection
  • Mask detection
  • A helper worker that transforms images using heavy OpenCV and TensorFlow.js methods, for example, to build a calibration matrix for multiple cameras.

Features and tricks that helped us in development and optimization

Web workers and how to work with them
A web worker is a way to run a script on a separate thread.
Web workers allow heavy processing to run in parallel with the main thread without blocking the UI. The main thread executes the orchestration logic; all heavy computation runs in the web workers. Web workers are supported in almost all browsers.
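A minimal skeleton of this pattern looks like the following (file names are illustrative):

// main.js: spawn the worker and exchange messages
const worker = new Worker(new URL('./detection.worker.js', import.meta.url));

worker.onmessage = (event) => {
  console.log('result from worker:', event.data);
};
worker.postMessage({ type: 'ping' });

// detection.worker.js: runs on its own thread and never blocks the UI
self.onmessage = (event) => {
  // ...heavy computation goes here...
  self.postMessage({ type: 'pong', payload: event.data });
};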

Features and limitations of web workers

Features:
  • Access only to a subset of JavaScript features
  • Access to the navigator object
  • Read-only access to the location object
  • Possibility to use XMLHttpRequest
  • Possibility to use setTimeout() / clearTimeout() and setInterval() / clearInterval()
  • Application Cache
  • Importing external scripts using importScripts()
  • Creating other web workers
Limitations:
  • No access to the DOM
  • No access to the window object
  • No access to the document object
  • No access to the parent object
Communication between the main thread and the web workers is handled with postMessage and the onmessage event handler.
If you look at the specification of the postMessage() method, you will notice that it accepts not only the data but also a second argument: a list of transferable objects.
worker.postMessage(message, [transfer]);
Let's see how using it will help us.
Transferable is an interface for objects that can be passed between different execution contexts, such as the main thread and web workers.
This interface is implemented in:
  • ImageBitmap
  • OffscreenCanvas
  • ArrayBuffer
  • MessagePort
If we want to pass 500 MB of data to a worker, we can do it without the second argument, but the difference shows up in transfer time and memory usage.
Sending the data without the transfer argument takes 149 ms and 1042 MB in Google Chrome, and even more in other browsers.
With the transfer argument it takes 1 ms and cuts memory consumption in half!
Since images are constantly passed from the main thread to the web workers, it is important to do this as quickly and as memory-efficiently as possible, and this feature helps us a lot.
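A small sketch of the difference (the worker, the buffer size, and the message shapes are illustrative):

const worker = new Worker(new URL('./detection.worker.js', import.meta.url)); // illustrative
const buffer = new ArrayBuffer(500 * 1024 * 1024); // ~500 MB, as in the measurement above

// Copying: the ArrayBuffer is structured-cloned, which costs time and doubles memory usage.
worker.postMessage({ type: 'frame', buffer });

// Transferring: ownership moves to the worker almost instantly;
// after this call buffer.byteLength is 0 in the main thread.
worker.postMessage({ type: 'frame', buffer }, [buffer]);

// The same applies to ImageBitmap, which is a convenient way to hand over video frames:
const video = document.querySelector('video');
createImageBitmap(video).then((bitmap) => {
  worker.postMessage({ type: 'detectFace', bitmap }, [bitmap]);
});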

Using OffscreenCanvas

A web worker does not have access to the DOM, so you cannot use a canvas directly; OffscreenCanvas comes to the rescue.
Advantages:
  • Fully detached from the DOM
  • It can be used both in the main thread and in web workers
  • It implements the Transferable interface and does not load the main thread if rendering runs in a web worker (a sketch follows below)
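Here is a minimal sketch of a worker that rasterizes a transferred frame with OffscreenCanvas, similar to what the detection workers below do (names and sizes are illustrative):

// detection.worker.js: a worker can create and draw on a canvas without touching the DOM
const offscreen = new OffscreenCanvas(640, 480);
const ctx = offscreen.getContext('2d');

self.onmessage = (event) => {
  const bitmap = event.data.bitmap; // an ImageBitmap transferred from the main thread
  ctx.drawImage(bitmap, 0, 0, offscreen.width, offscreen.height);
  const imageData = ctx.getImageData(0, 0, offscreen.width, offscreen.height);
  bitmap.close();
  // ...run preprocessing / inference on imageData...
  self.postMessage({ type: 'preprocessed', width: imageData.width, height: imageData.height });
};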

Advantages of using requestAnimationFrame

requestAnimationFrame allows you to grab images from the stream at maximum performance (up to 60 FPS); in practice it is limited only by the camera's capabilities, since not all cameras deliver video at that rate. A frame-grabbing loop built on it is sketched after the list below.
The main advantages are:
  • The browser optimizes requestAnimationFrame calls together with other animations and redraws.
  • Lower power consumption, which is very important for mobile devices.
  • Calls do not pile up in a queue when the browser is busy.
  • The minimum interval between calls is 16.67 ms (1000 ms / 60 fps = 16.67 ms).
  • The call frequency can be throttled manually.
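A minimal sketch of such a frame-grabbing loop might look like this (the worker, its message shape, and the busy flag are illustrative stand-ins for the orchestration described earlier):

// Hypothetical frame-grabbing loop driven by requestAnimationFrame.
const video = document.querySelector('video'); // element playing the WebRTC stream
const faceDetectionWorker = new Worker(new URL('./faceDetection.worker.js', import.meta.url)); // illustrative
let workerBusy = false;

faceDetectionWorker.onmessage = () => {
  workerBusy = false; // results handled elsewhere; the worker is free again
};

async function grabFrame() {
  if (!workerBusy) {
    workerBusy = true;
    const bitmap = await createImageBitmap(video);
    faceDetectionWorker.postMessage({ type: 'detectFace', bitmap }, [bitmap]);
  }
  requestAnimationFrame(grabFrame); // the browser caps this at the display refresh rate
}

requestAnimationFrame(grabFrame);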

Application metrics

At first, stats.js seemed like a good choice for displaying application metrics, but once the number of metrics grew past 20, the main thread of the application began to slow down because of how the browser renders them. Each metric draws its graph on a separate canvas (and data arrive there very frequently), so the browser re-renders at a high rate, which negatively affects the application. As a result, the measured metrics come out lower than they really are.
To avoid this problem, it is better to drop the eye candy and simply display the current value and the running average for the whole session as text. Updating a text value in the DOM is much faster than rendering graphics.
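For example, a plain-text metric can be as simple as this sketch (the element id and function name are hypothetical):

const metricNode = document.getElementById('face-detection-ms'); // hypothetical element
let total = 0;
let count = 0;

function reportFaceDetectionTime(ms) {
  total += ms;
  count += 1;
  // Plain text updates are far cheaper than redrawing a per-metric canvas graph.
  metricNode.textContent = `${ms.toFixed(1)} ms (avg ${(total / count).toFixed(1)} ms)`;
}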

Memory leak control

Quite often during development we ran into memory leaks on mobile devices, while on a desktop the application could run for a very long time.
Inside a web worker it is impossible to know how much memory it actually consumes (performance.memory does not work in web workers).
Because of this, we made the application able to run either through web workers or entirely in the main thread. By running all the detection models on the main thread, we can collect memory consumption metrics, see where the memory leak is, and fix it.
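A small sketch of the kind of check this enables; performance.memory is a non-standard, Chrome-only API, which is exactly why the models have to run in the main thread to measure it:

// Only meaningful when the detection models run in the main thread:
// performance.memory is non-standard, Chrome-only, and unavailable in web workers.
function logHeapUsage(label) {
  if (performance.memory) {
    const toMb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
    console.log(
      `${label}: used ${toMb(performance.memory.usedJSHeapSize)} MB ` +
      `of ${toMb(performance.memory.jsHeapSizeLimit)} MB limit`,
    );
  }
}

setInterval(() => logHeapUsage('detection loop'), 5000);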

    The main code of models in web workers

We have covered the main tricks used in the application; now let's look at the implementation.
For working with web workers, comlink-loader was used initially. It is a very handy library that lets you treat a worker as a class instance without touching the onmessage and postMessage methods directly, and control the asynchronous code with async/await. All this was convenient until the application was launched on a tablet (Samsung Galaxy Tab S7) and suddenly crashed after 2 minutes.
After analyzing all the code, no memory leaks were found, apart from the black box of this worker library: for some reason the loaded TensorFlow.js models were not released and were kept somewhere inside it.
We decided to switch to worker-loader, which lets you work with web workers from plain JS without extra layers. That solved the problem; the application now runs for days without crashing.
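For reference, creating a worker with worker-loader typically looks like the sketch below. The worker file name and the factory wrapper are assumptions based on the factory call in the next snippet, not code taken from the project.

// webpack's worker-loader (inline syntax) turns the file into a Worker constructor.
// The worker file name here is illustrative.
import FaceRgbDetectionWorker from 'worker-loader!./faceRgbDetection.worker';

export const FaceRgbDetectionWorkerFactory = {
  createWebWorker() {
    return new FaceRgbDetectionWorker();
  },
};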

    Face detection worker

    Create web worker
    this.faceDetectionWorker = workers.FaceRgbDetectionWorkerFactory.createWebWorker();
Create a handler in the main thread for messages coming from the worker
    this.faceDetectionWorker.onmessage = async (event) => {
     if (event.data.type === 'load') {
       this.faceDetectionWorker.postMessage({
         type: 'init',
         backend,
         streamSettings,
         faceDetectionSettings,
         imageRatio: this.imageRatio,
       });
     } else if (event.data.type === 'init') {
       this.isFaceWorkerInit = event.data.status;
    
       // Only when both workers are initialized do we start grabbing and processing frames
       if (this.isFaceWorkerInit && this.isMaskWorkerInit) {
         await this.grabFrame();
       }
     } else if (event.data.type === 'faceResults') {
       this.onFaceDetected(event);
     } else {
       throw new Error(`Type=${event.data.type} is not supported by RgbVideo for FaceRgbDetectionWorker`);
     }
    };
    
    Sending an image for face processing
    this.faceDetectionWorker.postMessage(
     {
       type: 'detectFace',
       originalImageToProcess: this.lastImage,
       lastIndex: lastItem!.index,
     },
     [this.lastImage], // transferable object
    );
    
    Face detection web worker code
The init method initializes all the models, libraries, and canvases that it needs to work with.
    export const init = async (data) => {
     const { backend, streamSettings, faceDetectionSettings, imageRatio } = data;
    
     flipHorizontal = streamSettings.flipHorizontal;
     faceMinWidth = faceDetectionSettings.faceMinWidth;
     faceMinWidthConversionFactor = faceDetectionSettings.faceMinWidthConversionFactor;
     predictionIOU = faceDetectionSettings.predictionIOU;
     recommendedLocation = faceDetectionSettings.useRecommendedLocation ? faceDetectionSettings.recommendedLocation : null;
     detectedFaceThumbnailSize = faceDetectionSettings.detectedFaceThumbnailSize;
     srcImageRatio = imageRatio;
     await tfc.setBackend(backend);
     await tfc.ready();
    
     const [blazeModel] = await Promise.all([
       blazeface.load({
         // The maximum number of faces returned by the model
         maxFaces: faceDetectionSettings.maxFaces,
         // The width of the input image
         inputWidth: faceDetectionSettings.faceDetectionImageMinWidth,
         // The height of the input image
         inputHeight: faceDetectionSettings.faceDetectionImageMinHeight,
         // The threshold for deciding whether boxes overlap too much
         iouThreshold: faceDetectionSettings.iouThreshold,
         // The threshold for deciding when to remove boxes based on score
         scoreThreshold: faceDetectionSettings.scoreThreshold,
       }),
       isOpenCvLoaded(),
     ]);
    
     faceDetection = new FaceDetection();
     originalImageToProcessCanvas = new OffscreenCanvas(srcImageRatio.videoWidth, srcImageRatio.videoHeight);
     originalImageToProcessCanvasCtx = originalImageToProcessCanvas.getContext('2d');
    
     resizedImageToProcessCanvas = new OffscreenCanvas(
       srcImageRatio.faceDetectionImageWidth,
       srcImageRatio.faceDetectionImageHeight,
     );
     resizedImageToProcessCanvasCtx = resizedImageToProcessCanvas.getContext('2d');
     return blazeModel;
    };
    
The isOpenCvLoaded method waits for OpenCV to load
    export const isOpenCvLoaded = () => {
     let timeoutId;
    
     const resolveOpenCvPromise = (resolve) => {
       if (timeoutId) {
         clearTimeout(timeoutId);
       }
    
       try {
         // eslint-disable-next-line no-undef
         if (cv && cv.Mat) {
           return resolve();
         } else {
           timeoutId = setTimeout(() => {
             resolveOpenCvPromise(resolve);
           }, OpenCvLoadedTimeoutInMs);
         }
       } catch {
         timeoutId = setTimeout(() => {
           resolveOpenCvPromise(resolve);
         }, OpenCvLoadedTimeoutInMs);
       }
     };
    
     return new Promise((resolve) => {
       resolveOpenCvPromise(resolve);
     });
    };
    
    Face detection method
    export const detectFace = async (data, faceModel) => {
     let { originalImageToProcess, lastIndex } = data;
     const facesThumbnailsImageData = [];
    
     // Resize original image to the recommended BlazeFace resolution
     resizedImageToProcessCanvasCtx.drawImage(
       originalImageToProcess,
       0,
       0,
       srcImageRatio.faceDetectionImageWidth,
       srcImageRatio.faceDetectionImageHeight,
     );
     // Getting resized image
     let resizedImageDataToProcess = resizedImageToProcessCanvasCtx.getImageData(
       0,
       0,
       srcImageRatio.faceDetectionImageWidth,
       srcImageRatio.faceDetectionImageHeight,
     );
     // Detect faces by BlazeFace
     let predictions = await faceModel.estimateFaces(
       // The image to classify. Can be a tensor, DOM element image, video, or canvas
       resizedImageDataToProcess,
       // Whether to return tensors as opposed to values
       returnTensors,
       // Whether to flip/mirror the facial keypoints horizontally. Should be true for videos that are flipped by default (e.g. webcams)
       flipHorizontal,
       // Whether to annotate bounding boxes with additional properties such as landmarks and probability. Pass in `false` for faster inference if annotations are not needed
       annotateBoxes,
     );
     // Normalize predictions
     predictions = faceDetection.normalizePredictions(
       predictions,
       returnTensors,
       annotateBoxes,
       srcImageRatio.faceDetectionImageRatio,
     );
  // Filter the initial predictions by the criterion that all landmarks must be inside the area of interest
     predictions = faceDetection.filterPredictionsByFullLandmarks(
       predictions,
       srcImageRatio.videoWidth,
       srcImageRatio.videoHeight,
     );
     // Filters predictions by min face width
     predictions = faceDetection.filterPredictionsByMinWidth(predictions, faceMinWidth, faceMinWidthConversionFactor);
     // Filters predictions by recommended location
     predictions = faceDetection.filterPredictionsByRecommendedLocation(predictions, predictionIOU, recommendedLocation);
    
  // If there are any predictions, start extracting face thumbnails at the configured size
     if (predictions && predictions.length > 0) {
       // Draw initial original image
       originalImageToProcessCanvasCtx.drawImage(originalImageToProcess, 0, 0);
       const originalImageDataToProcess = originalImageToProcessCanvasCtx.getImageData(
         0,
         0,
         originalImageToProcess.width,
         originalImageToProcess.height,
       );
    
       // eslint-disable-next-line no-undef
       let srcImageData = cv.matFromImageData(originalImageDataToProcess);
       try {
         for (let i = 0; i < predictions.length; i++) {
           const prediction = predictions[i];
           const facesOriginalLandmarks = JSON.parse(JSON.stringify(prediction.originalLandmarks));
    
           if (flipHorizontal) {
             for (let j = 0; j < facesOriginalLandmarks.length; j++) {
               facesOriginalLandmarks[j][0] = srcImageRatio.videoWidth - facesOriginalLandmarks[j][0];
             }
           }
    
           // eslint-disable-next-line no-undef
           let dstImageData = new cv.Mat();
           try {
             // eslint-disable-next-line no-undef
             let thumbnailSize = new cv.Size(detectedFaceThumbnailSize, detectedFaceThumbnailSize);
    
             let transformation = getOneToOneFaceTransformationByTarget(detectedFaceThumbnailSize);
    
             // eslint-disable-next-line no-undef
             let similarityTransformation = getSimilarityTransformation(facesOriginalLandmarks, transformation);
             // eslint-disable-next-line no-undef
             let similarityTransformationMatrix = cv.matFromArray(3, 3, cv.CV_64F, similarityTransformation.data);
    
             try {
               // eslint-disable-next-line no-undef
               cv.warpPerspective(
                 srcImageData,
                 dstImageData,
                 similarityTransformationMatrix,
                 thumbnailSize,
                 cv.INTER_LINEAR,
                 cv.BORDER_CONSTANT,
                 new cv.Scalar(127, 127, 127, 255),
               );
    
               facesThumbnailsImageData.push(
                 new ImageData(
                   new Uint8ClampedArray(dstImageData.data, dstImageData.cols, dstImageData.rows),
                   detectedFaceThumbnailSize,
                   detectedFaceThumbnailSize,
                 ),
               );
             } finally {
               similarityTransformationMatrix.delete();
               similarityTransformationMatrix = null;
             }
           } finally {
             dstImageData.delete();
             dstImageData = null;
           }
         }
       } finally {
         srcImageData.delete();
         srcImageData = null;
       }
     }
    
     return { resizedImageDataToProcess, predictions, facesThumbnailsImageData, lastIndex };
    };
    
The input is an image and an index that is later used to match the face with its mask detection result.
Since BlazeFace accepts images with a maximum size of 128 px, the camera image has to be downscaled.
Calling the faceModel.estimateFaces method runs the image through BlazeFace, and the predictions, containing the coordinates of the face box, nose, ears, eyes, and mouth area, are returned to the main thread.
Before working with them, you need to map the coordinates back to the original image, because we compressed it to 128 px.
Now you can use this data to decide whether the face is in the desired area and whether it meets the minimum face width required for subsequent identification.
The last part of the method cuts the face out of the image and aligns it using OpenCV methods so that the mask can be identified.
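For illustration, the coordinate restoration roughly boils down to the following sketch. It assumes returnTensors is false, so each BlazeFace prediction holds plain arrays, and that faceDetectionImageRatio is the original-to-resized scale factor; the project's actual normalizePredictions helper is not shown in the article.

// Simplified sketch: map a BlazeFace prediction from the downscaled detection image
// back to the original video resolution (ratio is assumed to be
// originalWidth / faceDetectionImageWidth).
function scalePredictionToOriginal(prediction, ratio) {
  return {
    ...prediction,
    topLeft: [prediction.topLeft[0] * ratio, prediction.topLeft[1] * ratio],
    bottomRight: [prediction.bottomRight[0] * ratio, prediction.bottomRight[1] * ratio],
    landmarks: prediction.landmarks.map(([x, y]) => [x * ratio, y * ratio]),
  };
}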

    Mask detection

Model initialization and the WebAssembly backend
    export const init = async (data) => {
     const { backend, streamSettings, maskDetectionsSettings, imageRatio } = data;
    
     flipHorizontal = streamSettings.flipHorizontal;
     detectedMaskThumbnailSize = maskDetectionsSettings.detectedMaskThumbnailSize;
     srcImageRatio = imageRatio;
     await tfc.setBackend(backend);
     await tfc.ready();
     const [maskModel] = await Promise.all([
       tfconv.loadGraphModel(
         `/rgb_mask_classification_first/MobileNetV${maskDetectionsSettings.mobileNetVersion}_${maskDetectionsSettings.mobileNetWeight}/${maskDetectionsSettings.mobileNetType}/model.json`,
       ),
     ]);
    
     detectedMaskThumbnailCanvas = new OffscreenCanvas(detectedMaskThumbnailSize, detectedMaskThumbnailSize);
     detectedMaskThumbnailCanvasCtx = detectedMaskThumbnailCanvas.getContext('2d');
     return maskModel;
    };
    
Mask detection requires the coordinates of the eyes, ears, nose, and mouth, along with the aligned face image, both of which are returned by the face detection worker.
    this.maskDetectionWorker.postMessage({
     type: 'detectMask',
     prediction: lastItem!.data.predictions[0],
     imageDataToProcess,
     lastIndex: lastItem!.index,
    });
    
    Detection method
    export const detectMask = async (data, maskModel) => {
     let { prediction, imageDataToProcess, lastIndex } = data;
     const masksScores = [];
     const maskLandmarks = JSON.parse(JSON.stringify(prediction.landmarks));
    
     if (flipHorizontal) {
       for (let j = 0; j < maskLandmarks.length; j++) {
         maskLandmarks[j][0] = srcImageRatio.faceDetectionImageWidth - maskLandmarks[j][0];
       }
     }
     // Draw thumbnail with mask
     detectedMaskThumbnailCanvasCtx.putImageData(imageDataToProcess, 0, 0);
     // Detect mask via NN
     let predictionTensor = tfc.tidy(() => {
       let maskDetectionSnapshotFromPixels = tfc.browser.fromPixels(detectedMaskThumbnailCanvas);
   let maskDetectionSnapshotFromPixelsFloat32 = tfc.cast(maskDetectionSnapshotFromPixels, 'float32');
   let expandedDims = maskDetectionSnapshotFromPixelsFloat32.expandDims(0);
    
       return maskModel.predict(expandedDims);
     });
     // Put mask detection result into the returned array
     try {
       masksScores.push(predictionTensor.dataSync()[0].toFixed(4));
     } finally {
       predictionTensor.dispose();
       predictionTensor = null;
     }
    
     return {
       masksScores,
       lastIndex,
     };
    };
    
The result of the neural network is the probability that a mask is present, which is returned from the worker. This makes it easy to raise or lower the mask detection threshold. By lastIndex, we can match the face with the presence of a mask and display information about a specific person on the screen.
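As an illustration, the main thread might consume this result roughly as sketched below; the 'maskResults' message type, the threshold value, and the updateFaceOverlay helper are assumptions rather than the project's code.

const MASK_SCORE_THRESHOLD = 0.5; // illustrative; tune for your model and tolerance for false positives

maskDetectionWorker.onmessage = (event) => {
  if (event.data.type !== 'maskResults') return; // message type name is an assumption

  const { masksScores, lastIndex } = event.data;
  const hasMask = Number(masksScores[0]) >= MASK_SCORE_THRESHOLD;

  // lastIndex ties this result back to the face detected earlier in the same frame,
  // so the overlay can be drawn next to the right person.
  updateFaceOverlay(lastIndex, { hasMask }); // hypothetical UI helper
};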

    Conclusion

I hope this article helps you learn about the possibilities of working with ML in the browser and the ways to optimize it. Most applications can be optimized using the tricks described above.

    Written by yantsishko | Skilled Front-End engineer with 7 years of experience in developing Web and SmartTV applications
    Published by HackerNoon on 2021/06/19