How We Implemented the Face-with-Mask Detection Web App for Chrome

Written by yantsishko | Published 2021/06/19
Tech Story Tags: javascript | tensorflowjs | face-mask-detection | computer-vision | image-detection | face-recognition | mask-recognition | programming

TLDR: In the previous article, I discussed machine learning (in particular, face and mask detection) in the browser. Here I want to give the technical details of the implementation. The application uses several neural networks to detect different events: face detection and mask detection. Each model/network runs in a separate thread (Web Worker). The neural networks are launched with TensorFlow.js, with WebAssembly or WebGL as the backend. Each worker's result is saved and can be displayed in the UI.

In the previous article, I discussed whether it is possible to use machine learning (in particular, face and mask detection) in the browser, approaches to detection, and optimization of all processes.
Today I want to give the technical details of the implementation.

Technologies

The primary language for development is TypeScript. The client application is written in React.js.
The application uses several neural networks to detect different events: face detection and mask detection. Each model/network runs in a separate thread (Web Worker). The neural networks are launched with TensorFlow.js, using WebAssembly or WebGL as a backend, which allows the code to execute at close to native speed. The choice of backend depends on the size of the model (small models tend to run faster on WebAssembly), but you should always test and pick whichever is faster for a particular model.
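As a rough illustration, here is how a backend can be selected and compared. This is only a sketch: the matrix-multiply workload stands in for a real model (in practice you would time your own model's predict() call), and the WASM backend may also need setWasmPaths() to locate its binaries.

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm'; // registers the 'wasm' backend

async function timeBackend(backendName) {
  await tf.setBackend(backendName);
  await tf.ready();

  const start = performance.now();
  // Stand-in workload; replace with model.predict() on a representative input.
  tf.tidy(() => tf.matMul(tf.randomNormal([256, 256]), tf.randomNormal([256, 256])).dataSync());
  return performance.now() - start;
}

const webglMs = await timeBackend('webgl');
const wasmMs = await timeBackend('wasm');
await tf.setBackend(wasmMs < webglMs ? 'wasm' : 'webgl');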
The video stream is received and displayed using WebRTC, and the OpenCV.js library is used to work with images.
The following approach was implemented:
The main thread only orchestrates the processes. It doesn't load the heavy OpenCV library and doesn't use TensorFlow.js; it grabs images from the video stream and sends them to the web workers for processing.
A new image is not sent to a worker until that worker informs the main thread that it is free and can process the next image. This way no queue builds up, and each time we process the most recent image.
The image is first sent for face detection; only if a face is detected is it then sent for mask detection. Each worker result is saved and can be displayed on the UI.
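Here is a simplified sketch of that orchestration. The worker file names, the message fields, and the onNewFrame hook are illustrative rather than the project's exact code; the real message shapes are shown later in the article.

// Hypothetical orchestration in the main thread: frames are sent only when the
// face worker is free, and mask detection runs only after a face is found.
const faceDetectionWorker = new Worker(new URL('./faceDetection.worker.js', import.meta.url)); // illustrative paths
const maskDetectionWorker = new Worker(new URL('./maskDetection.worker.js', import.meta.url));
let faceWorkerBusy = false;

faceDetectionWorker.onmessage = (event) => {
  if (event.data.type === 'faceResults') {
    faceWorkerBusy = false;
    if (event.data.predictions.length > 0) {
      // Forward the aligned face thumbnail for mask detection (field names are illustrative).
      maskDetectionWorker.postMessage({
        type: 'detectMask',
        prediction: event.data.predictions[0],
        imageDataToProcess: event.data.facesThumbnailsImageData[0],
        lastIndex: event.data.lastIndex,
      });
    }
  }
};

function onNewFrame(frame) {
  if (faceWorkerBusy) {
    frame.close(); // drop stale frames instead of building a queue
    return;
  }
  faceWorkerBusy = true;
  faceDetectionWorker.postMessage({ type: 'detectFace', originalImageToProcess: frame, lastIndex: 0 }, [frame]);
}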

Performance

  • Receiving an image from a stream - 31 ms
  • Face detection preprocessing - 0-1 ms
  • Face detection - 51 ms
  • Face detection post-processing - 8 ms
  • Mask detection preprocessing - 2 ms
  • Mask detection - 11 ms
  • Mask detection post-processing - 0-1 ms
Total: 
  • Face detection - 60 ms + 31 ms = 91 ms
  • Mask detection - 14 ms
So in ~105 ms we have all the information from the image.
  1. Face detection preprocessing - getting the image from the stream and sending it to the web worker.
  2. Face detection post-processing - saving the result from the face detection worker and drawing it on the canvas.
  3. Mask detection preprocessing - preparing a canvas with an aligned face image and transferring it to the web worker.
  4. Mask detection post-processing - saving the results of mask detection.
Each model (face detection and mask detection) runs in a separate web worker, which loads the necessary libraries (OpenCV.js, TensorFlow.js) and models.
We have 3 web workers:
  • Face detection
  • Mask detection
  • A helper worker that transforms images using heavy OpenCV and TensorFlow.js methods, for example, to build a calibration matrix for multiple cameras.

Features and tricks that helped us in development and optimization

Web workers and how to work with them
A web worker is a way to run a script on a separate thread.
Web workers allow heavy processing to run in parallel with the main thread without blocking the UI. The main thread executes the orchestration logic; all heavy computation runs in the web workers. Web workers are supported in almost all browsers.
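A minimal skeleton of this pattern looks like the following (file names are illustrative):

// main.js: spawn the worker and exchange messages
const worker = new Worker(new URL('./detection.worker.js', import.meta.url));

worker.onmessage = (event) => {
  console.log('result from worker:', event.data);
};
worker.postMessage({ type: 'ping' });

// detection.worker.js: runs on its own thread and never blocks the UI
self.onmessage = (event) => {
  // ...heavy computation goes here...
  self.postMessage({ type: 'pong', payload: event.data });
};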

Features and limitations of web workers

Features:
  • Access only to a subset of JavaScript features
  • Access to the navigator object
  • Read-only access to the location object
  • Possibility to use XMLHttpRequest
  • Possibility to use setTimeout() / clearTimeout() and setInterval() / clearInterval()
  • Application Cache
  • Importing external scripts using importScripts()
  • Creating other web workers
Limitations:
  • No access to the DOM
  • No access to the window object
  • No access to the document object
  • No access to the parent object
Communication between the main thread and the web workers is handled with postMessage and the onmessage event handler.
If you look at the specification of the postMessage() method, you will notice that it accepts not only the data but also a second argument: a list of transferable objects.
worker.postMessage(message, [transfer]);
Let's see how using it will help us.
Transferable is an interface for objects that can be passed between different execution contexts, such as the main thread and web workers.
This interface is implemented in:
  • ImageBitmap
  • OffscreenCanvas
  • ArrayBuffer
  • MessagePort
If we want to pass 500 MB of data to a worker, we can do it without the second argument, but the difference shows up in transfer time and memory usage.
Sending the data without the transfer argument takes 149 ms and 1042 MB in Google Chrome, and even more in other browsers.
With the transfer argument it takes 1 ms and cuts memory consumption in half!
Since images are constantly passed from the main thread to the web workers, it is important to do this as quickly and as memory-efficiently as possible, and this feature helps us a lot.
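A small sketch of the difference (the worker, the buffer size, and the message shapes are illustrative):

const worker = new Worker(new URL('./detection.worker.js', import.meta.url)); // illustrative
const buffer = new ArrayBuffer(500 * 1024 * 1024); // ~500 MB, as in the measurement above

// Copying: the ArrayBuffer is structured-cloned, which costs time and doubles memory usage.
worker.postMessage({ type: 'frame', buffer });

// Transferring: ownership moves to the worker almost instantly;
// after this call buffer.byteLength is 0 in the main thread.
worker.postMessage({ type: 'frame', buffer }, [buffer]);

// The same applies to ImageBitmap, which is a convenient way to hand over video frames:
const video = document.querySelector('video');
createImageBitmap(video).then((bitmap) => {
  worker.postMessage({ type: 'detectFace', bitmap }, [bitmap]);
});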

Using OffscreenCanvas

A web worker does not have access to the DOM, so you cannot use a canvas directly; OffscreenCanvas comes to the rescue.
Advantages:
  • Fully detached from the DOM
  • It can be used both in the main thread and in web workers
  • It implements the Transferable interface and does not load the main thread if rendering runs in a web worker (a sketch follows below)
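Here is a minimal sketch of a worker that rasterizes a transferred frame with OffscreenCanvas, similar to what the detection workers below do (names and sizes are illustrative):

// detection.worker.js: a worker can create and draw on a canvas without touching the DOM
const offscreen = new OffscreenCanvas(640, 480);
const ctx = offscreen.getContext('2d');

self.onmessage = (event) => {
  const bitmap = event.data.bitmap; // an ImageBitmap transferred from the main thread
  ctx.drawImage(bitmap, 0, 0, offscreen.width, offscreen.height);
  const imageData = ctx.getImageData(0, 0, offscreen.width, offscreen.height);
  bitmap.close();
  // ...run preprocessing / inference on imageData...
  self.postMessage({ type: 'preprocessed', width: imageData.width, height: imageData.height });
};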

Advantages of using requestAnimationFrame

requestAnimationFrame allows you to grab images from the stream at maximum performance (up to 60 FPS); in practice it is limited only by the camera's capabilities, since not all cameras deliver video at that rate. A frame-grabbing loop built on it is sketched after the list below.
The main advantages are:
  • The browser optimizes requestAnimationFrame calls together with other animations and redraws.
  • Lower power consumption, which is very important for mobile devices.
  • Calls do not pile up in a queue when the browser is busy.
  • The minimum interval between calls is 16.67 ms (1000 ms / 60 fps = 16.67 ms).
  • The call frequency can be throttled manually.
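A minimal sketch of such a frame-grabbing loop might look like this (the worker, its message shape, and the busy flag are illustrative stand-ins for the orchestration described earlier):

// Hypothetical frame-grabbing loop driven by requestAnimationFrame.
const video = document.querySelector('video'); // element playing the WebRTC stream
const faceDetectionWorker = new Worker(new URL('./faceDetection.worker.js', import.meta.url)); // illustrative
let workerBusy = false;

faceDetectionWorker.onmessage = () => {
  workerBusy = false; // results handled elsewhere; the worker is free again
};

async function grabFrame() {
  if (!workerBusy) {
    workerBusy = true;
    const bitmap = await createImageBitmap(video);
    faceDetectionWorker.postMessage({ type: 'detectFace', bitmap }, [bitmap]);
  }
  requestAnimationFrame(grabFrame); // the browser caps this at the display refresh rate
}

requestAnimationFrame(grabFrame);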

Application metrics

At first, stats.js seemed like a good choice for displaying application metrics, but once the number of metrics grew past 20, the main thread of the application began to slow down because of how the browser renders them. Each metric draws its graph on a separate canvas (and data arrive there very frequently), so the browser re-renders at a high rate, which negatively affects the application. As a result, the measured metrics come out lower than they really are.
To avoid this problem, it is better to drop the eye candy and simply display the current value and the running average for the whole session as text. Updating a text value in the DOM is much faster than rendering graphics.
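For example, a plain-text metric can be as simple as this sketch (the element id and function name are hypothetical):

const metricNode = document.getElementById('face-detection-ms'); // hypothetical element
let total = 0;
let count = 0;

function reportFaceDetectionTime(ms) {
  total += ms;
  count += 1;
  // Plain text updates are far cheaper than redrawing a per-metric canvas graph.
  metricNode.textContent = `${ms.toFixed(1)} ms (avg ${(total / count).toFixed(1)} ms)`;
}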

Memory leak control

Quite often during development we ran into memory leaks on mobile devices, while on a desktop the application could run for a very long time.
Inside a web worker it is impossible to know how much memory it actually consumes (performance.memory does not work in web workers).
Because of this, we made the application able to run either through web workers or entirely in the main thread. By running all the detection models on the main thread, we can collect memory consumption metrics, see where the memory leak is, and fix it.
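A small sketch of the kind of check this enables; performance.memory is a non-standard, Chrome-only API, which is exactly why the models have to run in the main thread to measure it:

// Only meaningful when the detection models run in the main thread:
// performance.memory is non-standard, Chrome-only, and unavailable in web workers.
function logHeapUsage(label) {
  if (performance.memory) {
    const toMb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
    console.log(
      `${label}: used ${toMb(performance.memory.usedJSHeapSize)} MB ` +
      `of ${toMb(performance.memory.jsHeapSizeLimit)} MB limit`,
    );
  }
}

setInterval(() => logHeapUsage('detection loop'), 5000);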

    The main code of models in web workers

We have covered the main tricks used in the application; now let's look at the implementation.
For working with web workers, comlink-loader was used initially. It is a very handy library that lets you treat a worker as a class instance without touching the onmessage and postMessage methods directly, and control the asynchronous code with async/await. All this was convenient until the application was launched on a tablet (Samsung Galaxy Tab S7) and suddenly crashed after 2 minutes.
After analyzing all the code, no memory leaks were found, apart from the black box of this worker library: for some reason the loaded TensorFlow.js models were not released and were kept somewhere inside it.
We decided to switch to worker-loader, which lets you work with web workers from plain JS without extra layers. That solved the problem; the application now runs for days without crashing.
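For reference, creating a worker with worker-loader typically looks like the sketch below. The worker file name and the factory wrapper are assumptions based on the factory call in the next snippet, not code taken from the project.

// webpack's worker-loader (inline syntax) turns the file into a Worker constructor.
// The worker file name here is illustrative.
import FaceRgbDetectionWorker from 'worker-loader!./faceRgbDetection.worker';

export const FaceRgbDetectionWorkerFactory = {
  createWebWorker() {
    return new FaceRgbDetectionWorker();
  },
};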

    Face detection worker

    Create web worker
    this.faceDetectionWorker = workers.FaceRgbDetectionWorkerFactory.createWebWorker();
Create a handler in the main thread for messages coming from the worker
    this.faceDetectionWorker.onmessage = async (event) => {
     if (event.data.type === 'load') {
       this.faceDetectionWorker.postMessage({
         type: 'init',
         backend,
         streamSettings,
         faceDetectionSettings,
         imageRatio: this.imageRatio,
       });
     } else if (event.data.type === 'init') {
       this.isFaceWorkerInit = event.data.status;
    
       // Only when both workers are initialized do we start grabbing and processing frames
       if (this.isFaceWorkerInit && this.isMaskWorkerInit) {
         await this.grabFrame();
       }
     } else if (event.data.type === 'faceResults') {
       this.onFaceDetected(event);
     } else {
       throw new Error(`Type=${event.data.type} is not supported by RgbVideo for FaceRgbDetectionWorker`);
     }
    };
    
    Sending an image for face processing
    this.faceDetectionWorker.postMessage(
     {
       type: 'detectFace',
       originalImageToProcess: this.lastImage,
       lastIndex: lastItem!.index,
     },
     [this.lastImage], // transferable object
    );
    
    Face detection web worker code
The init method initializes all the models, libraries, and canvases that it needs to work with.
    export const init = async (data) => {
     const { backend, streamSettings, faceDetectionSettings, imageRatio } = data;
    
     flipHorizontal = streamSettings.flipHorizontal;
     faceMinWidth = faceDetectionSettings.faceMinWidth;
     faceMinWidthConversionFactor = faceDetectionSettings.faceMinWidthConversionFactor;
     predictionIOU = faceDetectionSettings.predictionIOU;
     recommendedLocation = faceDetectionSettings.useRecommendedLocation ? faceDetectionSettings.recommendedLocation : null;
     detectedFaceThumbnailSize = faceDetectionSettings.detectedFaceThumbnailSize;
     srcImageRatio = imageRatio;
     await tfc.setBackend(backend);
     await tfc.ready();
    
     const [blazeModel] = await Promise.all([
       blazeface.load({
         // The maximum number of faces returned by the model
         maxFaces: faceDetectionSettings.maxFaces,
         // The width of the input image
         inputWidth: faceDetectionSettings.faceDetectionImageMinWidth,
         // The height of the input image
         inputHeight: faceDetectionSettings.faceDetectionImageMinHeight,
         // The threshold for deciding whether boxes overlap too much
         iouThreshold: faceDetectionSettings.iouThreshold,
         // The threshold for deciding when to remove boxes based on score
         scoreThreshold: faceDetectionSettings.scoreThreshold,
       }),
       isOpenCvLoaded(),
     ]);
    
     faceDetection = new FaceDetection();
     originalImageToProcessCanvas = new OffscreenCanvas(srcImageRatio.videoWidth, srcImageRatio.videoHeight);
     originalImageToProcessCanvasCtx = originalImageToProcessCanvas.getContext('2d');
    
     resizedImageToProcessCanvas = new OffscreenCanvas(
       srcImageRatio.faceDetectionImageWidth,
       srcImageRatio.faceDetectionImageHeight,
     );
     resizedImageToProcessCanvasCtx = resizedImageToProcessCanvas.getContext('2d');
     return blazeModel;
    };
    
The isOpenCvLoaded method waits for OpenCV to load
    export const isOpenCvLoaded = () => {
     let timeoutId;
    
     const resolveOpenCvPromise = (resolve) => {
       if (timeoutId) {
         clearTimeout(timeoutId);
       }
    
       try {
         // eslint-disable-next-line no-undef
         if (cv && cv.Mat) {
           return resolve();
         } else {
           timeoutId = setTimeout(() => {
             resolveOpenCvPromise(resolve);
           }, OpenCvLoadedTimeoutInMs);
         }
       } catch {
         timeoutId = setTimeout(() => {
           resolveOpenCvPromise(resolve);
         }, OpenCvLoadedTimeoutInMs);
       }
     };
    
     return new Promise((resolve) => {
       resolveOpenCvPromise(resolve);
     });
    };
    
    Face detection method
    export const detectFace = async (data, faceModel) => {
     let { originalImageToProcess, lastIndex } = data;
     const facesThumbnailsImageData = [];
    
     // Resize original image to the recommended BlazeFace resolution
     resizedImageToProcessCanvasCtx.drawImage(
       originalImageToProcess,
       0,
       0,
       srcImageRatio.faceDetectionImageWidth,
       srcImageRatio.faceDetectionImageHeight,
     );
     // Getting resized image
     let resizedImageDataToProcess = resizedImageToProcessCanvasCtx.getImageData(
       0,
       0,
       srcImageRatio.faceDetectionImageWidth,
       srcImageRatio.faceDetectionImageHeight,
     );
     // Detect faces by BlazeFace
     let predictions = await faceModel.estimateFaces(
       // The image to classify. Can be a tensor, DOM element image, video, or canvas
       resizedImageDataToProcess,
       // Whether to return tensors as opposed to values
       returnTensors,
       // Whether to flip/mirror the facial keypoints horizontally. Should be true for videos that are flipped by default (e.g. webcams)
       flipHorizontal,
       // Whether to annotate bounding boxes with additional properties such as landmarks and probability. Pass in `false` for faster inference if annotations are not needed
       annotateBoxes,
     );
     // Normalize predictions
     predictions = faceDetection.normalizePredictions(
       predictions,
       returnTensors,
       annotateBoxes,
       srcImageRatio.faceDetectionImageRatio,
     );
  // Filter the initial predictions by the criterion that all landmarks must be inside the area of interest
     predictions = faceDetection.filterPredictionsByFullLandmarks(
       predictions,
       srcImageRatio.videoWidth,
       srcImageRatio.videoHeight,
     );
     // Filters predictions by min face width
     predictions = faceDetection.filterPredictionsByMinWidth(predictions, faceMinWidth, faceMinWidthConversionFactor);
     // Filters predictions by recommended location
     predictions = faceDetection.filterPredictionsByRecommendedLocation(predictions, predictionIOU, recommendedLocation);
    
  // If there are any predictions, start extracting face thumbnails at the configured size
     if (predictions && predictions.length > 0) {
       // Draw initial original image
       originalImageToProcessCanvasCtx.drawImage(originalImageToProcess, 0, 0);
       const originalImageDataToProcess = originalImageToProcessCanvasCtx.getImageData(
         0,
         0,
         originalImageToProcess.width,
         originalImageToProcess.height,
       );
    
       // eslint-disable-next-line no-undef
       let srcImageData = cv.matFromImageData(originalImageDataToProcess);
       try {
         for (let i = 0; i < predictions.length; i++) {
           const prediction = predictions[i];
           const facesOriginalLandmarks = JSON.parse(JSON.stringify(prediction.originalLandmarks));
    
           if (flipHorizontal) {
             for (let j = 0; j < facesOriginalLandmarks.length; j++) {
               facesOriginalLandmarks[j][0] = srcImageRatio.videoWidth - facesOriginalLandmarks[j][0];
             }
           }
    
           // eslint-disable-next-line no-undef
           let dstImageData = new cv.Mat();
           try {
             // eslint-disable-next-line no-undef
             let thumbnailSize = new cv.Size(detectedFaceThumbnailSize, detectedFaceThumbnailSize);
    
             let transformation = getOneToOneFaceTransformationByTarget(detectedFaceThumbnailSize);
    
             // eslint-disable-next-line no-undef
             let similarityTransformation = getSimilarityTransformation(facesOriginalLandmarks, transformation);
             // eslint-disable-next-line no-undef
             let similarityTransformationMatrix = cv.matFromArray(3, 3, cv.CV_64F, similarityTransformation.data);
    
             try {
               // eslint-disable-next-line no-undef
               cv.warpPerspective(
                 srcImageData,
                 dstImageData,
                 similarityTransformationMatrix,
                 thumbnailSize,
                 cv.INTER_LINEAR,
                 cv.BORDER_CONSTANT,
                 new cv.Scalar(127, 127, 127, 255),
               );
    
               facesThumbnailsImageData.push(
                 new ImageData(
                   new Uint8ClampedArray(dstImageData.data, dstImageData.cols, dstImageData.rows),
                   detectedFaceThumbnailSize,
                   detectedFaceThumbnailSize,
                 ),
               );
             } finally {
               similarityTransformationMatrix.delete();
               similarityTransformationMatrix = null;
             }
           } finally {
             dstImageData.delete();
             dstImageData = null;
           }
         }
       } finally {
         srcImageData.delete();
         srcImageData = null;
       }
     }
    
     return { resizedImageDataToProcess, predictions, facesThumbnailsImageData, lastIndex };
    };
    
The input is an image and an index that is later used to match the face with its mask detection result.
Since BlazeFace accepts images with a maximum size of 128 px, the camera image has to be downscaled.
Calling the faceModel.estimateFaces method runs the image through BlazeFace, and the predictions, containing the coordinates of the face box, nose, ears, eyes, and mouth area, are returned to the main thread.
Before working with them, you need to map the coordinates back to the original image, because we compressed it to 128 px.
Now you can use this data to decide whether the face is in the desired area and whether it meets the minimum face width required for subsequent identification.
The last part of the method cuts the face out of the image and aligns it using OpenCV methods so that the mask can be identified.
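For illustration, the coordinate restoration roughly boils down to the following sketch. It assumes returnTensors is false, so each BlazeFace prediction holds plain arrays, and that faceDetectionImageRatio is the original-to-resized scale factor; the project's actual normalizePredictions helper is not shown in the article.

// Simplified sketch: map a BlazeFace prediction from the downscaled detection image
// back to the original video resolution (ratio is assumed to be
// originalWidth / faceDetectionImageWidth).
function scalePredictionToOriginal(prediction, ratio) {
  return {
    ...prediction,
    topLeft: [prediction.topLeft[0] * ratio, prediction.topLeft[1] * ratio],
    bottomRight: [prediction.bottomRight[0] * ratio, prediction.bottomRight[1] * ratio],
    landmarks: prediction.landmarks.map(([x, y]) => [x * ratio, y * ratio]),
  };
}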

    Mask detection

Model initialization and the WebAssembly backend
    export const init = async (data) => {
     const { backend, streamSettings, maskDetectionsSettings, imageRatio } = data;
    
     flipHorizontal = streamSettings.flipHorizontal;
     detectedMaskThumbnailSize = maskDetectionsSettings.detectedMaskThumbnailSize;
     srcImageRatio = imageRatio;
     await tfc.setBackend(backend);
     await tfc.ready();
     const [maskModel] = await Promise.all([
       tfconv.loadGraphModel(
         `/rgb_mask_classification_first/MobileNetV${maskDetectionsSettings.mobileNetVersion}_${maskDetectionsSettings.mobileNetWeight}/${maskDetectionsSettings.mobileNetType}/model.json`,
       ),
     ]);
    
     detectedMaskThumbnailCanvas = new OffscreenCanvas(detectedMaskThumbnailSize, detectedMaskThumbnailSize);
     detectedMaskThumbnailCanvasCtx = detectedMaskThumbnailCanvas.getContext('2d');
     return maskModel;
    };
    
Mask detection requires the coordinates of the eyes, ears, nose, and mouth, along with the aligned face image, both of which are returned by the face detection worker.
    this.maskDetectionWorker.postMessage({
     type: 'detectMask',
     prediction: lastItem!.data.predictions[0],
     imageDataToProcess,
     lastIndex: lastItem!.index,
    });
    
    Detection method
    export const detectMask = async (data, maskModel) => {
     let { prediction, imageDataToProcess, lastIndex } = data;
     const masksScores = [];
     const maskLandmarks = JSON.parse(JSON.stringify(prediction.landmarks));
    
     if (flipHorizontal) {
       for (let j = 0; j < maskLandmarks.length; j++) {
         maskLandmarks[j][0] = srcImageRatio.faceDetectionImageWidth - maskLandmarks[j][0];
       }
     }
     // Draw thumbnail with mask
     detectedMaskThumbnailCanvasCtx.putImageData(imageDataToProcess, 0, 0);
     // Detect mask via NN
     let predictionTensor = tfc.tidy(() => {
       let maskDetectionSnapshotFromPixels = tfc.browser.fromPixels(detectedMaskThumbnailCanvas);
   let maskDetectionSnapshotFromPixelsFloat32 = tfc.cast(maskDetectionSnapshotFromPixels, 'float32');
   let expandedDims = maskDetectionSnapshotFromPixelsFloat32.expandDims(0);
    
       return maskModel.predict(expandedDims);
     });
     // Put mask detection result into the returned array
     try {
       masksScores.push(predictionTensor.dataSync()[0].toFixed(4));
     } finally {
       predictionTensor.dispose();
       predictionTensor = null;
     }
    
     return {
       masksScores,
       lastIndex,
     };
    };
    
The result of the neural network is the probability that a mask is present, which is returned from the worker. This makes it easy to raise or lower the mask detection threshold. By lastIndex, we can match the face with the presence of a mask and display information about a specific person on the screen.
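As an illustration, the main thread might consume this result roughly as sketched below; the 'maskResults' message type, the threshold value, and the updateFaceOverlay helper are assumptions rather than the project's code.

const MASK_SCORE_THRESHOLD = 0.5; // illustrative; tune for your model and tolerance for false positives

maskDetectionWorker.onmessage = (event) => {
  if (event.data.type !== 'maskResults') return; // message type name is an assumption

  const { masksScores, lastIndex } = event.data;
  const hasMask = Number(masksScores[0]) >= MASK_SCORE_THRESHOLD;

  // lastIndex ties this result back to the face detected earlier in the same frame,
  // so the overlay can be drawn next to the right person.
  updateFaceOverlay(lastIndex, { hasMask }); // hypothetical UI helper
};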

    Conclusion

I hope this article helps you learn about the possibilities of working with ML in the browser and the ways to optimize it. Most applications can be optimized using the tricks described above.

    Written by yantsishko | Skilled Front-End engineer with 7 years of experience in developing Web and SmartTV applications
    Published by HackerNoon on 2021/06/19