We Built a Face and Mask Detection Web App for Google Chrome

Written by yantsishko | Published 2021/03/07
Tech Story Tags: javascript | tensorflowjs | face-mask-detection | computer-vision | face-recognition | mask-recognition | machine-learning | image-detection


Introduction

The main goal is to detect faces and masks in the browser instead of relying on a Python implementation on the back end. The application is a simple web app / SPA that contains JS code only and can send data to a back end for further processing, but the initial face and mask detection happens entirely on the browser side, so no Python back-end implementation is needed.
At the moment, the app works only in the Google Chrome browser.
In future articles, I will describe more technical details and the implementation behind our investigation results.
There are two approaches to doing this in the browser:
  1. TensorFlowJS
  2. ONNXJS
Both runtimes support WASM, WebGL, and CPU backends. But we will compare only WASM and WebGL, because CPU performance is very low and it can't be used in production. 
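As a quick illustration, here is a minimal sketch of how the backend can be selected with TensorFlow.js (assuming the @tensorflow/tfjs and @tensorflow/tfjs-backend-wasm packages are available; not the exact code of our app):

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm'; // registers the 'wasm' backend

async function initBackend(name) { // 'wasm' or 'webgl'
  await tf.setBackend(name); // request the backend
  await tf.ready();          // wait until it is fully initialized
  console.log('Active backend:', tf.getBackend());
}

initBackend('wasm');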
View the demo here

TensorFlow.js

On its official website, TensorFlow.js offers pre-trained, ready-to-use models that include the appropriate JS post-processing. For real-time face detection, the recommended model is BlazeFace, for which an online demo is available.
More info about BlazeFace can be found here.
We created a demo to experiment with the runtime and model and identify any existing issues. The links to run the app with different backends can be found below:
WASM (face detection image size: 160x120px; mask detection image size: 64x64px)
WebGL (face detection image size: 160x120px; mask detection image size: 64x64px)


Performance Results: Getting Frames

We can get frames via the appropriate HTML APIs, but grabbing frames consumes time as well, so we need to understand how much time we spend on it. The timing metrics are shown below.
Our goal is to detect faces as fast as possible, so we should grab and process every frame of the live stream. For this, we can use requestAnimationFrame, which is called roughly every 16.6ms, i.e. once per frame.
The grabFrame() method of the ImageCapture API lets us take a snapshot of the live video in a MediaStreamTrack; it returns a promise that resolves with an ImageBitmap containing the snapshot.
Single mode shows the maximum rate at which we can grab frames; sync mode shows how often we can grab a frame together with running face detection.
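To illustrate the loop described above, here is a minimal sketch (the processFrame() helper is hypothetical and stands in for the detection pipeline; ImageCapture.grabFrame() is currently available in Chromium-based browsers):

async function startCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = stream.getVideoTracks();
  const imageCapture = new ImageCapture(track);

  async function loop() {
    try {
      const bitmap = await imageCapture.grabFrame(); // ImageBitmap snapshot of the live track
      await processFrame(bitmap);                    // hypothetical: face + mask detection
    } catch (e) {
      console.warn('grabFrame failed:', e);
    }
    requestAnimationFrame(loop);                     // schedule the next grab (~16.6ms per frame)
  }
  requestAnimationFrame(loop);
}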

Face Detection + Mask Detection Results:

Color scheme: < 6 fps red, 7-12 fps orange, 13-18 fps yellow, 19+ fps green
Results:
We excluded timing metrics gathered during application start. It is obvious that during startup and the first runs of the model the app consumes more resources/time, so it is only worth collecting performance metrics once the app is in warm mode. Warm mode in our case simply means letting the app work for 5-10 seconds and then collecting the metrics.
Possible inaccuracies in the gathered timing metrics are up to 50ms.
The BlazeFace model was developed especially for mobile devices and achieves good performance with TFLite inference on Android and iOS platforms (~50-200 FPS).
More info is here.
Meanwhile, the dataset for retraining the model from scratch is not available (the Google Research team did not share it).
So, there are two model types:
  • Front: Input size 128 x 128px, faster but lower accuracy.
  • Back: Input size 256 x 256px, higher accuracy but slower.
This means that preparing a detector model with three classes (clear face, face with mask, background) can be a time-consuming task.

Image Sizes

The original image can be of any size, depending on the camera settings and business needs. But when we process frames for face and mask detection, the original frame is resized to the appropriate size for each model.
The image used by BlazeFace for face detection has a size of 128 x 128px; the original frame is resized to this size while keeping its proportions. The image used for mask detection has a size of 64 x 64px.
We chose the minimal resolutions for both images, taking into account performance requirements and results: such minimal images demonstrated the best performance on PC and mobile devices. We use 64 x 64px thumbnails for mask detection because 32 x 32px is not enough to detect a mask with sufficient accuracy.
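A minimal sketch of this resize step with TensorFlow.js might look like the following (for brevity it ignores the aspect-ratio handling mentioned above; the input is assumed to be an ImageBitmap obtained from grabFrame()):

import * as tf from '@tensorflow/tfjs';

function toModelInput(bitmap, targetSize) { // 128 for BlazeFace, 64 for mask detection
  return tf.tidy(() => {
    const frame = tf.browser.fromPixels(bitmap);                   // [height, width, 3]
    const resized = tf.image.resizeBilinear(frame, [targetSize, targetSize]);
    return resized.expandDims(0);                                  // [1, targetSize, targetSize, 3]
  });
}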

How to Get the Best Images to be Analyzed by the Application

With TensorFlow.js we have the following options to get the best images for further use by the application:
  • BlazeFace allows you to configure the confidence score for detected faces. We set this score to a high value (>0.9) to avoid uncertain positives.
  • BlazeFace allows you to configure the maximum number of faces returned by the detection method. We can set this option to 1 to return only one face; or we can set it to 2, for example, and in case of 2 detected faces simply show a message that only one person/face may be in front of the camera.
  • BlazeFace returns a bounding box and landmarks. Together with the all-in-one device calibration, this gives us everything we need to apply the appropriate checks to these results and make sure the detected faces are of good quality.
Such checks can be the following:
  1. The bounding box should be inside the calibrated box by at least X%
  2. All N, or M out of N, landmarks should be inside the calibrated box
  3. Width and height of the bounding box should be above the appropriate thresholds
  4. Checks with bounding boxes width/height and landmarks are extremely fast on the client side with JS
Images selected according to the rules above are sent to the next stage of the pipeline: mask detection (a sketch of such checks is shown below). This logic provides faster face detection results up until the moment we are sure the detected face is of good quality.
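A minimal sketch of these options and checks with the @tensorflow-models/blazeface package could look like this (the calibBox, MIN_W, and MIN_H values are hypothetical placeholders for the calibration data, not our exact implementation):

import * as blazeface from '@tensorflow-models/blazeface';

const model = await blazeface.load({
  maxFaces: 1,         // return one face only
  scoreThreshold: 0.9, // drop uncertain detections
});

async function detectGoodFace(input, calibBox, MIN_W = 80, MIN_H = 80) {
  const [face] = await model.estimateFaces(input, false /* returnTensors */);
  if (!face) return null;

  const [x1, y1] = face.topLeft;
  const [x2, y2] = face.bottomRight;
  const bigEnough = x2 - x1 >= MIN_W && y2 - y1 >= MIN_H;
  const inCalibBox = face.landmarks.every(([lx, ly]) =>
    lx >= calibBox.x && lx <= calibBox.x + calibBox.width &&
    ly >= calibBox.y && ly <= calibBox.y + calibBox.height);

  return bigEnough && inCalibBox ? face : null; // only good-quality faces go to mask detection
}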

Mask Detection Model Sizes

For mask detection we trained MobileNetV2 and MobileNetV3 models with different types and multipliers.
We tend to use light or ultra-light models with TensorFlowJS in the browser (<3Mb). The main reason is that the WASM backend is faster with such models, as stated in the official documentation and confirmed by our performance tests.
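As an illustration, loading such a converted MobileNet-based classifier and scoring a 64x64 thumbnail might look like this (the model URL, output classes, and 0..1 normalization are assumptions, not the exact production setup):

import * as tf from '@tensorflow/tfjs';

const MASK_MODEL_URL = '/models/mask_mobilenet_v2_035/model.json'; // hypothetical path
const maskModel = await tf.loadGraphModel(MASK_MODEL_URL);

function classifyThumbnail(thumbnail) { // [1, 64, 64, 3] tensor
  return tf.tidy(() => {
    const normalized = thumbnail.div(255);        // assumes the model expects 0..1 inputs
    const scores = maskModel.predict(normalized); // e.g. [face, face-with-mask, background]
    return scores.dataSync();
  });
}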

Additional Resources

  • WASM JS back-end: ~60Kb
  • OpenCVJS: 1.6Mb
  • Our SPA itself (+Tensorflow.js): up to 500Kb
  • BlazeFace model is ultra-light: 466Kb
For the web app, the time to interactive (TTI) with ~3.5Mb of JS plus model binary + JSON of 1.5Mb to 6Mb (everything served from the internal network) will be >10s in cold mode; in warm mode the expected TTI is 4-5s.
If web worker(s) are used and OpenCV.js is loaded in the worker(s) only, the size of the main app is significantly reduced to 800-900Kb of JS; TTI will be 7-8s in cold mode and <5s in warm mode.
Possible approaches to running neural network models in the browser:
One-thread implementation:
This is the default implementation for browsers: we run both the face and mask detection models in one thread. The crucial point is to provide good performance for both models running without any issues in a single thread. This is the current approach, and it was used to take the performance metrics above.
This approach has some limitations. If we want to add more models to the pipeline, they will run in the same thread asynchronously but in a sequential flow, which will decrease the overall frame-processing performance.
Web worker(s) to run models in different contexts and gain parallelism in the browser:
The main JS thread runs the BlazeFace model for face detection, while mask detection runs in a separate thread via a web worker. With such an implementation we can separate the two models and introduce parallelism for processing in the browser, which has a positive influence on the general UX of the application. The web workers load the TFJS and OpenCV libraries; the main JS thread loads TF.js only.
This means the main app starts much faster, which significantly reduces the TTI in the browser. Face detection runs more often, which increases the FPS of the face detection process; as a result, mask detection runs more often as well and its FPS also increases. The expected improvement is up to ~20%, i.e. the FPS and millisecond values given above can be improved by this amount.
With this approach of running different models in separate contexts via web workers, we can run more neural network models in the browser with good performance. The main limiting factor is whether the all-in-one device hardware can support such a load.
We have implemented this approach in the app and it works (a sketch is shown below). However, we had a technical issue with the postMessage callback when a web worker sends a message back to the main thread: for some reason it introduces an additional delay (up to 200ms on mobile devices), which kills the performance improvement achieved with parallelism. This issue applies only to vanilla JS; after reimplementation in React it was fixed.
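A minimal sketch of this split might look as follows (file names, message shapes, and the runMaskModel()/updateUi() helpers are our own illustration, not the exact implementation):

// main.js – face detection stays in the main thread
const maskWorker = new Worker('mask-worker.js');
maskWorker.onmessage = ({ data }) => {
  updateUi(data.hasMask); // hypothetical UI helper
};

function sendThumbnail(bitmap) {
  // ImageBitmap is transferable, so it is moved to the worker without copying
  maskWorker.postMessage({ bitmap }, [bitmap]);
}

// mask-worker.js – mask detection runs in its own context
importScripts('https://cdn.jsdelivr.net/npm/@tensorflow/tfjs');
onmessage = async ({ data }) => {
  const started = performance.now();
  const hasMask = await runMaskModel(data.bitmap); // hypothetical: run the mask classifier
  postMessage({ hasMask, tookMs: performance.now() - started });
};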

Web Worker Investigation Results

Our idea is to use web workers (WW) to run every model in a separate worker/thread/context and achieve model-processing parallelism in the browser. But we noticed that the callback function in a WW is invoked with some delay. Below you can find the investigation measurements and conclusions about which factors it depends on.
Test params:
mobileNetVersion=V3
mobileNetVersionMultiplier = 0.75
mobileNetVersionType = float16
thumbnailSize=32px
backend = wasm
In the app we run BlazeFace in the main thread and the mask detection model in a web worker.
Here are some measurements for WW processing time on different devices:
The results above demonstrate that, for some reason, the time it takes to send a callback from the Web Worker back to the main thread depends on whether the model is run, or the TensorFlow method browser.fromPixels is used, in that Web Worker.
If the model is run, the callback on macOS takes ~27ms to arrive; if the model is not run, ~5ms. This 22ms difference on macOS can translate into 100-300ms delays on weaker devices, and it affects the general app performance when Web Workers are used.
We don't yet understand why this happens, but we have confirmed that it does.
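For reference, the round-trip overhead can be approximated with a trivial echo worker like the one below (our own measurement sketch; the worker simply posts back whatever it receives):

const worker = new Worker('echo-worker.js'); // echo-worker.js: onmessage = (e) => postMessage(e.data);
const sentAt = performance.now();
worker.onmessage = () => {
  console.log('postMessage round trip:', (performance.now() - sentAt).toFixed(1), 'ms');
};
worker.postMessage('ping');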

How to Increase Accuracy, Decrease False Positives, Run Access Control API for Stable Faces Only

We need more "context" to make a reliable decision about whether a face is present and whether the mask has been removed. This context consists of the previously processed frames and their state (face or not, mask or not). We implemented a "flying" segment to manage this additional context. The length of the segment is configurable and depends on the current FPS, but in any case it should not exceed 200-300ms, so as not to introduce visible delays.
Below is the appropriate scheme which describes the idea:
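As an illustration, a minimal sketch of such a segment might look like the one below (the field names and the majority-vote rule are our assumptions, not the exact implementation):

class FlyingSegment {
  constructor(maxAgeMs = 300) { // keep at most 200-300ms of history
    this.maxAgeMs = maxAgeMs;
    this.frames = [];
  }

  push(state) { // state: { faceDetected: boolean, maskDetected: boolean }
    const now = performance.now();
    this.frames.push({ ...state, timestamp: now });
    this.frames = this.frames.filter(f => now - f.timestamp <= this.maxAgeMs);
  }

  isStable(key) { // key: 'faceDetected' or 'maskDetected'
    if (this.frames.length === 0) return false;
    const positives = this.frames.filter(f => f[key]).length;
    return positives / this.frames.length > 0.5; // simple majority vote over the segment
  }
}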

Conclusions:

The stronger the device, the better the performance results, as our metrics demonstrate:
For PC we could get the following metrics:
  • Face detection only: >30fps
  • Face detection + mask detection: up to 45fps
For mobile devices we could get the following metrics:
  • Face detection only: from 2.5fps to 12-15fps, depending on the device
  • Face detection + mask detection: from 2fps to 12fps, depending on the device
  1. Note that video playback is always real-time and its frame rate depends on the device itself, but it is always at least 30fps.
  2. For the mask detection model, in most cases the best results are demonstrated by the MobileNetV2 0.35 model; the model type does not visibly influence the performance metrics.
  3. The size of the mask detection model depends on its type. Since the type does not influence performance metrics, we recommend using uint16 or float16 models to get a smaller model in the browser and, as a result, a faster TTI.
  4. The WASM runtime demonstrates better results than WebGL for the BlazeFace model. This correlates with the official TensorFlowJS documentation regarding the performance of small models (<3Mb):

    For most models, the WebGL backend will still outperform the WASM backend, however WASM can be faster for ultra-lite models (less than 3MB and 60M multiply-adds). In this scenario, the benefits of GPU parallelization are outweighed by the fixed overhead costs of executing WebGL shaders.
  5. TTI is always better for WASM than for WebGL with the same model configuration.
  6. The performance of the TensorFlowJS runtime and models can be increased by using the WASM SIMD extension, which allows multiple floating point operations to be vectorized and executed in parallel. Preliminary tests show that enabling these extensions brings a 2-3x speedup over plain WASM today.

    More info is here. This is still an experimental feature which is not delivered with the runtime by default. After its release it will be available in the runtime by default or via additional configuration options.
  7. On the client side the expected size of the app is ~3.5Mb of encoded JS with all dependencies, 466Kb for the BlazeFace model, and from 1.1Mb to 5.6Mb for the mask detection model. The expected TTI for the application is >10sec for mobile devices in cold mode and ~5sec in warm mode.
  8. When web workers are used, OpenCV.js can be loaded into the workers only, which significantly reduces the TTI of the main app.
  9. During our testing we noticed that the devices became very hot. Since 24/7 availability of the solution is expected, this fact can have a big influence on the final decision about which devices to use; take it into account.
  10. We also noticed that devices perform better when they are fully charged and connected to a charger while working. Charging over USB is not enough to keep the battery at the same level over the long run, so a mains charger should be used. This should also be taken into account.
  11. In our view, even 4-5 fps is enough for the current solution to provide a good user experience. But it is important not to show bounding boxes or landmarks on the screen and instead to operate with the whole video/screen area:
    Highlight the screen when a person is captured.
    Inform about the mask / no-mask state via text messages or another kind of screen highlighting.
  12. With such user interaction, the delay between real time and our metadata on the screen will be 200-300ms. Users of the system will perceive such values as non-critical delays.

Written by yantsishko | Skilled Front-End engineer with 7 years of experience in developing Web and SmartTV applications
Published by HackerNoon on 2021/03/07