Optical Character Recognition With Google Cloud Vision API

Written by evelyn | Published 2018/06/06
Tech Story Tags: machine-learning | google-cloud-vision-api | google-cloud-vision | google-cloud | google


Lately, I’ve been working on a delivery tracking application in React/React Native. One of its features allows a user to take a picture of a shipping label on their phone and translates that label into a tracking number we can work with on the back end. The goal of this feature is to minimize the time spent manually inputting tracking numbers, which range from 13 to 26 characters depending on the service. In my quest to implement this feature, I came across a technique known as optical character recognition and an incredible image recognition toolkit from Google. In this article, I’ll go over the basics of how this method works and how to integrate it into your application.

What is Optical Character Recognition?

Optical Character Recognition (OCR for short) is a technique that converts digital images of text into machine-readable data. Our eyes read text on a given medium by recognizing patterns of light and dark, translating those patterns into characters and words, and then attaching meaning to them. OCR attempts to mimic the way our visual system operates, and its detection algorithms are usually powered by neural networks.

There are two methods to perform OCR: matrix matching and feature detection. Matrix matching is the simpler of the two; it takes an image and compares it against an existing library of character matrices, or templates, to generate a match. Feature detection is more complex: it looks for general features like diagonal lines, curvatures, and intersections, and compares them to other features on the image within a certain distance.
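
To make matrix matching concrete, here’s a toy sketch in JavaScript. The 3×3 glyph grids and the two-entry template library are invented purely for illustration; real engines work on far richer data, but the core idea of scoring a glyph against stored templates is the same:

// Toy matrix matching: compare a binarized glyph against a tiny
// template library and pick the character with the most matching pixels.
const templates = {
  I: [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
  ],
  L: [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
  ],
};

function matchCharacter(glyph) {
  let best = { char: null, score: -1 };
  for (const [char, template] of Object.entries(templates)) {
    // Score = number of pixels that agree between glyph and template.
    let score = 0;
    for (let row = 0; row < template.length; row++) {
      for (let col = 0; col < template[row].length; col++) {
        if (glyph[row][col] === template[row][col]) score++;
      }
    }
    if (score > best.score) best = { char, score };
  }
  return best.char;
}

// matchCharacter([[0, 1, 0], [0, 1, 0], [0, 1, 0]]) => 'I'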

Enter Google Cloud Vision API

Google Cloud Vision API enables developers to understand the content of an image by encapsulating powerful machine learning models in an easy-to-use REST API. It quickly classifies images into thousands of categories, detects individual objects and faces within images, and finds and reads printed words contained within images.

The Google Cloud Vision API takes incredibly complex machine learning models centered around image recognition and wraps them in a simple REST interface. It encompasses a broad selection of image recognition tools, and you can use it for real-world applications like categorizing images or moderating offensive content. For the purposes of this article, I’ll focus on the OCR module, which analyzes an image for text and subsequently parses that text into data we can use in our applications.

How it works

The tool first performs a layout analysis on the image to segment the location of the text. Once the general location is detected, the OCR module performs a text recognition analysis on that region to generate the text. Finally, errors are corrected in a post-processing step by feeding the recognized text through a language model or dictionary.
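
As a rough illustration of that last post-processing step, a dictionary-based correction might look something like the sketch below. The four-word dictionary and the example word are made up for the demo, and this is not Cloud Vision’s actual implementation; it just shows the idea of snapping noisy output to known vocabulary:

// Toy post-processing: map each recognized word to the closest
// dictionary entry using edit distance (Levenshtein).
const dictionary = ['SHIPPING', 'TRACKING', 'EXPRESS', 'PRIORITY'];

function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // deletion
        dp[i][j - 1] + 1,                                   // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

function correctWord(word) {
  // Pick the dictionary word with the smallest edit distance.
  return dictionary.reduce((best, candidate) =>
    editDistance(word, candidate) < editDistance(word, best) ? candidate : best
  );
}

// correctWord('TRACK1NG') => 'TRACKING'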

All of this is done through a convolutional neural network, in which each neuron is connected only to a subset of neurons in the previous layer. Convolutional neural networks are a subset of neural networks intended to imitate the hierarchical way our visual cortex identifies objects.

For example, when we look at a picture of a dog, we first recognize a feature such as an eye. We then recognize the nose, the mouth, and the fur, and eventually combine these features to form a mental model of a dog. This happens almost instantaneously in our visual system, so it’s difficult to separate the steps in our head as we do it. The key idea is that what matters isn’t the absolute location of a feature in the image, but rather its location relative to other features within a spatial proximity.
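
To see what “connected only to a subset of neurons” means in practice, here’s a bare-bones 2D convolution over a grayscale image. It’s an illustrative sketch, not Cloud Vision’s model: each output value depends only on a small local window of the input, which is exactly the local connectivity a convolutional layer relies on:

// Minimal 2D convolution: each output pixel is computed from just a
// k-by-k neighborhood of the input, not from the whole image.
function convolve2d(image, kernel) {
  const k = kernel.length; // assume a square kernel, e.g. 3x3
  const out = [];
  for (let y = 0; y <= image.length - k; y++) {
    const row = [];
    for (let x = 0; x <= image[0].length - k; x++) {
      let sum = 0;
      for (let ky = 0; ky < k; ky++) {
        for (let kx = 0; kx < k; kx++) {
          sum += image[y + ky][x + kx] * kernel[ky][kx];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// A vertical-edge detector: it responds strongly where dark meets light,
// the kind of low-level feature early CNN layers learn to pick out.
const verticalEdgeKernel = [
  [1, 0, -1],
  [1, 0, -1],
  [1, 0, -1],
];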

If you want to learn more about how convolutional neural networks operate, these two articles do a fantastic job of breaking it down:

https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050

https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Integrating with the platform

As you might expect, the algorithms behind these models are highly convoluted and difficult to understand. Luckily, all of that has been abstracted away by Google and packaged in an easy-to-use format for our convenience.

First, you need to sign up for their platform and get a developer API key. Afterward, it’s as simple as sending a POST request with the URL of the image you want to decode. Here’s a sample API call using axios:

Request body:

let body = {
  "requests": [{
    "image": {
      "source": { "imageUri": "IMAGE_URL_HERE" } // publicly accessible URL of the image to scan
    },
    "features": [{ "type": "TEXT_DETECTION" }]   // ask the API specifically for OCR
  }]
};

API call:

axios
  .post('https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY_HERE', body)
  .then((response) => console.log(response));

The response returns a JSON object with a variety of data points including the detected text, language, location of the text, etc.

Sample response:

{"responses": [{"textAnnotations": [{"locale": "en","description": "I HAVE NO IDEA\nWHAT TM DOING\n","boundingPoly": {"vertices": [{"x": 11,"y": 2},{"x": 257,"y": 2},{"x": 257,"y": 102},{"x": 11,"y": 102}]}},...

That’s it!

You now know enough about how optical character recognition works to understand its use cases. The OCR module from Google is extremely simple to set up, and the possibilities are endless. It’s built on the same image recognition technology that powers Google’s own image search, so a huge amount of capability is available through a single API call. Give it a try and start building!

https://cloud.google.com/vision/docs/ocr

If you enjoyed this article, please click on the 👏🏻 and share to help others find it. Thanks for reading!


Written by evelyn | Software Engineer @ Square