Building an accurate OCR receipt engine is an interesting engineering challenge because the problem cannot be solved deterministically. There are too many uncertainties in what a receipt scanner API receives (e.g., receipt format, language, country of origin, picture quality, and receipt angle).

However, as with most engineering problems, the path you take affects the quality of the final solution.

This article outlines what I have found to be one of the most effective ways to build a receipt scanner API that is accurate, automatic, real-time, multilingual, and adaptive.

I will explain this solution based on our experience in building TAGGUN, a Receipt and Invoice Scanning API powered by machine learning.

Firstly, why build a Receipt Scanner API?

There are many ways that businesses capitalize on digitizing receipt processing. To name a few:

Streamlining business processes by minimizing hardcopies
Digital marketing
Accounting and financial operations
IT optimizations

Setting the Scene for NLP in Receipt OCR

Tech giants like AWS, Microsoft, Google and IBM are actively competing with each other to offer the best machine learning and computer vision services on the market. So, instead of reinventing the wheel and training a Tesseract OCR model ourselves, we take advantage of this healthy competition and select the best computer vision OCR solutions to convert the image of the receipt into raw text.

So, the true crux of a modern OCR receipt engine is its ability to convert syntactic data into semantic information. This function belongs more to NLP (Natural Language Processing) than to OCR.

NLP is the field of Machine Learning that allows computers to digest and understand written and spoken texts (ref. 1).

NLP and OCR are therefore the foundations of TAGGUN’s engine. I will now describe how we have used them to build the scanner.

6 Essential Layers of the Receipt Scanner

1) Taking Advantage of Multiple OCR Providers

Based on our testing, Microsoft Cognitive and Google Vision are two of the best OCR providers on the market, and the latest version of Microsoft Cognitive actually outperforms Google Vision. So, we recently switched to Microsoft as our main computer vision provider, after 3 years with Google. Each has its own benefits and trade-offs, and we set up our engine to get the best result out of both providers.

After the file is processed by the OCR provider, the output is the classic computer vision OCR result: raw text with coordinates and bounding boxes.
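
To make that output concrete, here is a minimal sketch of how OCR words with bounding boxes might be represented and grouped back into text lines. The Word type, its field names, and the group_into_lines helper are hypothetical; they are not the response schema of Microsoft, Google, or any other provider.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float


def group_into_lines(words, tolerance=0.5):
    """Group words into lines by vertical position, then read each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if its top edge is within
        # `tolerance` word heights of the first word on that line.
        if lines and abs(lines[-1][0].y - word.y) <= tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Words arrive in arbitrary order, as they often do from an OCR response.
words = [
    Word("12.50", 200, 101, 40, 12),
    Word("TOTAL", 10, 100, 50, 12),
    Word("ACME", 10, 20, 45, 12),
    Word("STORE", 60, 21, 50, 12),
]
lines = group_into_lines(words)
```

Real receipts are often photographed at an angle, so a production engine would likely need to correct for skew before a simple vertical grouping like this works.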

2) Classifying the Data

To improve data extraction, contextual awareness should be built around the file and the request in order to predict the file's metadata.

For example, predicting the:

Type or format of the file (e.g., is it a receipt? An invoice? A screenshot? An email?)
Language
Geolocation (IP address or near param)
Range of the amount
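
As a rough illustration of the first item, the sketch below guesses the document type by counting keywords. This is a deliberately naive stand-in for a trained classifier; the keyword lists and the guess_document_type function are assumptions made up for this example.

```python
# Hypothetical keyword lists; a real engine would learn these signals from data.
DOCUMENT_KEYWORDS = {
    "invoice": ("invoice", "due date", "bill to"),
    "receipt": ("receipt", "cashier", "change", "thank you"),
}


def guess_document_type(text):
    """Return the type whose keywords appear most often, or 'unknown' if none match."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(keyword) for keyword in keywords)
        for doc_type, keywords in DOCUMENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


invoice_type = guess_document_type("INVOICE #42\nBill to: ACME\nDue date: 2021-01-31")
receipt_type = guess_document_type("Cashier: Ann\nCASH 20.00\nCHANGE 7.50\nThank you!")
```

The predicted type, together with language and location hints, can then steer which extraction rules run in the later layers.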

3) Named Entity Recognition

This stage detects and extracts the most basic information from the text.

For example:

The decimals and amounts
The locations (city, state, and country)
The dates
Other numbers
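
A minimal sketch of this stage, using regular expressions for two of the basic entities. It assumes amounts are written with a decimal point and two decimals, and that dates are in ISO format (YYYY-MM-DD); real receipts use many more formats, and the function names are only illustrative.

```python
import re
from datetime import datetime

# Amounts such as 12.50 or 1,234.50 (decimal point, exactly two decimals).
AMOUNT_RE = re.compile(r"(?<![\d.])\d+(?:,\d{3})*\.\d{2}(?![\d.])")
# ISO dates only; other date formats are out of scope for this sketch.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_amounts(text):
    """Return every amount found in the text as a float."""
    return [float(m.group().replace(",", "")) for m in AMOUNT_RE.finditer(text)]


def extract_dates(text):
    """Return every valid ISO date found in the text."""
    dates = []
    for m in DATE_RE.finditer(text):
        try:
            dates.append(datetime.strptime(m.group(), "%Y-%m-%d").date())
        except ValueError:
            pass  # looks like a date but is not one, e.g. 2021-13-45
    return dates


text = "ACME STORE\n2021-03-14\nSubtotal 1,234.50\nTax 123.45\nTOTAL 1,357.95"
amounts = extract_amounts(text)
dates = extract_dates(text)
```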

4) Specialised Entities Extraction

This phase of the Scan Receipt API is where the more complex information is identified and extracted.

A scenario to think about:

If there are five distinct amounts, how do you know which is the total amount? Or the tax amount?

As you can imagine, it becomes increasingly tricky as the content, format, and language of invoices and receipts become more variable.

Several different algorithms are run to determine the best result for each of the entities.
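
To show why this is harder than picking the largest number, here is a simple scoring heuristic for the total amount. It is only a sketch under assumed rules (the keywords and weights are invented for illustration) and not the algorithm TAGGUN runs.

```python
import re

AMOUNT_RE = re.compile(r"\d+\.\d{2}")
TAX_KEYWORDS = ("tax", "vat", "gst")


def score_candidate(line, amount, largest):
    """Score how likely it is that `amount` on `line` is the total."""
    lowered = line.lower()
    score = 0.0
    if "total" in lowered and "subtotal" not in lowered and "sub total" not in lowered:
        score += 2.0  # a "total" label is the strongest hint
    if any(keyword in lowered for keyword in TAX_KEYWORDS):
        score -= 1.0  # tax lines are unlikely to carry the total
    if amount == largest:
        score += 1.0  # the largest amount is often, but not always, the total
    return score


def pick_total(lines):
    """Return the best-scoring amount, or None if no amount is found."""
    candidates = [
        (line, float(m.group())) for line in lines for m in AMOUNT_RE.finditer(line)
    ]
    if not candidates:
        return None
    largest = max(amount for _, amount in candidates)
    _, best_amount = max(candidates, key=lambda c: score_candidate(c[0], c[1], largest))
    return best_amount


receipt_lines = [
    "Coffee 4.50",
    "Sandwich 8.00",
    "Subtotal 12.50",
    "GST 1.25",
    "TOTAL 13.75",
    "Cash 20.00",
    "Change 6.25",
]
total = pick_total(receipt_lines)
```

Note that the largest amount here is the cash tendered (20.00), so a "largest number wins" rule would pick the wrong value.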

For Merchant Verification (VAT ID) and ABN:

The official checksum method is followed to validate each number and improve accuracy.
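
For the ABN, the validation rule published by the Australian Business Register is a weighted sum modulo 89. Below is a minimal sketch of that check; the function name is my own, and VAT ID checks differ per country and are not shown here.

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)


def is_valid_abn(abn):
    """Check an 11-digit ABN using the weighted sum modulo 89."""
    cleaned = abn.replace(" ", "")
    if len(cleaned) != 11 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    digits = [int(ch) for ch in cleaned]
    digits[0] -= 1  # the rule subtracts 1 from the first digit
    weighted_sum = sum(d * w for d, w in zip(digits, ABN_WEIGHTS))
    return weighted_sum % 89 == 0


valid = is_valid_abn("51 824 753 556")
invalid = is_valid_abn("51 824 753 557")  # last digit changed
```

A check like this is useful for OCR in particular: if one digit is misread, the checksum almost always fails, so the engine can reject the candidate or try an alternative reading.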

For more complicated entities, such as Multi Tax Line items:

Pattern recognition in the text is required so that grouped information (such as tax rate, gross tax amount, and net tax amount) can be accurately extracted.
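
As an illustration, the sketch below matches tax lines laid out as "rate% net tax gross" and keeps only the rows whose figures add up. The layout, the regular expression, and the field names are assumptions for this example; real receipts vary widely.

```python
import re
from decimal import Decimal

# Assumed layout: "<label> <rate>% <net> <tax> <gross>", e.g. "A 7% 100.00 7.00 107.00".
TAX_LINE_RE = re.compile(
    r"(?P<rate>\d+(?:\.\d+)?)\s*%\s+"
    r"(?P<net>\d+\.\d{2})\s+(?P<tax>\d+\.\d{2})\s+(?P<gross>\d+\.\d{2})"
)


def extract_tax_lines(lines):
    """Return one dict per tax line whose net + tax equals gross."""
    rows = []
    for line in lines:
        match = TAX_LINE_RE.search(line)
        if not match:
            continue
        row = {name: Decimal(value) for name, value in match.groupdict().items()}
        # The arithmetic check filters out rows that were misread or mismatched.
        if row["net"] + row["tax"] == row["gross"]:
            rows.append(row)
    return rows


rows = extract_tax_lines([
    "A 7% 100.00 7.00 107.00",
    "B 19% 50.00 9.50 59.50",
    "TOTAL 166.50",
    "C 19% 10.00 1.00 12.00",  # inconsistent figures, so it is dropped
])
```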

Merchant Name Entity:

This can be trained (or corrected through feedback) for each account, so accuracy improves for each individual account over time.
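
One very simple way to picture per-account feedback is a lookup of past corrections, as sketched below. The MerchantFeedback class and its methods are hypothetical; an actual engine would more likely feed corrections into a trained model rather than an exact-match table.

```python
from collections import defaultdict


class MerchantFeedback:
    """Remembers merchant-name corrections separately for each account."""

    def __init__(self):
        self._corrections = defaultdict(dict)

    @staticmethod
    def _key(raw_name):
        # Normalize case and whitespace so small OCR differences still match.
        return " ".join(raw_name.lower().split())

    def record(self, account_id, raw_name, corrected_name):
        """Store the name a user supplied for a given OCR reading."""
        self._corrections[account_id][self._key(raw_name)] = corrected_name

    def resolve(self, account_id, raw_name):
        """Return the corrected name if this account has one, else the raw name."""
        return self._corrections.get(account_id, {}).get(self._key(raw_name), raw_name)


feedback = MerchantFeedback()
feedback.record("acct-1", "ACME  STORE #12", "Acme Store")
same_account = feedback.resolve("acct-1", "acme store #12")
other_account = feedback.resolve("acct-2", "acme store #12")
```

Keeping corrections per account means one customer's naming preferences do not leak into another customer's results.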

To summarise, some of the specialized entities that could be processed in this stage are:

Total Amount
Tax Amount
ABN
Multi Tax Line Items
Merchant Verification
Merchant Name
Receipt Number
Invoice Number
IBAN
Payment Type (e.g., credit card, cash, Visa, MC)
Fapiao Invoice Number and Code

5) Data Enrichment

Helpful public APIs are called as needed to acquire supplementary information.

Examples of these are:

VAT information (for European clients)
ABN information (for Australian clients)
Location information and verification (using Google Places)
Normalization of merchant names

6) JSON Format

The result is output in JSON format.

Because JSON is a common data format, developers can easily integrate the receipt OCR library into any software, in any programming language.
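
To give a feel for such a result, here is a sketch that serializes a few extracted entities to JSON. The field names and structure are invented for illustration and are not TAGGUN's actual response schema.

```python
import json

# Hypothetical result structure; each entity carries a value and a confidence score.
result = {
    "merchant_name": {"value": "Acme Store", "confidence": 0.92},
    "date": {"value": "2021-03-14", "confidence": 0.88},
    "total_amount": {"value": 13.75, "currency": "AUD", "confidence": 0.95},
    "tax_amount": {"value": 1.25, "currency": "AUD", "confidence": 0.90},
}

payload = json.dumps(result, indent=2)

# Any client, in any language, can parse the payload back into native types.
parsed = json.loads(payload)
```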

It is also recommended to build the engine so that it returns the result directly in the response to the API request. TAGGUN has this feature, so developers do not need to make additional requests or build additional webhook endpoints.
