Machine Learning is the Wrong Way to Extract Data From Most Documents

Written by sensible | Published 2022/07/26

TL;DR: In the late 1960s, the first OCR (optical character recognition) techniques turned scanned documents into raw text. Google, Microsoft, and Amazon provide high-quality OCR as part of their cloud services offerings. But documents remain underused in software toolchains, and valuable data languish in PDFs. The challenge has shifted from identifying text in documents to turning them into structured data suitable for direct consumption by software-based workflows or direct storage into a system of record. The best way to turn the vast majority of documents into structured data is to use a next generation of powerful, flexible templates that find data in a document much as a person would.

Documents have spent decades stubbornly guarding their contents against software. In the late 1960s, the first OCR (optical character recognition) techniques turned scanned documents into raw text. By indexing and searching the text from these digitized documents, software sped up formerly laborious legal discovery and research projects.

Today, Google, Microsoft, and Amazon provide high-quality OCR as part of their cloud services offerings. But documents remain underused in software toolchains, and valuable data languish in trillions of PDFs. The challenge has shifted from identifying text in documents to turning them into structured data suitable for direct consumption by software-based workflows or direct storage into a system of record.

The prevailing assumption is that machine learning, often embellished as “AI”, is the best way to achieve this, superseding outdated and brittle template-based techniques. This assumption is misguided. The best way to turn the vast majority of documents into structured data is to use the next generation of powerful, flexible templates that find data in a document much as a person would.

The Promises and Failures of Machine Learning

The promise of machine learning is that you can train a model once on a large corpus of representative documents and then smoothly generalize to out-of-sample document layouts without retraining. For example, you want to train an ML model on home insurance policies from companies A, B, and C, and then extract the same data from similar documents issued by company Z. This is very difficult to achieve in practice for three reasons:

Document Extraction is an Unusually Granular Task for Machine Learning

Your goal is often to extract dozens or hundreds of individual data elements from each document. A model at the document level of granularity will frequently miss some of these values, and those errors are quite difficult to detect. Once your model attempts to extract those dozens or hundreds of data elements from out-of-sample document types, you get an explosion of opportunities for generalization failure.
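
To put a rough number on that explosion, here is a back-of-the-envelope sketch. It assumes every field is extracted independently with the same accuracy, which is a simplification (real errors are often correlated), and the 99% figure is purely illustrative rather than a measured result.

```python
def prob_all_fields_correct(per_field_accuracy: float, num_fields: int) -> float:
    """Chance that every field in one document is right.

    Assumes field errors are independent and equally likely,
    which is an idealization for illustration only.
    """
    return per_field_accuracy ** num_fields


# Illustrative accuracy of 99% per field, for documents of growing size.
for num_fields in (10, 50, 100):
    chance = prob_all_fields_correct(0.99, num_fields)
    print(f"{num_fields} fields: {chance:.3f} chance of a fully correct document")
```

Under these assumptions, a model that is right 99% of the time per field returns a fully correct 100-field document only about 37% of the time.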

Data Elements in Documents Typically Have a Hierarchical Relationship to One Another

While some simple documents might have a flat key/value ontology, most will have a substructure: think of a list of deficiencies in a home inspection report or the set of transactions in a bank statement. In some cases you’ll even encounter complex nested substructures: think of a list of insurance policies, each with a claims history. You either need your machine learning model to infer these hierarchies, or you need to manually parameterize the model with these hierarchies and the overall desired ontology before training.
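
As an illustration of what such a nested target looks like, here is a minimal sketch of the "list of insurance policies, each with a claims history" case. The class and field names are hypothetical, chosen only for this example.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Claim:
    date: str
    amount: float


@dataclass
class Policy:
    policy_number: str
    # Each policy carries its own list of claims (second level of nesting).
    claims: List[Claim] = field(default_factory=list)


@dataclass
class PolicyList:
    insured_name: str
    # The document as a whole carries a list of policies (first level).
    policies: List[Policy] = field(default_factory=list)


doc = PolicyList(
    insured_name="Example Co",
    policies=[
        Policy("P-001", [Claim("2021-03-04", 1200.0)]),
        Policy("P-002"),  # a policy with no claims
    ],
)

# asdict() recursively converts the nested dataclasses into dicts and lists.
as_dict = asdict(doc)
print(as_dict)
```

A flat key/value model has no natural place for the second policy's empty claims list or for a third claim on the first policy, which is why the hierarchy has to be inferred or declared somewhere.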

A "Document" is a Vague Target for an ML Project

A document is anything that fits on one or more sheets of paper and contains data! Documents are really just bags of diverse and arbitrary data representations. Tables, labels, free text, sections, images, headers and footers: you name it and a document can use it to encode data. There's no guarantee that two documents, even with the same semantics, will use the same representational tools.

It's no surprise that ML-based document parsing projects can take months, require tons of data up front, lead to unimpressive results, and in general be "grueling" (to directly quote a participant in one such project with a leading vendor in the space).

The Challenge With Templates

These issues strongly suggest that the appropriate angle of attack for structuring documents is at the data element level rather than the whole-document level. In other words, we need to extract data from tables, labels, and free text; not from a holistic “document”. And at the data element level, we need powerful tools to express the relationship between the universe of representational modes found in documents and the data structures useful to software.

So let's get back to templates.

Historically, templates have had an impoverished means of expressing that mapping between representational mode and data structure. For example, they might instruct: go to page 3 and return any text within these box coordinates. This breaks down immediately for any number of reasons, including if:

  • a scan is tilted,
  • there's a cover page, or
  • the document author added an additional section before the target data.

None of these minor changes to the document layout would faze a human reader.
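
To make the failure concrete, here is a toy sketch of a fixed-coordinate template. It assumes OCR output has been reduced to words with a page number and an (x, y) position; the `Word` type, the coordinates, and the sample text are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    page: int
    x: float
    y: float
    text: str


def extract_from_box(words: List[Word], page: int,
                     x0: float, y0: float, x1: float, y1: float) -> str:
    """Return the text of all words inside a fixed box on a fixed page."""
    hits = [w for w in words
            if w.page == page and x0 <= w.x <= x1 and y0 <= w.y <= y1]
    hits.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
    return " ".join(w.text for w in hits)


# The layout the template was written against: the value sits on page 3.
original = [Word(3, 100, 200, "Premium:"), Word(3, 180, 200, "$1,250.00")]

# The same document with a cover page added: every page number shifts by one.
with_cover = [Word(w.page + 1, w.x, w.y, w.text) for w in original]

box = dict(page=3, x0=150, y0=190, x1=300, y1=210)
print(repr(extract_from_box(original, **box)))
print(repr(extract_from_box(with_cover, **box)))
```

The first call returns the premium; the second returns an empty string, even though nothing about the data itself changed.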

A Query Language for Documents

For software to successfully structure complex documents, you want a solution that sidesteps the battle of months-long ML projects versus brittle templates. Instead, let’s build a document-specific query language that (when appropriate) embeds ML at the data element, rather than document, level.

First, you want primitives (i.e., instructions) in the language that describe representational modes (like a label/value pair or repeating subsections) and stay resilient to typical layout variations. For example, if you say:

“Find a row starting with this word and grab the lowest dollar amount from it”

You want “row” recognition that’s resilient to whitespace variation, vertical jitter, cover pages, and document skew, and you want powerful type detection and filtering.
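
Here is a minimal sketch of what such a primitive might look like. It is not how SenseML is implemented; the `Word` type, the function names, and the 3-unit tolerance are assumptions for illustration. It tolerates small vertical jitter and ignores page numbers entirely, but it does not attempt to correct real skew.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    page: int
    x: float
    y: float
    text: str


# Type detection: "$" followed by digits, optional thousands separators and cents.
DOLLAR = re.compile(r"^\$(\d{1,3}(?:,\d{3})*|\d+)(?:\.\d{2})?$")


def group_rows(words: List[Word], y_tolerance: float = 3.0) -> List[List[Word]]:
    """Cluster words into rows, allowing a little vertical jitter within a row."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.page, w.y)):
        last = rows[-1][-1] if rows else None
        if last and last.page == w.page and abs(last.y - w.y) <= y_tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Within each row, restore left-to-right reading order.
    return [sorted(row, key=lambda w: w.x) for row in rows]


def lowest_dollar_in_row(words: List[Word], first_word: str,
                         y_tolerance: float = 3.0) -> Optional[float]:
    """Find the first row starting with `first_word`; return its lowest dollar amount."""
    for row in group_rows(words, y_tolerance):
        if row[0].text == first_word:
            amounts = [float(w.text[1:].replace(",", ""))
                       for w in row if DOLLAR.match(w.text)]
            return min(amounts) if amounts else None
    return None


words = [
    Word(2, 50, 300.0, "Deductible"),
    Word(2, 200, 301.5, "$1,000.00"),  # slightly lower than its neighbors
    Word(2, 300, 299.2, "$500.00"),    # slightly higher than its neighbors
    Word(2, 50, 320.0, "Premium"),
    Word(2, 200, 320.0, "$1,250.00"),
]
print(lowest_dollar_in_row(words, "Deductible"))
```

Because the query never names a page or a box, a cover page or a shifted section does not change the result; only the row's own content matters.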

Second, for data representations with a visual or natural language component, such as tables, checkboxes, and paragraphs of free text, the primitives should embed ML. At this level of analysis, Google, Amazon, Microsoft, and OpenAI all have tools that work quite well off the shelf.

Time to Value as a North Star

Sensible takes just that approach: blending powerful and flexible templates with machine learning. With SenseML, our JSON-based query language for documents, you can extract structured data from most document layouts in minutes with just a single reference sample. No need for thousands of training documents and months spent tweaking algorithms, and no need to write hundreds of rules to account for tiny layout differences.

SenseML’s wide range of primitives allows you to quickly map representational modes to useful data structures, including complex nested substructures. In cases where the primitives do not use ML, they behave deterministically to provide strong behavior and accuracy guarantees. And even for the non-deterministic output of our ML-powered primitives, such as tables, validation rules can identify errors in the ML output.
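
To show the idea behind validating non-deterministic output, here is a generic sketch of one such rule for a bank statement table. It does not use Sensible's validation syntax; the function name, the dictionary shape, and the tolerance are hypothetical.

```python
from typing import Dict, List, Optional


def validate_statement(opening: float, closing: float,
                       transactions: List[Dict[str, Optional[float]]],
                       tolerance: float = 0.005) -> List[str]:
    """Return a list of human-readable problems found in an extracted statement."""
    errors: List[str] = []

    # Rule 1: every extracted row must have an amount.
    for i, row in enumerate(transactions):
        if row.get("amount") is None:
            errors.append(f"transaction {i}: missing amount")

    # Rule 2: opening balance plus all transactions must equal the closing balance.
    total = sum(row["amount"] for row in transactions
                if row.get("amount") is not None)
    if abs((opening + total) - closing) > tolerance:
        errors.append("transactions do not reconcile with opening and closing balances")

    return errors


complete = [{"amount": -200.0}, {"amount": 50.25}]
dropped_row = [{"amount": -200.0}]  # the table model missed the second row

print(validate_statement(1000.0, 850.25, complete))
print(validate_statement(1000.0, 850.25, dropped_row))
```

A check like this cannot say which row is wrong, but it turns a silent extraction error into one that a downstream workflow can catch and route for review.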

What this means is that document parsing with Sensible is incredibly fast, transparent, and flexible. If you want to add a field to a template or fix an error, it's straightforward to do so.

The tradeoff for Sensible’s rapid time to value is that each meaningfully distinct document layout requires a separate template. But this tradeoff turns out to be not so bad in the real world. In most business use cases, the number of layouts is small enough to enumerate (e.g., dozens of trucking carriers generating rate confirmations in the USA; a handful of software systems generating home inspection reports). Our customers don’t create thousands of document templates – most generate tremendous value with just a few.

Of course, for every widely used tax form, insurance policy, and verification of employment, collectively we only need to create a template once. That’s why we’ve introduced…

Sensible’s Open-source Library of Pre-built Templates

Our open-source Sensible Configuration Library is a collection of over 100 of the most frequently parsed document layouts, from auto insurance policies to ACORD forms, loss runs, tax forms, and more. If you have a document that's of broad interest, we'll do the onboarding for you and then make it freely available to the public. It will also be free for you to use for up to 150 extractions per month on our free account tier.


We believe that this hybrid approach is the path to transparently and efficiently solving the problem of turning documents into structured data for a wide range of industries, including logistics, financial services, insurance, and healthcare. If you'd like to join us on this journey and connect your documents to software, schedule a demo or sign up for a free account!


Written by sensible | Fast & flexible data extraction from documents.
Published by HackerNoon on 2022/07/26