Creating a Wrapper for Tesseract is Several Times Faster Than PyTesseract

Written by nuralem | Published 2022/10/31
Tech Story Tags: ocr | tesseract | python | rpa | python-programming | python3 | python-development | learn-to-code-python

TLDRThe basic idea is to use python’s built-in multiprocessing features to split documents into separate pages and run multiple tesseract engine instances for parallel page recognition. Tesseract uses one core to recognize images, in average cases, it will be enough, but if you have “heavy” documents, that have many sheets, it would be very slow. The technology is called OCR (Optical Character Recognition) One of the most popular and free OCR software is free and open source.via the TL;DR App

In this article, I want to share with you, how to create your python wrapper, that solves the basic problem of the tesseract engine – the small speed of recognizing multiple pages in one document.

The basic idea is to use python’s built-in multiprocessing features to split documents into separate pages and run multiple tesseract engine instances for parallel page recognition.

Tesseract uses one core to recognize images, in average cases, it will be enough, but if you have “heavy” documents, that have many sheets, it will be very slow, because tesseract by default will use only one CPU core.

Why do we need tesseract?

So, in some cases, when you cannot copy content from pdf-document (pdf has different formats itself, one of them – being a scanned pdf document, which is just an image of the actual text), we need software, that will recognize text from image.

This technology is called OCR (Optical Character Recognition). One of the most popular and free OCR software is tesseract. Tesseract originally was developed by Hewlett-Packard in 2006, then it was sold to Google. Google made tesseract free and open source. Today, the tesseract is being developed by a group of enthusiasts for free.

Set up the environment.

For developing our python wrapper, we need the latest version of python, currently, it is 3.11, download link: https://www.python.org/downloads/release/python-3110/

and pipenv (for the virtual environment).  When you finished downloading python, run it on your computer. Don’t forget to select “Add python.exe to PATH” for running it from your PowerShell or cmd.exe.

Next, install pipenv through the pip install pipenv command in cmd.exe. Finally, you will see “Successfully installed”. Also, we need to install pdf2image (pip install pdf2image) and download a poppler for it.

You will see the “Successfully created virtual environment” and its full path. Remember the full path, you will use it in VS code IDE. Open VS Code and select python interpreter with the combination ctrl + shift + p then select “Python: Select Interpreter”.

Select our newly created virtual environment from the menu.

Next, we need to download the tesseract engine (v5.2.0.20220712) and put it inside our project folder (create a tesseract folder inside the project folder.

After installation, we are ready to go.

Programming.

Create main.py file and paste this code.

import tempfile
from pdf2image import convert_from_path
import pytesseract
import time


pytesseract.pytesseract.tesseract_cmd=r'C:/Users/Nuriq/Desktop/python_wrapper/tesseract/tesseract.exe'


def start_ocr(pdf_path, poppler_path):
    full_raw_text = ""

    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(
            pdf_path=pdf_path, 
            output_folder=path, 
            paths_only=True, 
            fmt="jpeg", 
            poppler_path=poppler_path, 
            dpi=250, 
            grayscale=True
            )
        for img_path in images_from_path:
            full_raw_text += pytesseract.image_to_string(img_path)
    return full_raw_text


if __name__ == '__main__':
    start_time = time.time()
    full_raw_text = start_ocr(
        'C:/Users/Nuriq/Desktop/python_wrapper/PublicWaterMassMailing.pdf', 
        'C:/Users/Nuriq/Desktop/python_wrapper/poppler-0.68.0/bin'
        )
    print(full_raw_text)
    end_time = time.time()
    print(f"it took: {end_time-start_time}")

The pytesseract.pytesseract.tesseract_cmd line is the full path to our tesseract.exe engine. We use python “TemporaryDirectory” to store temporary files, which works faster than physically storing them on our HDD/SDD. To start using tesseract, we need to convert a single pdf document to images (tiff/jpeg files), where 1 page = 1 image.

Store it in the top folder and run full_raw_text += by tesseract.image_to_string(img_path), where img_path is a full path to the image.

Finally, we got a 51-second result to proceed with all 32 pages (pdf2image took 8 seconds to convert pdf to images). Then write the next code.

import tempfile
from pdf2image import convert_from_path
import pytesseract
import os
import time
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor


pytesseract.pytesseract.tesseract_cmd = r'C:/Users/Nuriq/Desktop/python_wrapper/tesseract/tesseract.exe'


def start_ocr(pdf_path, poppler_path):
    images_from_path = []
    with tempfile.TemporaryDirectory() as path:
        images_from_path = convert_from_path(
            pdf_path=pdf_path, 
            output_folder=path, 
            paths_only=True, 
            fmt="jpeg", 
            poppler_path=poppler_path, 
            dpi=250, 
            grayscale=True
            )
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
            tasks = {executor.submit(pytesseract.image_to_string, img_path): img_path for img_path in images_from_path}
            for future in concurrent.futures.as_completed(tasks):
                page_number = tasks[future]
                data = future.result(), page_number[-5]
                yield data


def sort_text(text):
    return(sorted(text, key = lambda x: x[1])) 


def pdf_to_string(pdf_path, poppler_path):
    full_raw_text = start_ocr(
        pdf_path, 
        poppler_path
        )
    full_text = ""
    text = sort_text(full_raw_text)
    for page_text, _ in text:
        full_text += page_text
    return full_text


if __name__ == '__main__':
    start_time = time.time()
    text = pdf_to_string(
        'C:/Users/Nuriq/Desktop/python_wrapper/PublicWaterMassMailing.pdf', 
        'C:/Users/Nuriq/Desktop/python_wrapper/poppler-0.68.0/bin'
        )
    print(text)
    end_time = time.time()
    print(f"it took: {end_time-start_time}")

Run it and finally, we got 7 seconds (on Core i7 12700), which is 7.3x faster than the previous code sample.

Conclusion.

Finally, using ProcessPoolExecutor we run 19 instances of tesseract for every file on the list and do parallel calculations. Then we save every process result, and we put the page number to it. After all, using the sort_text() function, we sort them and merge them into one text.


Written by nuralem | Robotics engineer
Published by HackerNoon on 2022/10/31