Extraction of Relevant Text From Scientific Papers Using Machine Learning

Written by me3an | Published 2022/07/18
Tech Story Tags: machine-learning | ml | ai | artificial-intelligence | intelligent-machines | unet | ocr | scientific-research

TLDR: There's huge potential in the domain of vital information extraction and summarization of scientific papers that I believe is under-researched. In this article, I'll show you how UNet, a technique created for biomedical image segmentation, can be combined with Optical Character Recognition to extract relevant parts of a scientific paper. The intended sections are ones commonly present in scientific papers, such as the Title, Abstract, and Author/s. I use UNet to learn where these sections can be found and then pass the learned information to an OCR engine.

Imagine having machines that can understand scientific papers and summarize their content. I mean… I'm talking about very sophisticated documents written by some of the smartest people who have ever lived. Assisted by smart machines created in our image, what problems could we solve, and what diseases could we cure?

Figure-1: The dream of the thinking machines.

There’s a huge potential in the domain of vital information extraction and summarization of scientific papers that I believe is under-researched.

In this article, I'll show you how UNet, a technique created for biomedical image segmentation, can be combined with Optical Character Recognition (OCR) to extract relevant parts of a scientific paper. Due to time constraints and project scope, I'll be covering textual extraction but not summarization.


Objective

In this project, I aim to extract certain sections from scientific papers. The intended sections are ones commonly present in scientific papers, such as the Title, Abstract, and Author/s. I use UNet to learn where these sections can be found and then pass the learned information to an OCR engine. The complete project can be found here.

Figure-2: Input vs. final result. The input is the first page of a paper converted into an image; the final result is a txt file containing the Title, Author/s, and Abstract sections, as shown above.

How did I do it?

I started by gathering and collecting scientific papers. Since I am only interested in three sections, namely the Title, Abstract, and Author/s, I kept only the first page of each collected paper (most run to eight or more pages). I then converted the PDFs into images, since my UNet model only accepts images, and split the dataset 80/20 into a training set and a test set. Next, I wrote a Python script, mask.py, to mask the images; see the figure below.
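My preparation scripts live in the repository; as a rough illustration of the conversion and split step, here is a minimal sketch assuming the pdf2image library and a hypothetical papers folder for the collected PDFs:

import random
from pathlib import Path
from pdf2image import convert_from_path  # needs poppler installed on the system

pdfs = sorted(Path("papers").glob("*.pdf"))  # hypothetical folder of collected PDFs
for pdf in pdfs:
    # keep only the first page, where the Title, Author/s, and Abstract appear
    page = convert_from_path(pdf, first_page=1, last_page=1)[0]
    page.save(Path("data/imgs") / f"{pdf.stem}.png")

# 80/20 split into a training set and a test set
images = sorted(Path("data/imgs").glob("*.png"))
random.shuffle(images)
cut = int(0.8 * len(images))
train_set, test_set = images[:cut], images[cut:]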

Figure-3: On the right is the original first page of a scientific paper; on the left is the result of mask.py for the corresponding paper.

The masked image is a single-channel binary image where each pixel has a value of 0 (black) or 1 (white): 0 means the pixel is not important, and 1 means it is. The white region of the image on the left marks the Title, Abstract, and Author/s sections as important.
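The actual masking logic lives in mask.py in the repository; conceptually, such a mask can be built from hand-labeled bounding boxes, as in this minimal sketch (the box coordinates and file names here are hypothetical):

import numpy as np
from PIL import Image

page = Image.open("data/imgs/paper_001.png")  # hypothetical page image
mask = np.zeros((page.height, page.width), dtype=np.uint8)

# (left, top, right, bottom) boxes around the Title, Author/s, and Abstract
boxes = [(100, 80, 1100, 160), (100, 170, 1100, 230), (100, 260, 1100, 700)]
for left, top, right, bottom in boxes:
    mask[top:bottom, left:right] = 1  # 1 = important region

# scale to 0/255 so the important region shows up white when viewed
Image.fromarray(mask * 255).save("data/masks/paper_001.png")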

We pass this mask and the original image to our UNet model for training. UNet is a popular architecture with many available implementations; I've used the one by milesial. If you've already cloned my Dagshub repository, you don't need to clone that repo separately. You can train the model by changing your working directory to Unet-OCR/Pytorch-UNet and then running train.py as follows:

python train.py --epoch 6 --batch-size 1 --learning-rate 0.000001

You can obviously set your own parameters. Otherwise, you can use my pre-trained model MODEL.pth which you’ll get upon running:

dvc pull -r origin

Make sure you set your DVC origin to my Dagshub repository, as per the Installation instructions below. My best model achieved a training loss of 0.395 after 6 epochs and 1365 steps.

Finally, I used my test set for model prediction. For each RGB test image, the model generates a mask of the important sections. Using a post-processing script, postprocess.py, I reconstructed RGB images from their corresponding masks, keeping only the masked regions. Those images were then passed to Tesseract OCR to convert them into text.
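The actual logic is in postprocess.py in the repository; the core idea is simply to keep the original pixels where the predicted mask is white and blank out everything else, roughly as in this sketch (file names hypothetical):

import numpy as np
from PIL import Image

original = np.array(Image.open("testset/paper_001.png").convert("RGB"))
mask = np.array(Image.open("unetpred/paper_001.png").convert("L")) > 0

# blank out everything outside the predicted regions; white works best for OCR
result = original.copy()
result[~mask] = 255
Image.fromarray(result).save("postprocessed/paper_001.png")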

Getting started

Installation

Reproducing my environment and getting up and running is quite easy.

  1. Start by cloning my repository

git clone https://dagshub.com/Eman22S/Unet-OCR-2.0.git

2. Install DVC to track your dataset

pip install dvc

3. Install dagshub for data, model, and experiment logging/tracking.

pip install dagshub

4. Configure your DVC remote as follows. Dagshub will auto-generate the content in braces for you.

dvc remote add origin https://dagshub.com/Eman22S/Unet-OCR-2.0.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user {your-username}
dvc remote modify origin --local password {your-password}

If you feel confused about this step, don't worry! Dagshub provides excellent documentation that even a monkey can easily follow.

5. If you want to pull my model and dataset, run this command. Otherwise, skip this step.

dvc pull -r origin

6. Install tesseract on your machine

sudo apt install tesseract-ocr-all -y

Running

  1. Start by training your model using

python train.py --amp

Make sure you have changed the directory to installation-path/Unet-OCR-2.0/UNet-OCR/Pytorch-UNet. Also make sure your images and masks are in installation-path/Unet-OCR-2.0/UNet-OCR/Pytorch-UNet/data/imgs and installation-path/Unet-OCR-2.0/UNet-OCR/Pytorch-UNet/data/masks respectively, or run dvc pull -r origin to use my training dataset.

2. Run prediction using the following code:

python predict.py -i {path to testset} -o {path to save generated file}

3. Convert the predicted mask into an image:

python postprocess.py -i {path to masked image} -e {path to original image} -o {path to save output}

4. Convert the post-processed image to text using Tesseract (the second argument is the output file path, without the .txt extension that Tesseract appends):

tesseract {path to the postprocessed image} {path to output file} -l eng
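If you prefer calling Tesseract from Python rather than the command line, the pytesseract wrapper drives the same binary; a minimal sketch (file names hypothetical):

from PIL import Image
import pytesseract

# read the post-processed page and write the recognized text to a file
text = pytesseract.image_to_string(Image.open("postprocessed/paper_001.png"), lang="eng")
with open("ocrresult/paper_001.txt", "w") as f:
    f.write(text)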

What did I use?

There is a lot of open-source software out there nowadays for machine learning projects. Here's a list of my favorites, which I've used in this project.

  • Google Colab notebook

  • MLflow

  • DVC

  • Dagshub

  • Python

  • ImageIo

  • PyTorch

  • matplotlib

  • Numpy

  • Pillow

  • torch

  • torchvision

  • tqdm

If your computer doesn't have the right resources for training, don't worry, Google has your back! Google Colab notebooks provide limited access to a Tesla GPU for free. You can obviously upgrade your account to premium, but the free version is not too bad; I've used the free version for this project. I've used Dagshub's Git tracker to track my model, and DVC to track my dataset.

For most machine learning projects, no language comes close to Python with its powerful data science toolkit. PyTorch is a Python framework that allows an object-oriented style of model building.
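As a small illustration of that object-oriented style (this is not the milesial implementation, just a sketch), the basic UNet building block of two stacked convolutions can be written as a module:

from torch import nn

class DoubleConv(nn.Module):
    # two 3x3 convolutions with batch norm and ReLU: the basic UNet building block
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)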

Project Pipeline

It's a bit of a complicated project, since I'm chaining two models, so I thought a diagram could help. The diagram below shows an overview of the project pipeline.
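For readers who prefer code to diagrams, the same pipeline can be chained with a small driver script; a rough sketch, run from the Pytorch-UNet directory, with hypothetical file names (the folder names are the ones used later in this article):

import subprocess

# 1) predict a mask for a test page, 2) keep only the masked regions, 3) run Tesseract
subprocess.run(["python", "predict.py", "-i", "testset/paper_001.png",
                "-o", "unetpred/paper_001.png"], check=True)
subprocess.run(["python", "postprocess.py", "-i", "unetpred/paper_001.png",
                "-e", "testset/paper_001.png", "-o", "postprocessed/paper_001.png"], check=True)
subprocess.run(["tesseract", "postprocessed/paper_001.png",
                "ocrresult/paper_001", "-l", "eng"], check=True)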

Results

With the Git-based experiment tracking embedded in Dagshub, you can easily view and track the model parameters that gave you the best result.

Figure-5: Experiments table

My table shows I got better performance with the amp parameter set to false and epochs set to 5. Check this link to view all my experiments.

I've added the predictions from two models on different branches in the following GIF, which makes it easier to compare the two models visually. This is made incredibly easy by the comparison feature Dagshub provides: you can compare datasets, code, and pretty much anything. Pretty cool, huh?

Figure-6: Comparing results of predictions from different models in different branches.

The GIF reveals that the model's predictions are not good so far. But looking at the experiment table in Figure-5, we know this can be improved by training for a few more epochs.

Conclusion

The biggest challenge of this project is the frequent data processing required. As you can see from the previous sections, I transformed my dataset several times, not counting the initial preprocessing. I used Dagshub to help me track my dataset changes. Dagshub is an online repository that supports several version control and tracking tools, namely Git, DVC, and MLflow.

I've mostly taken advantage of DVC. I can't tell you how many times it saved my day when I mistakenly overwrote images in the testset, unetpred, and postprocessed folders.

That is not to understate the powerful parameter and metric logger accessible via the Experiments tab in Dagshub. I used it to learn which model parameters gave me the best validation Dice score.
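For reference, the Dice score measures the overlap between the predicted mask and the ground-truth mask (1.0 means perfect overlap); a minimal PyTorch sketch:

import torch

def dice_score(pred, target, eps=1e-6):
    # pred and target are binary masks of the same shape
    pred = pred.float().flatten()
    target = target.float().flatten()
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)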

All the tools I used for this project are free and open source. Now that we have our final result in txt format under ocrresult, the next milestone is to build an NLP model for the summarization step. If you're interested in collaborating in this area or have experience working on text summarization, I would be happy to work together. Stay tuned for my next blog!

References

https://labs.armut.com/information-extraction-from-the-turkish-identity-card-44a1b4504cf?gi=8a5f76a04f02

https://arxiv.org/abs/1505.04597

https://github.com/tesseract-ocr/tesseract

https://dagshub.com/docs/

