Improving Information Extraction with Deep Learning

Anthony Mele
May 2, 2022


“Data is the new oil” — Clive Humby

Introduction to Information Extraction

Data is everywhere; we have identified its existence, and now we need to extract it. Data extraction, or Information Extraction (IE), is the automated retrieval of data related to a specific domain from a specific medium. Today, the most prevalent use cases for information extraction are in the Natural Language Processing (NLP) space. However, we can also extract data from images, audio, and even DNA, though doing so comes with significant challenges. These challenges are usually unique to the medium the data is trapped in, and so the solutions are just as unique.

When approaching an IE task, the information can be unstructured, semi-structured, or structured, and it may or may not be machine-readable. We will look at IE from the point of view that a machine-readable, text-based dataset is an extreme luxury, using real examples I have seen in Data Science consulting and freelancing.

Introduction to Deep Learning

A subset of Machine Learning, Deep Learning is the use of neural networks, trained on enormous datasets with highly customized architectures and methodologies, to solve extremely challenging problems in classification, generation, and prediction. Some types of Deep Learning models include Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), Generative Adversarial Networks (GANs), Restricted Boltzmann Machines (RBMs), and Multilayer Perceptrons (MLPs). From the amount of data used and the complexity of these algorithms, we can conclude (besides the fact that Data Science has no shortage of acronyms) that Deep Learning has the power to solve challenges in an almost unnerving fashion.

A Single Information Extraction Use Case

In the following sections, we will look at multiple Deep Learning techniques applied in novel ways to a single dataset. We will ignore the specifics of the dataset but will reveal its composition and the objective of the use case. Let's call our dataset SPDFs (scanned PDFs), provided by a major logistics company. The SPDFs contain customer trucking and shipping information for millions of deliveries over a 30-year period (starting in the late 1970s), with virtually no descriptive information provided about the data. These SPDFs were manually scanned in, one at a time, by hundreds of employees of an overseas data entry company.

Structuring the Extraction Process

We have a random collection of SPDFs with no descriptive data, so we can think of them as a single folder with file names SPDF-1.pdf through SPDF-10000000.pdf. Visual inspection shows the files have different structural formats: fields of data sit in different locations across both single-page and multi-page documents, and some documents have fields that others do not. What we want in the end is a database containing all the fields, structured so that the client can perform any number of statistical measures or analyses to support their business. Where do we begin?

Fig 1. — Example of Scanned PDF Document, slight counter-clockwise rotation

We have scanned PDFs (SPDFs), so we will definitely need a robust Optical Character Recognition (OCR) solution. We will probably also need to become experts in Regular Expressions and Natural Language Processing, and we will have to find efficient ways to move this data around AWS for the work to remain profitable. However, it turns out the most impactful skill we needed was expertise in Deep Learning.

The first step we chose to focus on was discovering how many unique SPDF single-page formats we had to work with. We split our documents between single-page and multi-page files and got started.
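A minimal sketch of that split, assuming the flat folder of SPDF-*.pdf files described above; the use of pypdf and the destination paths are illustrative choices, not the project's actual tooling.

```python
# Split SPDFs into single-page and multi-page sets by page count.
from pathlib import Path
import shutil

from pypdf import PdfReader

SRC = Path("spdfs")            # flat folder of SPDF-*.pdf files
SINGLE = Path("spdfs_single")  # destination for single-page documents
MULTI = Path("spdfs_multi")    # destination for multi-page documents
SINGLE.mkdir(exist_ok=True)
MULTI.mkdir(exist_ok=True)

for pdf_path in SRC.glob("SPDF-*.pdf"):
    n_pages = len(PdfReader(pdf_path).pages)
    dest = SINGLE if n_pages == 1 else MULTI
    shutil.copy(pdf_path, dest / pdf_path.name)
```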

Clustering SPDFs with Transfer Learning

At this point we can ignore the quality of the text, where it is located in the document, and what its values are; for now we want to identify templates. A template is an SPDF with no words on the page, only the vertical and horizontal lines that are unique to every type of form. If we attempt to cluster documents that still contain text, the text (noise, in this case) will make it much more challenging for the clustering algorithm to find strong, reliable groupings.

Fig. 2 — Scanned PDF Templates

We begin by storing the SPDFs as .tiff images, as they require less storage space. A TIFF file is a graphics container that stores raster images in the Tagged Image File Format. Taking a single image, we open it with OpenCV (Open Source Computer Vision, a package that provides real-time computer vision tools). Next, we perform the following sequence: blur the image to the extent that the letters in words bleed into one another, perform blob analysis to identify the regions where the text blobs exist, and finally subtract the text-blob regions from the original image. The resultant image will contain only the horizontal and vertical lines associated with the SPDF template.
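Here is roughly what that sequence could look like with OpenCV; this is a minimal sketch, and the kernel sizes and blob-size thresholds are illustrative guesses that would need tuning per scan quality, not the project's actual values.

```python
# Blur until letters merge, find text blobs, then erase those regions
# so only the form's ruling lines remain.
import cv2

img = cv2.imread("SPDF-1.tiff", cv2.IMREAD_GRAYSCALE)

# Binarize (forms are dark ink on light paper), then blur heavily so
# adjacent letters bleed into single connected blobs.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
blurred = cv2.blur(binary, (25, 5))  # wide horizontal kernel merges words
_, blobs = cv2.threshold(blurred, 50, 255, cv2.THRESH_BINARY)

# Blob analysis: connected components that are short and narrow enough
# are treated as text; long or tall components are the form's lines.
template = binary.copy()
n, labels, stats, _ = cv2.connectedComponentsWithStats(blobs)
for i in range(1, n):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if h < 40 and w < img.shape[1] * 0.8:  # heuristic: text-sized blob
        template[y:y + h, x:x + w] = 0     # subtract the text region

cv2.imwrite("template-1.tiff", 255 - template)  # back to dark-on-light
```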

These cleansed templates are raw pixels, not yet the kind of features a clustering algorithm can group on directly. So we run each template through VGG16 to create a dense feature vector representing the image. Finally, these vectors are clustered using KMeans. There was little rhyme or reason to choosing KMeans, other than deciding to start with something well established, and KMeans worked with over 98% accuracy. Given that level of accuracy, the complexity of the original data, and the simplicity of the clustering algorithm, we can tell our data preprocessing was spot-on. The SPDFs fit into their respective groups with ease, and we now have templates with reliable positioning of fields.
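A minimal sketch of the embedding-and-clustering step, assuming Keras's pretrained VGG16 and scikit-learn's KMeans; the file list and the choice of 12 clusters are placeholders, since in practice the cluster count would be chosen by inspecting cluster quality.

```python
# Embed each cleansed template with VGG16 (ImageNet weights, classifier
# head removed), then cluster the resulting vectors with KMeans.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import KMeans

model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x, verbose=0)[0]  # 512-dim feature vector

paths = [f"template-{i}.tiff" for i in range(1, 101)]  # illustrative
features = np.stack([embed(p) for p in paths])

kmeans = KMeans(n_clusters=12, random_state=0).fit(features)
for path, label in zip(paths, kmeans.labels_):
    print(label, path)
```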

Increasing Resolution With Generative Adversarial Networks

Generative Adversarial Networks

Generative Adversarial Networks have been in use since June 2014, and since their inception the number of use cases has increased dramatically. Generative Adversarial Networks, or GANs, pit two neural networks (a Generator and a Discriminator) against each other to generate fake data based on real data. We can think of the generated data as having the same statistical make-up as the real data.

Fig 3. — Generalized example of Generative Adversarial Network

The above figure starts with samples of Real Data which have been carefully chosen to support a specific use case. Random Noise is then created and fed to the Generator, producing Fake Data. The Real Data and Fake Data are both given to the Discriminator, which decides whether the generated fake samples appear real enough to be classified as real. It can take hundreds of thousands of samples of Real Data and up to a hundred hours of training before the Generator is able to achieve this goal. However, once fully trained, a GAN can create information that otherwise wouldn't exist.
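For the curious, here is a minimal sketch of that adversarial loop in PyTorch; the tiny fully connected networks and their dimensions are placeholders, not the architecture used in this project.

```python
# One GAN training step: the generator maps noise to fake samples, the
# discriminator scores real vs. fake, and each network updates against
# the other.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784))
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real):                       # real: (batch, 784) tensor
    noise = torch.randn(real.size(0), 64)
    fake = G(noise)

    # Discriminator: push real scores toward 1, fake scores toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    loss_d.backward()
    opt_d.step()

    # Generator: fool the discriminator into scoring fakes as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```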

The Super-Resolution GAN (SRGAN) is a type of GAN that can increase the resolution of images. It turns out many of the documents in our SPDF collection have a resolution too low for OCR to extract text without a large number of errors. To train our SRGAN, we took a higher-resolution image and reduced its resolution to use as input to the generator. The fake images the generator produces are then essentially up-sampled to "super" resolution. Our discriminator compares the generated super-resolution image text to the original high-resolution image text. The optimization function then applies its learned change to both the generator and the discriminator, and after some time we had images that our OCR code could translate into text with limited errors. In fact, after running spell checking and some digit-to-letter conversion heuristics, we had perfect translation.
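A sketch of how those training pairs can be built: downsample a high-resolution scan to create the generator's low-resolution input, and keep the original as the real sample for the discriminator. The 4x scale factor and bicubic interpolation are common SRGAN conventions assumed here, not details taken from our pipeline.

```python
# Build (low-res, high-res) training pairs for super-resolution.
import cv2

def make_pair(path, scale=4):
    hr = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = hr.shape
    lr = cv2.resize(hr, (w // scale, h // scale),
                    interpolation=cv2.INTER_CUBIC)
    return lr, hr  # lr feeds the generator; hr is the real sample

lr, hr = make_pair("SPDF-42-page1.tiff")
```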

Finding Address Blocks with Object Detection

Object detection is a well-known Deep Learning task that, on some benchmarks, has been shown to exceed human labeling accuracy. With object detection, we can automatically identify things such as a cat, a car, or a person in images and video feeds. These algorithms have been used in a myriad of applications and industries, for everything from tagging images to helping drones navigate urban landscapes.

Now, even though we have our SPDFs clustered into groups and can count on page elements being in predetermined locations, it is still advantageous to identify certain elements and extract them in a specific way. We found that our OCR'd text would occasionally contain address blocks in random positions, in most cases stamped askew to the page orientation by a person upon receipt of a delivery.

Fig 4. — Skewed Address Stamped Block Example

To capture these with low to no errors, we used a prebuilt object detection algorithm and retrained it using snips of address blocks; there are good write-ups explaining how one can retrain a robust, generalized object detection algorithm for a specific detection task. This methodology was so successful that we used it in many other applications to extract "un-extractable" information from complex mediums.
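As a hedged illustration, here is what retraining a prebuilt detector can look like, using torchvision's Faster R-CNN as a stand-in for whichever model was actually used. Only the classification head is replaced: one foreground class (the address block) plus background gives num_classes=2.

```python
# Swap the box-predictor head of a pretrained detector so it predicts a
# single "address block" class, then fine-tune on the labeled snips.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

# Training then proceeds as usual: each sample is an image tensor plus a
# target dict with "boxes" (pixel coordinates of the stamped address
# block) and "labels" (all 1s for the single foreground class).
```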

Conclusion

Above, we reviewed multiple Deep Learning methods with custom applications to help extract information from a complex medium of scanned PDFs (SPDFs). We found many more approaches not discussed here, and a number of them were strong enough to become products and capabilities in their own right.

If you found this article interesting or insightful, please follow me, subscribe, and become a member to get notified about more projects and new use cases for Deep Learning.
