Natural Language Processing, Adobe PDF Extract, and Deep PDF Intelligence

Raymond Camden
Published in Adobe Tech Blog · 5 min read · Nov 9, 2021

The Adobe PDF Extract API is a powerful tool to get information from your PDFs. This includes the layout and styling of your PDF, tabular data in easy-to-use CSV format, images, and raw text. All things considered, the raw text may be the least interesting aspect of the API. One useful possibility is to take the raw text and provide it to search engines (see Using PDFs with the Jamstack — Adding Search with Text Extraction). But another fascinating possibility for working with the text involves natural language processing, or NLP.

Broadly (very broadly, see the Wikipedia article for deeper context), NLP is about understanding the contents of text. Voice assistants are a great real-world example of this. What makes Alexa and Google Assistant devices so powerful is that they don’t just hear what you say, but they understand the intent of what you said. This is different from the raw text.

If I say “I work for Adobe”, that’s a different statement than saying “I live in an adobe house”. Understanding the difference between those two uses of the same word takes machine learning, artificial intelligence, and other words that only really smart people get.

Organizations that deal with incoming PDFs, or that have a large archive of existing documents, can use a combination of the PDF Extract API and NLP to gain better knowledge of what’s contained within their PDFs.

In this article, we’ll walk you through an example that demonstrates these two powerful features working together. Our demo application will scan a directory of PDFs. For each PDF, it will first extract the text from the PDF. Next, it will send that text to an NLP API. In both operations, we can save the results for faster processing later.

For our NLP API, we will be making use of a service from Diffbot. Diffbot has multiple APIs, but we will focus on their natural language service. Their API is rather simple to use, but provides a wealth of data, including:

  • Entities, or basically, subjects of a document.
  • The type of those entities, so for example, a name is an entity and the type is a person.
  • Facts revealed in the document, for example, “The founder of company so and so was Joe So and So.”
  • Sentiment (how negative or positive something is).

Diffbot’s API can also be trained with custom data so that it can better parse your input. Check their docs for more information; their quick start is a great example. They provide a two-week trial and do not require a credit card to sign up.

Here’s a simple example of how to call their API using Node.js:
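(This is a minimal sketch rather than a full program: it assumes node-fetch, a token stored in a DIFFBOT_TOKEN environment variable, and the request shape described in their quick start.)

```js
const fetch = require('node-fetch');

// Diffbot token from your account (the two-week trial works fine).
const DIFFBOT_TOKEN = process.env.DIFFBOT_TOKEN;

async function analyzeText(text) {
  // Ask the Natural Language API for entities, facts, and sentiment.
  const url = `https://nl.diffbot.com/v1/?fields=entities,facts,sentiment&token=${DIFFBOT_TOKEN}`;

  const resp = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      content: text,
      lang: 'en',
      format: 'plain text'
    })
  });

  return resp.json();
}

(async () => {
  const result = await analyzeText('I work for Adobe. I do not live in an adobe house.');
  console.log(JSON.stringify(result, null, '\t'));
})();
```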

Alright, so let’s build our demo. I have a folder of PDFs already, so my general process will be:

  • Get a list of PDFs.
  • For each, see if I’ve already got the text. Given a PDF named catsrule.pdf, I’ll look for catsrule.txt.
  • If I don’t, use our Extract API to get the text.
  • For each, see if I’ve already got the NLP results. Given a PDF named catsrule.pdf, I’ll look for catsrule.json.
  • If I don’t, send the text to the Diffbot API and save the results.

Here’s the code that represents this logic, minus the actual implementations of either API (I’ve also removed the require statements):
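(The sketch below assumes the PDFs live in a local pdfs folder and that the NLP call is wrapped in a function named getNLP; getText and getNLP are covered next.)

```js
// (Require statements trimmed, as noted above; this uses fs plus the
// getText and getNLP functions shown below.)

const SOURCE_DIR = './pdfs/';

(async () => {
  const files = fs.readdirSync(SOURCE_DIR).filter(f => f.toLowerCase().endsWith('.pdf'));

  for (const file of files) {
    const pdfPath = SOURCE_DIR + file;
    const textPath = pdfPath.replace(/\.pdf$/i, '.txt');
    const nlpPath = pdfPath.replace(/\.pdf$/i, '.json');

    // Only call the Extract API if we don't already have the text.
    if (!fs.existsSync(textPath)) {
      const text = await getText(pdfPath);
      fs.writeFileSync(textPath, text);
      console.log(`Saved extracted text to ${textPath}`);
    }

    // Only call the NLP API if we don't already have the results.
    if (!fs.existsSync(nlpPath)) {
      const text = fs.readFileSync(textPath, 'utf8');
      const result = await getNLP(text);
      fs.writeFileSync(nlpPath, JSON.stringify(result));
      console.log(`Saved NLP results to ${nlpPath}`);
    }
  }
})();
```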

By saving results, this script could be run multiple times as new PDFs are added. Now let’s look at our API calls. First, getText:
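(A sketch using the @adobe/pdfservices-node-sdk package; exact class and builder names can vary by SDK version, and the credentials file name is whatever you downloaded when signing up.)

```js
const PDFServicesSdk = require('@adobe/pdfservices-node-sdk');
const AdmZip = require('adm-zip');
const fs = require('fs');

async function getText(pdfPath) {
  // Credentials file downloaded when you sign up for the free trial.
  const credentials = PDFServicesSdk.Credentials.serviceAccountCredentialsBuilder()
    .fromFile('pdfservices-api-credentials.json')
    .build();

  const executionContext = PDFServicesSdk.ExecutionContext.create(credentials);

  // We only need text elements for this demo.
  const options = new PDFServicesSdk.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(PDFServicesSdk.ExtractPDF.options.ExtractElementType.TEXT)
    .build();

  const extractOperation = PDFServicesSdk.ExtractPDF.Operation.createNew();
  extractOperation.setInput(PDFServicesSdk.FileRef.createFromLocalFile(pdfPath));
  extractOperation.setOptions(options);

  // The result is a ZIP containing structuredData.json.
  const outputZip = pdfPath.replace(/\.pdf$/i, '.zip');
  if (fs.existsSync(outputZip)) fs.unlinkSync(outputZip);

  const result = await extractOperation.execute(executionContext);
  await result.saveAsFile(outputZip);

  // Pull the JSON out of the ZIP and join the Text value of every element.
  const zip = new AdmZip(outputZip);
  const data = JSON.parse(zip.readAsText('structuredData.json'));
  return data.elements.filter(e => e.Text).map(e => e.Text).join(' ');
}
```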

For the most part, this is taken right from our Extract API docs. Our SDK returns a ZIP file, so we use an npm package (AdmZip) to get the JSON result out of the ZIP. We then filter the JSON result down to its text elements to create one big honkin’ text string. The net result is that, given a PDF filename as input, we get a text string back.

Now let’s look at the code to execute the natural language processing on the text:
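(Again, a sketch: the function name getNLP and the particular set of fields requested are choices made for this demo.)

```js
const fetch = require('node-fetch');

async function getNLP(text) {
  // The fields value controls what Diffbot returns; tweak it as needed.
  const fields = 'entities,facts,sentiment,categories';
  const url = `https://nl.diffbot.com/v1/?fields=${fields}&token=${process.env.DIFFBOT_TOKEN}`;

  const resp = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      content: text,
      lang: 'en',
      format: 'plain text'
    })
  });

  return resp.json();
}
```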

This is virtually identical to the earlier example. Note that you can tweak the fields value to change what Diffbot does with your text. The result is an impressively large amount of data. Much like the PDF Extract API, the JSON can be hundreds of lines long. If you want to see the raw result, you can look at an example here. Warning: when formatted, that's roughly sixty-two thousand lines of data.

So now what? Great question! As a quick demo, I thought it would be nice to filter the NLP data down to a list of people mentioned in the PDF as well as categories. For people, I looked at the entities returned. Here is one example:
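(The snippet below illustrates the shape of an entity rather than a verbatim result; the name, scores, and offsets are placeholders, and some fields are trimmed.)

```json
{
  "name": "Jane Doe",
  "salience": 0.72,
  "allTypes": [
    { "name": "person" }
  ],
  "mentions": [
    { "text": "Jane Doe", "beginOffset": 118, "endOffset": 126 }
  ]
}
```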

Note the value in allTypes which specifies this is a person. Here's an example of a non-person entity from a little company in Washington:
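(Again, an illustrative shape with placeholder values.)

```json
{
  "name": "Microsoft",
  "salience": 0.65,
  "allTypes": [
    { "name": "organization" }
  ],
  "mentions": [
    { "text": "Microsoft", "beginOffset": 402, "endOffset": 411 }
  ]
}
```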

Categories are a bit simpler as they don’t need any filtering (outside of uniqueness):
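(Illustrative values; the actual names and scores come from Diffbot’s taxonomy.)

```json
[
  { "name": "Technology & Computing", "score": 0.78 },
  { "name": "Business", "score": 0.52 }
]
```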

Knowing where to get stuff, I wrote a script that scanned my PDF directories and for each PDF, attempted to ‘gather’ the data:
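(A sketch of that gathering logic: it assumes the saved NLP results sit next to the PDFs and that the output goes to data.json.)

```js
// (Require statements trimmed again; this only needs fs.)

const SOURCE_DIR = './pdfs/';
const report = [];

const files = fs.readdirSync(SOURCE_DIR).filter(f => f.toLowerCase().endsWith('.pdf'));

for (const file of files) {
  const nlpPath = SOURCE_DIR + file.replace(/\.pdf$/i, '.json');
  if (!fs.existsSync(nlpPath)) continue;

  const nlp = JSON.parse(fs.readFileSync(nlpPath, 'utf8'));

  // People: any entity whose allTypes includes "person".
  const people = (nlp.entities || [])
    .filter(e => (e.allTypes || []).some(t => t.name === 'person'))
    .map(e => e.name);

  // Categories just need their names (duplicates removed below).
  const categories = (nlp.categories || []).map(c => c.name);

  report.push({
    pdf: file,
    people: [...new Set(people)],
    categories: [...new Set(categories)]
  });
}

fs.writeFileSync('./data.json', JSON.stringify(report, null, '\t'));
console.log(`Wrote results for ${report.length} PDFs.`);
```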

As before, we stripped out the require statements. The end result of this script is a JSON file with every PDF and a list of people and categories. For both, we filter to unique values. (This may be problematic for names of course, but I assume the chance of two people with the same name appearing in a PDF is slight.)

With this data, we can then build a web app to load it and render it on screen:
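(Something along these lines would do; the markup and file name here are placeholders, and the real demo lives in the repo linked below.)

```html
<div id="report"></div>

<script>
(async () => {
  // Load the JSON report produced by the gather script.
  const data = await (await fetch('./data.json')).json();

  // Render one block per PDF with its people and categories.
  document.querySelector('#report').innerHTML = data.map(d => `
    <h2>${d.pdf}</h2>
    <p><strong>People:</strong> ${d.people.join(', ')}</p>
    <p><strong>Categories:</strong> ${d.categories.join(', ')}</p>
  `).join('');
})();
</script>
```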

The code for the web application isn’t terribly interesting, but if you want to see it, or any other sample from this article, you may find it here: https://github.com/cfjedimaster/document-services-demos/tree/main/article_support/nlp

If you want to try this yourself, sign up for a free trial of PDF Extract API and check out Diffbot for a deeper look at their awesome APIs!


Raymond Camden is a Senior Developer Evangelist for Adobe. He works on the Document Services tools, JavaScript, and the Jamstack.