Segment paragraphs and detect insights with Amazon Textract and Amazon Comprehend

Many companies extract data from scanned documents containing tables and forms, such as PDFs. Some examples are audit documents, tax documents, whitepapers, or customer review documents. For customer reviews, you might be extracting text such as product reviews, movie reviews, or feedback. Further understanding of the individual and overall sentiment of the user base from the extracted text can be very useful.

You can extract data through manual data entry, which is slow, expensive, and prone to errors. Alternatively, you can use simple optical character recognition (OCR) techniques, which require manual configuration and changes for different inputs. The process of extracting meaningful information from this data is often manual, time-consuming, and may require expert knowledge and skills in data science, machine learning (ML), and natural language processing (NLP) techniques.

To overcome these manual processes, you can use AWS AI services such as Amazon Textract and Amazon Comprehend. AWS pre-trained AI services provide ready-made intelligence for your applications and workflows. Because we use the same deep learning technology that powers Amazon.com, you get quality and accuracy from continuously learning APIs. And best of all, AI services on AWS don’t require ML experience.

Amazon Textract uses ML to extract data from documents such as printed text, handwriting, forms, and tables without the need for any manual effort or custom code. Amazon Textract extracts complete text from given documents and provides key information such as page numbers and bounding boxes.

Based on the document layout, you may need to separate paragraphs and headers into logical sections to get more insights from the document at a granular level. This is more useful than simply extracting all of the text. Amazon Textract provides information such as the bounding box location of each detected text and its size and indentation. This information can be very useful for segmenting text responses from Amazon Textract in the form of paragraphs.

In this post, we cover some key paragraph segmentation techniques to postprocess responses from Amazon Textract, and use Amazon Comprehend to generate insights such as sentiment and entity extraction:

  • Identify paragraphs by font sizes by postprocessing the Amazon Textract response
  • Identify paragraphs by indentation using bounding box information
  • Identify segments of the document or paragraphs based on the spacing between lines
  • Identify the paragraphs or statements in the document based on full stops

Gain insights from extracted paragraphs using Amazon Comprehend

After you segment the paragraphs using any of these techniques, you can gain further insights from the segmented text by using Amazon Comprehend for the following use cases:

  • Detecting key phrases in technical documents – For documents such as whitepapers and request for proposal documents, you can segment the document by paragraphs using the library provided in the post and then use Amazon Comprehend to detect key phrases.
  • Detecting named entities from financial and legal documents – In some use cases, you may want to identify key entities associated with paragraph headings and subheadings. For example, you can segment legal documents and financial documents by headings and paragraphs and detect named entities using Amazon Comprehend.
  • Sentiment analysis of product or movie reviews – You can perform sentiment analysis using Amazon Comprehend to check when the sentiment of a paragraph changes in product review documents and act accordingly if the reviews are negative.

In this post, we cover the sentiment analysis use case specifically.

We use two different sample movie review PDFs for this use case, which are available on GitHub. Each document contains movie names as the headers for individual paragraphs and reviews as the paragraph content. We identify the overall sentiment of each movie as well as the sentiment for each review. However, analyzing an entire page as a single entity isn’t ideal for getting an overall sentiment. Therefore, we extract the text, identify reviewer names and comments, and generate the sentiment of each review.

Solution overview

This solution uses the following AI services, serverless technologies, and managed services to implement a scalable and cost-effective architecture:

  • Amazon Comprehend – An NLP service that uses ML to find insights and relationships in text.
  • Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
  • AWS Lambda – Runs code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
  • Amazon Simple Notification Service (Amazon SNS) – A fully managed messaging service that Amazon Textract uses to send a notification when the extraction process is complete.
  • Amazon Simple Storage Service (Amazon S3) – Serves as an object store for your documents and allows for central management with fine-tuned access controls.
  • Amazon Textract – Uses ML to extract text and data from scanned documents in PDF, JPEG, or PNG formats.

The following diagram illustrates the architecture of the solution.

Our workflow includes the following steps:

  1. A movie review document gets uploaded into the designated S3 bucket.
  2. The upload triggers a Lambda function using Amazon S3 Event Notifications.
  3. The Lambda function triggers an asynchronous Amazon Textract job to extract text from the input document. Amazon Textract runs the extraction process in the background.
  4. When the process is complete, Amazon Textract sends an SNS notification. The notification message contains the job ID and the status of the job. The code for Steps 3 and 4 is in the file textraction-inovcation.py.
  5. Lambda listens to the SNS notification and calls Amazon Textract to get the complete text extracted from the document. Lambda uses the text and bounding box data provided by Amazon Textract. The code for the bounding box data extraction can be found in lambda-helper.py.
  6. The Lambda function uses the bounding box data to identify the headers and paragraphs. We discuss two types of document formats in this post: a document with left indentation differences between headers and paragraphs, and a document with font size differences. The Lambda code that uses left indentation can be found in blog-code-format2.py and the code for font size differences can be found in blog-code-format1.py.
  7. After the headers and paragraphs are identified, Lambda invokes Amazon Comprehend to get the sentiment. After the sentiment is identified, Lambda stores the information in DynamoDB.
  8. DynamoDB stores the information extracted and insights identified for each document. The document name is the key and the insights and paragraphs are the values.
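The SNS-triggered half of this workflow (steps 5 onward) can be sketched as follows. This is a simplified illustration, not the code from lambda-helper.py: the Textract client is passed in as a parameter so the sketch stays self-contained (in a real Lambda function you would create it once with boto3.client("textract") and use the standard handler(event, context) signature):

```python
import json

def get_text_blocks(textract_client, job_id):
    """Collect all Block objects for a completed Textract job, following NextToken pagination."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract_client.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks

def handler(event, context, textract_client):
    """Triggered by the SNS notification that Amazon Textract publishes on job completion."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message["Status"] != "SUCCEEDED":
        return []
    return get_text_blocks(textract_client, message["JobId"])
```

The pagination loop matters because a multi-page document can return more blocks than fit in a single GetDocumentTextDetection response.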

Deploy the architecture with AWS CloudFormation

You deploy an AWS CloudFormation template to provision the necessary AWS Identity and Access Management (IAM) roles, services, and components of the solution, including Amazon S3, Lambda, Amazon Textract, and Amazon Comprehend.

  1. Launch the following CloudFormation template in the US East (N. Virginia) Region:
  2. For BucketName, enter textract-demo-<date> (adding a date as a suffix makes the bucket name unique).
  3. Choose Next.
  4. In the Capabilities and transforms section, select all three check boxes to acknowledge that AWS CloudFormation may create IAM resources.
  5. Choose Create stack.

This template uses AWS Serverless Application Model (AWS SAM), which simplifies how to define functions and APIs for serverless applications and provides features for these services, such as environment variables.

The following screenshot of the stack details page shows the status of the stack as CREATE_IN_PROGRESS. It can take up to 5 minutes for the status to change to CREATE_COMPLETE. When it’s complete, you can view the outputs on the Outputs tab.

Process a file through the pipeline

When the setup is complete, the next step is to walk through the process of uploading a file and validating the results after the file is processed through the pipeline.

To process a file and get the results, upload your documents to your new S3 bucket, then choose the S3 bucket URL corresponding to the s3BucketForTextractDemo key on the stack Outputs tab.

You can download the sample document used in this post from the GitHub repo and upload it to the s3BucketForTextractDemo S3 URL. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

After the document is uploaded, the textraction-inovcation.py Lambda function is invoked. This function calls the Amazon Textract StartDocumentTextDetection API, which sets up an asynchronous job to detect text from the PDF you uploaded. The code uses the S3 object location, IAM role, and SNS topic created by the CloudFormation stack. The role ARN and SNS topic ARN were set as environment variables for the function by AWS CloudFormation; these definitions can be found in textract-post-processing-CFN.yml.
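A minimal sketch of this invocation follows. The environment variable names (SNS_TOPIC_ARN, TEXTRACT_ROLE_ARN) are assumptions for illustration, not necessarily the names the template sets, and the client is passed in as a parameter (in practice, boto3.client("textract")):

```python
import os

def start_text_detection(textract_client, bucket, key):
    """Start an asynchronous Textract text-detection job; completion is reported to the SNS topic.

    SNS_TOPIC_ARN and TEXTRACT_ROLE_ARN are hypothetical environment variable names.
    """
    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={
            "SNSTopicArn": os.environ["SNS_TOPIC_ARN"],
            "RoleArn": os.environ["TEXTRACT_ROLE_ARN"],
        },
    )
    return response["JobId"]
```

The returned JobId is what later appears in the SNS notification, tying the asynchronous result back to this request.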

Postprocess the Amazon Textract response to segment paragraphs

When the document is submitted to Amazon Textract for text detection, we get pages, lines, words, or tables as a response. Amazon Textract also provides bounding box data, which is derived from the position of the text in the document. The bounding box data describes each text element’s position from the left and top of the page, the height of the characters, and the width of the text.
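As an illustration of the response shape, the following helper pulls just the LINE blocks and their bounding box fields out of a Textract Blocks list (all geometry values are ratios of the page dimensions); the dictionary keys on the output are our own naming for this post:

```python
def get_line_geometry(blocks):
    """Map each detected LINE block to its text and bounding box geometry."""
    lines = []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "text": block["Text"],
            "left": box["Left"],      # distance from the left edge of the page
            "top": box["Top"],        # distance from the top edge of the page
            "height": box["Height"],  # line height, a proxy for font size
            "width": box["Width"],
        })
    return lines
```

The segmentation techniques that follow all operate on this kind of per-line geometry rather than on the raw response.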

We can use the bounding box data to identify many different segments of the document, for example, paragraphs from a whitepaper, movie reviews, auditing documents, or items on a menu. After these segments are identified, you can use Amazon Comprehend to find sentiment or key phrases to get insights from the document. For example, we can identify the technologies or algorithms used in a whitepaper or understand the sentiment of each reviewer of a movie.

In this section, we demonstrate the following techniques to identify the paragraphs:

  • Identify paragraphs by font sizes by postprocessing the Amazon Textract response
  • Identify paragraphs by indentation using Amazon Textract bounding box information
  • Identify segments of the document or paragraphs based on the spacing between lines
  • Identify the paragraphs or statements in the document based on full stops

Identify headers and paragraphs based on font size

The first technique we discuss is identifying headers and paragraphs based on the font size. If the headers in your document are bigger than the text, you can use font size for the extraction. For example, see the following sample document, which you can download from GitHub.

First, we need to extract all the lines from the Amazon Textract response and the corresponding bounding box data to understand font size. Because the response contains a lot of additional information, we extract only the lines and bounding box data. We separate the text with different font sizes and order them by size to determine headers and paragraphs. This process of extracting headers is done as part of the get_headers_to_child_mapping method in lambda-helper.py.

The step-by-step flow is as follows:

  1. The textract-invocation Lambda function gets triggered by every file drop event.
  2. Amazon Textract completes the process of text detection and sends notification to the SNS topic.
  3. The blog-code-format1.py function gets triggered based on the SNS notification.
  4. Lambda uses the method get_text_results_from_textract from lambda-helper.py and extracts the complete text by calling Amazon Textract repeatedly for all the pages.
  5. After the text is extracted, the method get_text_with_required_info identifies bounding box data and creates a mapping of line number, left indentation, and font size for each line of the total document text extracted.
  6. We use the bounding box data to call the get_headers_to_child_mapping method to get the header information.
  7. After the header information is collected, we use get_headers_and_their_line_numbers to get the line numbers of the headers.
  8. After the headers and their line numbers are identified, the get_header_to_paragraph_data method gets the complete text for each paragraph and creates a mapping with each header and its corresponding paragraph text.
  9. With the header and paragraph information collected, the update_paragraphs_info_in_dynamodb method invokes Amazon Comprehend for each paragraph and stores the information of the header and its corresponding paragraph text and sentiment information into DynamoDB.
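A simplified sketch of the font-size technique follows (this is not the actual get_headers_to_child_mapping code): treat any line noticeably taller than the median line height as a header, and attach the lines that follow it as its paragraph. The 1.3 ratio is an assumption you would tune for your documents:

```python
from statistics import median

def map_headers_to_paragraphs(lines, ratio=1.3):
    """Group lines into {header: paragraph text} using line height as a font-size proxy.

    `lines` are dicts with "text" and "height" keys, in reading order.
    """
    body_height = median(line["height"] for line in lines)
    mapping, current = {}, None
    for line in lines:
        if line["height"] >= ratio * body_height:
            current = line["text"]          # a taller line starts a new header
            mapping[current] = []
        elif current is not None:
            mapping[current].append(line["text"])
    return {header: " ".join(texts) for header, texts in mapping.items()}
```

Any text that appears before the first header is dropped by this sketch; production code would need a policy for it.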

Identify paragraphs based on indentation

As a second technique, we explain how to derive headers and paragraphs based on the left indentation of the text. In the following document, headers are aligned at the left of the page, and all the paragraph text is indented further to the right. You can download this sample PDF on GitHub.

In this document, the main difference between the header and paragraph is the left indentation. Similar to the process described earlier, first we need to get the line numbers and indentation information. After we have this information, all we have to do is separate the text based on the indentation and extract the text between two headers by using line numbers.

The step-by-step flow is as follows:

  1. A Lambda function gets triggered whenever a file drop event occurs using the textract-invocation Lambda function.
  2. Amazon Textract completes the process of text detection and sends a notification to the SNS topic.
  3. The blog-code-format2.py function gets triggered based on the SNS notification.
  4. Lambda uses the method get_text_results_from_textract from lambda-helper.py and extracts the complete text by calling Amazon Textract repeatedly for all the pages.
  5. After the text is extracted, we use the method get_text_with_required_info to identify the bounding box data and create a mapping of line number, left indentation, and font size for each line of the extracted text.
  6. The get_header_info method uses the bounding box data to get the line numbers of all the headers.
  7. After the headers and their line numbers are identified, we use the get_header_to_paragraph_data method to get the complete text for each paragraph and create a mapping with each header and its corresponding paragraph text.
  8. With the header and paragraph information collected, we use the update_paragraphs_info_in_dynamodb method to invoke Amazon Comprehend for each paragraph and store the header, its corresponding paragraph text, and the sentiment information in DynamoDB.
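The indentation technique can be sketched the same way (again a simplified illustration, not the blog-code-format2.py code): lines at the leftmost indentation become headers, and indented lines are attached to the current header. The tolerance value is an assumption to absorb small bounding box jitter:

```python
def map_headers_by_indentation(lines, tolerance=0.005):
    """Group lines into {header: paragraph text} using left indentation.

    `lines` are dicts with "text" and "left" keys, in reading order;
    "left" is the bounding box Left ratio from Amazon Textract.
    """
    min_left = min(line["left"] for line in lines)
    mapping, current = {}, None
    for line in lines:
        if line["left"] <= min_left + tolerance:
            current = line["text"]          # leftmost lines are headers
            mapping[current] = []
        elif current is not None:
            mapping[current].append(line["text"])
    return {header: " ".join(texts) for header, texts in mapping.items()}
```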

Identify paragraphs based on line spacing

Similar to the preceding approach, we can use line spacing to segment the paragraphs of a page. We calculate line spacing using the top indent: the difference between the top indentation of the current line and that of the next or previous line gives us the line spacing. When the line spacing is noticeably larger than usual, we can start a new segment. The detailed code can be found on GitHub. You can also download the sample document from GitHub.
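This idea can be sketched as follows (a simplified illustration, not the repo code): compare each vertical gap against the typical gap and start a new paragraph when it is unusually large. The 1.8 ratio is an assumption to tune per layout:

```python
from statistics import median

def split_paragraphs_by_spacing(lines, gap_ratio=1.8):
    """Split lines into paragraphs where the vertical gap to the previous line is unusually large.

    `lines` are dicts with "text" and "top" keys (bounding box Top ratio), in reading order.
    """
    paragraphs, current = [], [lines[0]["text"]]
    gaps = [b["top"] - a["top"] for a, b in zip(lines, lines[1:])]
    typical_gap = median(gaps)
    for gap, line in zip(gaps, lines[1:]):
        if gap > gap_ratio * typical_gap:
            paragraphs.append(" ".join(current))   # large gap closes the current paragraph
            current = []
        current.append(line["text"])
    paragraphs.append(" ".join(current))
    return paragraphs
```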

Identify segments or paragraphs based on full stops

We also provide a technique to extract segments or paragraphs of the document based on full stops. Consider the preceding document as an example. After the Amazon Textract response is parsed and the lines are separated, we can iterate through each line; whenever we find a line that ends with a full stop, we consider it the end of a paragraph, and any line thereafter is part of the next paragraph. This is another helpful technique for identifying various segments of the document. The code to perform this can be found on GitHub.
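The full-stop technique reduces to a short loop; this is a simplified sketch of the idea rather than the repo code:

```python
def split_paragraphs_by_full_stops(lines):
    """A line ending in a full stop closes the current paragraph; the next line starts a new one.

    `lines` is a list of line strings in reading order.
    """
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if line.rstrip().endswith("."):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing text with no closing full stop still forms a paragraph
        paragraphs.append(" ".join(current))
    return paragraphs
```

Note this treats abbreviations like "Dr." as paragraph ends, so it works best on review-style text with simple sentences.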

Get the sentiment of paragraphs or segments of the page

As we described in the preceding sections, we can collect the text using various techniques. After the list of paragraphs is identified, we can use Amazon Comprehend to get the sentiment of each paragraph and the key phrases of the text. Amazon Comprehend can provide intelligent insights based on the text, which is valuable to businesses because understanding the sentiment of each segment can drive action.
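The per-paragraph sentiment call can be sketched as follows; the Comprehend client is passed in as a parameter for illustration (in practice, boto3.client("comprehend")):

```python
def get_paragraph_sentiment(comprehend_client, text):
    """Return the dominant sentiment (POSITIVE, NEGATIVE, NEUTRAL, or MIXED) for one paragraph."""
    response = comprehend_client.detect_sentiment(Text=text, LanguageCode="en")
    return response["Sentiment"]
```

The response also includes a SentimentScore with per-label confidence values, which you could store alongside the label if you want to threshold weak classifications.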

Query sentiments per paragraph in DynamoDB

After you process the file, you can query the results for each paragraph.

  1. On the DynamoDB console, choose Tables in the navigation pane.

You should see two tables:

  • Textract-job-details – Contains information about the Amazon Textract processing job
  • Textract-post-process-data – Contains the sentiment of each paragraph header

  2. Choose the Textract-post-process-data table.

You can see a mix of review sentiments.

  3. Scan or query the table to find the negative customer reviews.

The DynamoDB table data looks like the following screenshot, with the file path, header, paragraph data, and sentiment for each paragraph.
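A hypothetical scan for the negative reviews might look like the following; the attribute name sentiment is an assumption about the table schema, and the table object would be a boto3 DynamoDB Table resource in practice:

```python
def get_negative_reviews(dynamodb_table):
    """Scan the results table for paragraphs whose sentiment came back NEGATIVE.

    The attribute name "sentiment" is assumed; adjust it to the actual table schema.
    """
    response = dynamodb_table.scan(
        FilterExpression="sentiment = :s",
        ExpressionAttributeValues={":s": "NEGATIVE"},
    )
    return response["Items"]
```

A scan reads the whole table, which is fine for a demo-sized dataset; at scale you would add a secondary index on the sentiment attribute and query it instead.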

Conclusion

This post demonstrated how to extract and process data from a PDF and visualize it to review sentiments. We separated the headers and paragraphs via custom coding and ran sentiment analysis for each section separately.

Processing scanned image documents helps you uncover large amounts of data, which can provide meaningful insights. With managed ML services like Amazon Textract and Amazon Comprehend, you can gain insights into your previously undiscovered data. For example, you can build a custom application to get text from a scanned legal document, purchase receipts, and purchase orders.

If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!


About the Authors

Srinivasarao Daruna is a Data Lab Architect at Amazon Web Services and comes from a strong big data and analytics background. In his role, he helps customers with architecture and solutions to their business problems. He enjoys learning new things and solving complex problems for customers.

Mona Mona is a Senior AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with public sector customers and helps them adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML and has published multiple blog posts on these topics in the AWS AI/ML Blogs.

Divyesh Sah is a Sr. Enterprise Solutions Architect in AWS focusing on financial services customers, helping them with cloud transformation initiatives in the areas of migrations, application modernization, and cloud-native solutions. He has over 18 years of technical experience specializing in AI/ML, databases, big data, containers, and BI and analytics. Prior to AWS, he has experience in areas of sales, program management, and professional services.

Sandeep Kariro is an Enterprise Solutions Architect in the Telecom space. Having worked in cloud technologies for over 7 years, Sandeep provides strategic and tactical guidance to enterprise customers around the world. Sandeep also has in-depth experience in data-centric design solutions optimal for cloud deployments while keeping cost, security, compliance, and operations as top design principles. He loves traveling around the world and has traveled to several countries around the globe in the last decade.
