Improve data extraction and document processing with Amazon Textract

Intelligent document processing (IDP) has seen widespread adoption across enterprise and government organizations. Gartner estimates the IDP market will grow more than 100% year over year, and is projected to reach $4.8 billion in 2022.

IDP helps transform structured, semi-structured, and unstructured data from a variety of document formats into actionable information. Processing unstructured data has become much easier with the advancements in optical character recognition (OCR), machine learning (ML), and natural language processing (NLP).

IDP techniques have grown tremendously, allowing us to extract, classify, identify, and process unstructured data. With AI/ML-powered services such as Amazon Textract, Amazon Transcribe, and Amazon Comprehend, building an IDP solution has become much easier and doesn’t require specialized AI/ML skills.

In this post, we demonstrate how to use Amazon Textract to extract meaningful, actionable data from a wide range of complex multi-format PDF files. PDF files are challenging; they can have a variety of data elements like headers, footers, tables with data in multiple columns, images, graphs, and sentences and paragraphs in different formats. We explore the data extraction phase of IDP, and how it connects to the other steps of a document processing workflow, such as ingestion and postprocessing.

Solution overview

Amazon Textract provides various options for data extraction, based on your use case. You can use forms, tables, query-based extractions, handwriting recognition, invoices and receipts, identity documents, and more. All the extracted data is returned with bounding box coordinates. This solution uses Amazon Textract IDP CDK constructs to build the document processing workflow that handles Amazon Textract asynchronous invocation, raw response extraction, and persistence in Amazon Simple Storage Service (Amazon S3). This solution adds an Amazon Textract postprocessing component to the base workflow to handle paragraph-based text extraction.
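To make the bounding box point concrete, the following is a minimal sketch of reading line-level text and coordinates from a text detection response. The helper name iter_lines and the sample response are assumptions for illustration; only the Blocks, BlockType, Page, Text, and Geometry.BoundingBox fields come from the Amazon Textract response format.

```python
from typing import Dict, Iterator, Tuple


def iter_lines(response: Dict) -> Iterator[Tuple[int, str, Dict[str, float]]]:
    """Yield (page, text, bounding_box) for each LINE block in a response."""
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        # Bounding box values are ratios of the page width and height (0 to 1).
        box = block["Geometry"]["BoundingBox"]
        # Page is set in asynchronous results; fall back to 1 if it is absent.
        yield block.get("Page", 1), block["Text"], box


# Hand-written sample in the shape of a text detection response.
# The text and coordinates are illustrative, not real output.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1,
         "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}}},
        {"BlockType": "LINE", "Page": 1, "Text": "Solution overview",
         "Geometry": {"BoundingBox": {"Width": 0.3, "Height": 0.02, "Left": 0.1, "Top": 0.12}}},
        {"BlockType": "WORD", "Page": 1, "Text": "Solution",
         "Geometry": {"BoundingBox": {"Width": 0.14, "Height": 0.02, "Left": 0.1, "Top": 0.12}}},
    ]
}

for page, text, box in iter_lines(sample_response):
    print(page, text, round(box["Left"], 2), round(box["Top"], 2))
```

Working at the LINE level keeps the position of each line on the page, which is the information a paragraph-oriented postprocessing step needs.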

The following diagram shows the document processing flow.

The document processing flow contains the following steps:

  1. The document extraction flow is initiated when a user uploads a PDF document to Amazon S3.
  2. An S3 event notification, raised when a new object is created under the uploads/ prefix, triggers the AWS Step Functions asynchronous workflow.
  3. The AWS Lambda function SimpleAsyncWorkflow Decider validates the PDF document. This step prevents processing invalid documents.
  4. TextractAsync is an IDP CDK construct that abstracts the invocation of the Amazon Textract Async API, handling Amazon Simple Notification Service (Amazon SNS) messages and workflow processing. The following are some high-level steps:
    1. The construct invokes the asynchronous Amazon Textract StartDocumentTextDetection API.
    2. Amazon Textract processes the PDF file and publishes a completion status event to an Amazon SNS topic.
    3. Amazon Textract stores the paginated results in Amazon S3.
    4. The construct handles the Amazon Textract completion event and returns the paginated results output prefix to the main workflow.
  5. The Textract Postprocessor Lambda function uses the extracted content in the results Amazon S3 bucket to retrieve the document data. This function iterates through all the files, and extracts data using bounding boxes and other metadata. It performs various postprocessing optimizations to aggregate paragraph data, identify and ignore headers and footers, combine sentences spread across pages, process data in multiple columns, and more.
  6. The Textract Postprocessor Lambda function persists the aggregated paragraph data as a CSV file in Amazon S3.
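For readers who want to see what the TextractAsync construct abstracts in step 4, the following is a rough sketch of the underlying calls, not the construct's actual code. The function names, the client passed as a parameter, and all bucket, prefix, and ARN values are placeholders assumed for illustration; start_document_text_detection and its DocumentLocation, NotificationChannel, and OutputConfig parameters are part of the boto3 Amazon Textract client.

```python
import json
from typing import Any, Tuple


def start_text_detection(textract: Any, bucket: str, key: str,
                         sns_topic_arn: str, role_arn: str,
                         output_bucket: str, output_prefix: str) -> str:
    """Start an asynchronous text detection job and return its JobId."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # Amazon Textract publishes the completion status to this topic.
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
        # Amazon Textract writes the paginated results under this prefix.
        OutputConfig={"S3Bucket": output_bucket, "S3Prefix": output_prefix},
    )
    return response["JobId"]


def parse_completion(sns_message: str) -> Tuple[str, bool]:
    """Return (job_id, succeeded) from the JSON body of a completion message."""
    body = json.loads(sns_message)
    return body["JobId"], body["Status"] == "SUCCEEDED"


# Usage, assuming boto3 is installed and the placeholder names are replaced:
#   import boto3
#   job_id = start_text_detection(boto3.client("textract"), "my-bucket",
#                                 "uploads/sample_climate_change.pdf", topic_arn,
#                                 role_arn, "my-bucket", "textract-output")
```

Passing the client in as an argument keeps the sketch easy to exercise with a stub; in the deployed solution, the construct performs these calls and routes the completion event back into the Step Functions workflow for you.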

Deploy the solution with the AWS CDK

To deploy the solution, launch the AWS Cloud Development Kit (AWS CDK) using AWS Cloud9 or from your local system. If you’re launching from your local system, you need to have the AWS CDK and Docker installed. Follow the instructions in the GitHub repo for deployment.

The stack creates the key components depicted in the architecture diagram.

Test the solution

The GitHub repo contains the following sample files:

  • sample_climate_change.pdf – Contains headers, footers, and sentences flowing across pages
  • sample_multicolumn.pdf – Contains data in two columns, headers, footers, and sentences flowing across pages

To test the solution, complete the following steps:

  1. Upload the sample PDF files to the S3 bucket created by the stack. The file upload triggers the Step Functions workflow through an S3 event notification.
    aws s3 cp sample_climate_change.pdf s3://{bucketname}/uploads/sample_climate_change.pdf
    
    aws s3 cp sample_multicolumn.pdf s3://{bucketname}/uploads/sample_multicolumn.pdf

  2. Open the Step Functions console to view the workflow status. You should find one workflow instance per document.
  3. Wait for all three steps to complete.
  4. On the Amazon S3 console, browse to the S3 prefix mentioned in the JSON path TextractTempOutputJsonPath. The following screenshot of the Amazon S3 console shows the paginated results (in this case, objects 1 and 2) created by Amazon Textract. The postprocessing task stores the extracted paragraphs from the sample PDF as extracted-text.csv.
  5. Download the extracted-text.csv file to view the extracted content.
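If you prefer to inspect the paginated results from step 4 in code rather than on the console, the following sketch reads them back in order. It assumes the result objects are named with integers (1, 2, and so on) as described above and that each one holds JSON with a Blocks list; the function name and the bucket and prefix arguments are placeholders, and this is not the postprocessor's actual code.

```python
import json
from typing import Any, Dict, List


def load_textract_blocks(s3: Any, bucket: str, prefix: str) -> List[Dict]:
    """Read the numbered result objects under a prefix and return all blocks."""
    numbered_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            # Assumption: only objects with an integer name hold results.
            if name.isdigit():
                numbered_keys.append((int(name), obj["Key"]))

    blocks: List[Dict] = []
    # Sort numerically so that object 10 comes after object 9, not after 1.
    for _, key in sorted(numbered_keys):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        blocks.extend(json.loads(body).get("Blocks", []))
    return blocks


# Usage, assuming boto3 is installed and the placeholders are replaced:
#   import boto3
#   blocks = load_textract_blocks(boto3.client("s3"), "my-bucket", "my/output/prefix/")
```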

The sample_climate_change.pdf file has sentences flowing across pages, as shown in the following screenshot.

The postprocessor identifies and ignores the header and footer, and combines the text across pages into one paragraph. The extracted text for the combined paragraph should look like:

“Impacts on this scale could spill over national borders, exacerbating the damage further. Rising sea levels and other climate-driven changes could drive millions of people to migrate: more than a fifth of Bangladesh could be under water with a 1m rise in sea levels, which is a possibility by the end of the century. Climate-related shocks have sparked violent conflict in the past, and conflict is a serious risk in areas such as West Africa, the Nile Basin and Central Asia.”
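The behavior described above can be approximated with two simple heuristics, sketched here under stated assumptions rather than taken from the solution's code. The margin thresholds, the simplified line records (dictionaries with text, top, and height keys derived from bounding boxes), and the rule "a paragraph without terminal punctuation continues on the next page" are all illustrative choices.

```python
from typing import Dict, List

# Assumed margins, as ratios of page height; tune them for your documents.
HEADER_LIMIT = 0.08
FOOTER_LIMIT = 0.92
TERMINATORS = (".", "!", "?", "\u201d", '"')


def strip_headers_footers(lines: List[Dict]) -> List[Dict]:
    """Drop lines that sit entirely in the assumed header or footer band."""
    kept = []
    for line in lines:
        top = line["top"]
        bottom = top + line["height"]
        if top < HEADER_LIMIT or bottom > FOOTER_LIMIT:
            continue
        kept.append(line)
    return kept


def merge_across_pages(pages: List[List[str]]) -> List[str]:
    """Join a page's first paragraph to the previous one if that one is unfinished."""
    merged: List[str] = []
    for paragraphs in pages:
        for index, text in enumerate(paragraphs):
            continues = (
                index == 0
                and bool(merged)
                and not merged[-1].rstrip().endswith(TERMINATORS)
            )
            if continues:
                merged[-1] = merged[-1].rstrip() + " " + text.lstrip()
            else:
                merged.append(text)
    return merged


# Illustrative input: the page break position is made up for this example.
pages = [
    ["Rising sea levels and other climate-driven changes could drive"],
    ["millions of people to migrate.", "A new paragraph starts here."],
]
print(merge_across_pages(pages))
```

A punctuation check is a deliberately simple signal; a paragraph that happens to end exactly at the bottom of a page is left alone, while headings without punctuation would need extra handling.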

The sample_multicolumn.pdf file has two columns of text with headers and footers, as shown in the following screenshot.

The postprocessor identifies and ignores the header and footer, processes the text in the columns from left to right, and combines incomplete sentences across pages. The extracted text should construct paragraphs from text in the left column and separate paragraphs from text in the right column. The last line in the right column is incomplete on that page and continues in the left column of the next page; the postprocessor should combine them as one paragraph.
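One way to picture the column handling is as a sort over line positions. The sketch below is an assumption-laden illustration, not the postprocessor's implementation: it uses simplified line records (page, left, top, and text keys derived from bounding boxes) and a fixed split at the middle of the page, which only suits a clean two-column layout.

```python
from typing import Dict, List, Tuple

# Assumed split between columns, as a ratio of page width.
COLUMN_SPLIT = 0.5


def reading_order(lines: List[Dict]) -> List[Dict]:
    """Order lines by page, then left column before right, then top to bottom."""
    def sort_key(line: Dict) -> Tuple[int, int, float]:
        column = 0 if line["left"] < COLUMN_SPLIT else 1
        return (line["page"], column, line["top"])

    return sorted(lines, key=sort_key)


# Illustrative records; the coordinates are made up.
lines = [
    {"page": 1, "left": 0.55, "top": 0.10, "text": "right column, first line"},
    {"page": 1, "left": 0.08, "top": 0.10, "text": "left column, first line"},
    {"page": 2, "left": 0.08, "top": 0.10, "text": "next page, left column"},
    {"page": 1, "left": 0.08, "top": 0.14, "text": "left column, second line"},
]
for line in reading_order(lines):
    print(line["text"])
```

With this ordering, the last line of the right column on one page is immediately followed by the first line of the left column on the next page, which is what lets a later step join a sentence that spans the page break. Full-width elements such as titles would need separate treatment.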

Cost

With Amazon Textract, you pay as you go based on the number of pages in the document. Refer to Amazon Textract pricing for actual costs.

Clean up

When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. This helps you avoid continuing costs in your account.

Conclusion

You can use the solution presented in this post to build an efficient document extraction workflow and process the extracted document according to your needs. If you’re building an intelligent document processing system, you can further process the extracted document using Amazon Comprehend to get more insights about the document.
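As a pointer for that next step, the following is a small sketch of sending extracted paragraphs to Amazon Comprehend for entity detection. The function name, the score threshold, and the client passed as a parameter are assumptions for illustration; detect_entities with the Text and LanguageCode parameters is part of the boto3 Amazon Comprehend client.

```python
from typing import Any, Dict, List


def detect_entities(comprehend: Any, paragraphs: List[str],
                    language_code: str = "en",
                    min_score: float = 0.8) -> List[Dict]:
    """Return entities found in each paragraph at or above min_score."""
    results: List[Dict] = []
    for text in paragraphs:
        if not text.strip():
            continue  # the API rejects empty text, so skip blank paragraphs
        response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
        for entity in response["Entities"]:
            if entity["Score"] >= min_score:
                results.append({
                    "text": entity["Text"],
                    "type": entity["Type"],
                    "score": entity["Score"],
                })
    return results


# Usage, assuming boto3 is installed and paragraphs were read from extracted-text.csv:
#   import boto3
#   entities = detect_entities(boto3.client("comprehend"), paragraphs)
```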

For more information about Amazon Textract, visit Amazon Textract resources to find video resources and blog posts, and refer to Amazon Textract FAQs. For more information about the IDP reference architecture, refer to Intelligent Document Processing. Please share your thoughts with us in the comments section, or in the issues section of the project’s GitHub repository.


About the Author

Sathya Balakrishnan is a Sr. Customer Delivery Architect in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.
