Build a receipt and invoice processing pipeline with Amazon Textract

In today’s business landscape, organizations are constantly seeking ways to optimize their financial processes, enhance efficiency, and drive cost savings. One area that holds significant potential for improvement is accounts payable. At a high level, the accounts payable process includes receiving and scanning invoices, extracting the relevant data from scanned invoices, validation, approval, and archival. The second step (extraction) can be complex. Each invoice and receipt looks different. The labels are imperfect and inconsistent. The most important pieces of information, such as price, vendor name, vendor address, and payment terms, are often not explicitly labeled and have to be interpreted based on context. The traditional approach of using human reviewers to extract the data is time-consuming, error-prone, and not scalable.

In this post, we show how to automate the accounts payable process using Amazon Textract for data extraction. We also provide a reference architecture to build an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.

Solution overview

The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a document capture stage to securely collect and store scanned invoices and receipts. The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on. In the next stage, you use predefined expense rules to determine if you should automatically approve or reject the receipt. Approved and rejected documents go to their respective folders within the Amazon Simple Storage Service (Amazon S3) bucket. For approved documents, you can search all the extracted fields and values using Amazon OpenSearch Service. You can visualize the indexed metadata using OpenSearch Dashboards. Approved documents are also set up to be moved to Amazon S3 Intelligent-Tiering for long-term retention and archival using S3 lifecycle policies.

Solution Architecture

The following sections take you through the process of creating the solution.

Prerequisites

To deploy this solution, you must have the following:

  • An AWS account.
  • An AWS Cloud9 environment. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal.

To create the AWS Cloud9 environment, provide a name and description. Keep everything else as default. Choose the IDE link on the AWS Cloud9 console to navigate to the IDE. You’re now ready to use the AWS Cloud9 environment.

Deploy the solution

To set up the solution, you use the AWS Cloud Development Kit (AWS CDK) to deploy an AWS CloudFormation stack.

  1. In your AWS Cloud9 IDE terminal, clone the GitHub repository and install the dependencies. Run the following commands to deploy the InvoiceProcessor stack:
git clone https://github.com/aws-samples/amazon-textract-invoice-processor.git
cd amazon-textract-invoice-processor
pip install -r requirements.txt
cdk bootstrap
cdk deploy

The deployment takes around 25 minutes with the default configuration settings from the GitHub repo. Additional output information is also available on the AWS CloudFormation console.

  2. After the AWS CDK deployment is complete, create expense validation rules in an Amazon DynamoDB table. You can use the same AWS Cloud9 terminal to run the following commands:
aws dynamodb execute-statement --statement "INSERT INTO "$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)" VALUE {'ruleId': 1, 'type': 'regex', 'field': 'INVOICE_RECEIPT_ID', 'check': '(?i)[0-9]{3}[a-z]{3}[0-9]{3}$', 'errorTxt': 'Receipt number is not valid. It is of the format: 123ABC456'}"
aws dynamodb execute-statement --statement "INSERT INTO "$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)" VALUE {'ruleId': 2, 'type': 'regex', 'field': 'PO_NUMBER', 'check': '(?i)[a-z0-9]+$', 'errorTxt': 'PO number is not present'}"
  3. In the S3 bucket that starts with invoiceprocessorworkflow-invoiceprocessorbucketf1-*, create an uploads folder.

In Amazon Cognito, you should already have an existing user pool called OpenSearchResourcesCognitoUserPool*. We use this user pool to create a new user.

  4. On the Amazon Cognito console, navigate to the user pool OpenSearchResourcesCognitoUserPool*.
  5. Create a new Amazon Cognito user.
  6. Provide a user name and password of your choice and note them for later use.
  7. Upload the documents random_invoice1 and random_invoice2 to the S3 uploads folder to start the workflows.

Now let’s dive into each of the document processing steps.

Document capture

Customers handle invoices and receipts in a multitude of formats from different vendors. These documents are received through channels like hard copies, scanned copies uploaded to file storage, or shared storage devices. In the document capture stage, you store all scanned copies of receipts and invoices in highly scalable storage, such as an S3 bucket.

Upload sample invoices

Extraction

The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related fields such as vendor name, invoice receipt date, order date, and amount due or paid.

AnalyzeExpense is an API dedicated to processing invoices and receipts. It is available as both a synchronous and an asynchronous API. The synchronous API allows you to send images as bytes, and the asynchronous API allows you to send files in JPG, PNG, TIFF, and PDF formats. The AnalyzeExpense API response consists of three distinct sections (a short usage sketch follows this list):

  • Summary fields – This section includes both normalized keys and explicitly mentioned keys along with their values. AnalyzeExpense normalizes the keys for contact-related information such as vendor name and vendor address, tax ID-related keys such as taxpayer ID, payment-related keys such as amount due and discount, and general keys such as invoice ID, delivery date, and account number. Keys that are not normalized still appear in the summary fields as key-value pairs. For a complete list of supported expense fields, refer to Analyzing Invoices and Receipts.
  • Line items – This section includes normalized line item keys such as item description, unit price, quantity, and product code.
  • OCR block – This block contains the raw text extracted from the invoice page, which can be used for postprocessing and for identifying information that is not covered by the summary and line item fields.
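
As a quick illustration of how the response is structured, the following Python sketch (using boto3, with a placeholder bucket and object key) calls AnalyzeExpense on an image stored in Amazon S3 and prints the summary fields and line items. It only sketches the API shape and is not the code used by the pipeline’s Lambda functions.

import boto3

# Placeholder bucket and object key for an invoice image uploaded to S3
bucket = "my-invoice-bucket"
key = "uploads/random_invoice1.png"

textract = boto3.client("textract")
response = textract.analyze_expense(
    Document={"S3Object": {"Bucket": bucket, "Name": key}}
)

for doc in response["ExpenseDocuments"]:
    # Summary fields: normalized type (for example VENDOR_NAME) and detected value
    for field in doc["SummaryFields"]:
        field_type = field.get("Type", {}).get("Text", "")
        value = field.get("ValueDetection", {}).get("Text", "")
        print(f"{field_type}: {value}")
    # Line items: per-row fields such as ITEM, PRICE, and QUANTITY
    for group in doc.get("LineItemGroups", []):
        for line_item in group.get("LineItems", []):
            row = {
                f["Type"]["Text"]: f["ValueDetection"]["Text"]
                for f in line_item.get("LineItemExpenseFields", [])
                if "Type" in f and "ValueDetection" in f
            }
            print(row)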

This post uses the Amazon Textract IDP CDK constructs (AWS CDK components to define infrastructure for intelligent document processing (IDP) workflows), which allow you to build use case-specific, customizable IDP workflows. The constructs and samples are a collection of components that enable the definition of IDP processes on AWS; they are published on GitHub. The main concepts used are the AWS CDK constructs, the actual AWS CDK stacks, and AWS Step Functions.

The following figure shows the Step Functions workflow.

Step function workflow

The extraction workflow includes the following steps:

  • InvoiceProcessor-Decider – An AWS Lambda function that verifies if the input document format is supported by Amazon Textract. For more details about supported formats, refer to Input Documents.
  • DocumentSplitter – A Lambda function that splits large multi-page documents into chunks of at most 2,500 pages so they can be processed.
  • Map State – A Step Functions Map state that processes each chunk in parallel.
  • TextractAsync – This task calls Amazon Textract using the asynchronous API following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and uses OutputConfig to store the Amazon Textract JSON output to the S3 bucket you created earlier. It consists of two Lambda functions: one to submit the document for processing and one that is triggered on the SNS notification.
  • TextractAsyncToJSON2 – Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.

We discuss the details of the next three steps in the following sections.

Verification and approval

For the verification stage, the SetMetaData Lambda function verifies whether the uploaded file is a valid expense according to the rules previously configured in the DynamoDB table. For this post, you use the following sample rules:

  • Verification is successful if INVOICE_RECEIPT_ID is present and matches the regex (?i)[0-9]{3}[a-z]{3}[0-9]{3}$ and if PO_NUMBER is present and matches the regex (?i)[a-z0-9]+$.
  • Verification is unsuccessful if either PO_NUMBER or INVOICE_RECEIPT_ID is incorrect or missing from the document.

After the files are processed, the expense verification function moves the input files to either approved or declined folders in the same S3 bucket.
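
As an illustration of how such rules can be applied, the following sketch checks a dictionary of extracted summary fields against rules shaped like the DynamoDB items created earlier. The helper name, sample values, and the use of re.search are assumptions for illustration; this is not the actual SetMetaData Lambda code.

import re

# Rules with the same shape as the DynamoDB items created earlier
rules = [
    {"ruleId": 1, "type": "regex", "field": "INVOICE_RECEIPT_ID",
     "check": r"(?i)[0-9]{3}[a-z]{3}[0-9]{3}$",
     "errorTxt": "Receipt number is not valid. It is of the format: 123ABC456"},
    {"ruleId": 2, "type": "regex", "field": "PO_NUMBER",
     "check": r"(?i)[a-z0-9]+$",
     "errorTxt": "PO number is not present"},
]

def validate_expense(summary_fields, rules):
    """Return (approved, errors) for a dict mapping field names to extracted values."""
    errors = []
    for rule in rules:
        value = summary_fields.get(rule["field"], "")
        # Assumption for illustration: regex rules are applied with re.search
        if rule["type"] == "regex" and not re.search(rule["check"], value):
            errors.append(rule["errorTxt"])
    return len(errors) == 0, errors

# Hypothetical extracted values
approved, errors = validate_expense(
    {"INVOICE_RECEIPT_ID": "123ABC456", "PO_NUMBER": "PO98765"}, rules
)
print("approved" if approved else errors)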

S3 output

For the purposes of this solution, we use DynamoDB to store the expense validation rules. However, you can modify this solution to integrate with your own or commercial expense validation or management solutions.

Intelligent index and search

With the OpenSearchPushInvoke Lambda function, the extracted expense metadata is pushed to an OpenSearch Service index and is available for search.

The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions quota of maximum input or output size for a task, state, or workflow run.

After the OpenSearch Service index is created, you can search for keywords from the extracted text via OpenSearch Dashboards.
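
A minimal sketch of indexing extracted metadata with the opensearch-py client is shown below. The endpoint, credentials, index name, and document shape are placeholders; in the deployed solution, the OpenSearchPushInvoke function handles this step and access is governed by the Amazon Cognito user pool.

from opensearchpy import OpenSearch

# Placeholder endpoint and credentials
client = OpenSearch(
    hosts=[{"host": "my-opensearch-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

# Hypothetical document holding a few extracted expense fields
document = {
    "VENDOR_NAME": "AnyCompany",
    "INVOICE_RECEIPT_ID": "123ABC456",
    "AMOUNT_DUE": "1,450.00",
    "status": "approved",
}

client.index(index="invoices", body=document, id="uploads/random_invoice1.png")
print(client.search(index="invoices", q="AnyCompany"))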

OpenSearch document search

Archival, audit, and analytics

To manage the lifecycle and archival of invoices and receipts, you can configure S3 lifecycle rules to transition S3 objects from Standard to Intelligent-Tiering storage classes. S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they haven’t been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier without performance impact or operational overhead.
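
The deployed stack configures this transition for you; for reference, an equivalent lifecycle rule could be applied with boto3 as in the following sketch (the bucket name and prefix are placeholders).

import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and prefix
s3.put_bucket_lifecycle_configuration(
    Bucket="invoiceprocessorworkflow-invoiceprocessorbucket-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-approved-invoices",
                "Status": "Enabled",
                "Filter": {"Prefix": "approved/"},
                # Move objects to Intelligent-Tiering right away; the storage class
                # then handles Infrequent Access and Archive Instant Access tiering
                # automatically based on access patterns.
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            }
        ]
    },
)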

For auditing and analytics, this solution uses OpenSearch Service for running analytics on invoice requests. OpenSearch Service enables you to effortlessly ingest, secure, search, aggregate, view, and analyze data for a number of use cases, such as log analytics, application search, enterprise search, and more.

Log in to OpenSearch Dashboards and navigate to Stack Management, Saved objects, then choose Import. Choose the invoices.ndjson file from the cloned repository and choose Import. This prepopulates indexes and builds the visualization.

OpenSearch import

Refresh the page and navigate to Home, Dashboard, and open Invoices. You can now select and apply filters and expand the time window to explore past invoices.

OpenSearch dashboard

Clean up

When you’re finished evaluating Amazon Textract for processing receipts and invoices, we recommend cleaning up any resources that you might have created. Complete the following steps:

  1. Delete all content from the S3 bucket invoiceprocessorworkflow-invoiceprocessorbucketf1-*.
  2. In AWS Cloud9, run the following commands to delete Amazon Cognito resources and CloudFormation stacks:
cognito_user_pool=$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-CognitoUserPoolId`].Value' --output text)
echo $cognito_user_pool
cdk destroy
aws cognito-idp delete-user-pool --user-pool-id $cognito_user_pool
  3. Delete the AWS Cloud9 environment that you created from the AWS Cloud9 console.

Conclusion

In this post, we provided an overview of how we can build an invoice automation pipeline using Amazon Textract for data extraction and create a workflow for validation, archival, and search. We provided code samples on how to use the AnalyzeExpense API for extraction of critical fields from an invoice.

To get started, sign in to the Amazon Textract console to try this feature. To learn more about Amazon Textract capabilities, refer to the Amazon Textract Developer Guide or Textract Resources. To learn more about IDP, refer to the IDP with AWS AI services Part 1 and Part 2 posts.


About the Authors

Sushant Pradhan is a Sr. Solutions Architect at Amazon Web Services, helping enterprise customers. His interests and experience include containers, serverless technology, and DevOps. In his spare time, Sushant enjoys spending time outdoors with his family.

Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.

Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.

Model Innovators: How Digital Twins Are Making Industries More Efficient

A manufacturing plant near Hsinchu, Taiwan’s Silicon Valley, is among facilities worldwide boosting energy efficiency with AI-enabled digital twins.

A virtual model can help streamline operations, maximizing throughput for its physical counterpart, say engineers at Wistron, a global designer and manufacturer of computers and electronics systems.

In the first of several use cases, the company built a digital copy of a room where NVIDIA DGX systems undergo thermal stress tests (pictured above). Early results were impressive.

Making Smart Simulations

Using NVIDIA Modulus, a framework for building AI models that understand the laws of physics, Wistron created digital twins that let them accurately predict the airflow and temperature in test facilities that must remain between 27 and 32 degrees C.

A simulation that would’ve taken nearly 15 hours with traditional methods on a CPU took just 3.3 seconds on an NVIDIA GPU running inference with an AI model developed using Modulus, a whopping 15,000x speedup.

The results were fed into tools and applications built by Wistron developers with NVIDIA Omniverse, a platform for creating 3D workflows and applications based on OpenUSD.

A bird’s-eye view of the model of Wistron’s computer test room.

With their Omniverse-powered software, Wistron created realistic and immersive simulations that operators interact with via VR headsets. And thanks to the AI models they developed using Modulus, the airflows in the simulation obey the laws of physics.

“Physics-informed models let us control the test process and the room’s temperature remotely in near real time, saving time and energy,” said John Lu, a manufacturing operations director at Wistron.

Specifically, Wistron combined separate models for predicting air temperature and airflow to eliminate risks of overheating in the test room. It also created a recommendation system to identify the best locations to test computer baseboards.

The digital twin, linked to thousands of networked sensors, enabled Wistron to increase the facility’s overall energy efficiency up to 10%. That amounts to using up to 121,600 kWh less electricity a year, reducing carbon emissions by a whopping 60,192 kilograms.

An Expanding Effort

Currently, the group is expanding its AI model to track more than a hundred variables in a space that holds 50 computer racks. The team is also simulating all the mechanical details of the servers and testers.

“The final model will help us optimize test scheduling as well as the energy efficiency of the facilities’ air conditioning system,” said Derek Lai, a Wistron technical supervisor with expertise in physics-informed neural networks.

Looking ahead, “The tools and applications we’re building with Omniverse help us improve the layout of our DGX factories to provide the best throughput, further improving efficiency,” said Lu.

Efficiently Generating Energy

Half a world away, Siemens Energy is demonstrating the power of digital industrialization using Modulus and Omniverse.

The Munich-based company, whose technology generates one-sixth of the world’s electricity, achieved a 10,000x speedup simulating a heat-recovery steam generator using a physics-informed AI model (see video below).

Using a digital twin to detect corrosion early on, these massive systems can reduce downtime by 70%, potentially saving the industry $1.7 billion annually. By comparison, a standard simulation of the generator took half a month.

“The reduced computational time enables us to develop energy-efficient digital twins for a sustainable, reliable and affordable energy ecosystem,” said Georg Rollmann, head of advanced analytics and AI at Siemens Energy.

Digital Twins Drive Science and Industry

Automotive companies are applying the technology to the design of new cars and manufacturing plants. Scientists are using it in fields as diverse as astrophysics, genomics and weather forecasting. It’s even being used to create a digital twin of Earth to understand and mitigate the impacts of climate change.

Every year, physics simulations, typically run on supercomputer-class systems, consume an estimated 200 billion CPU core hours and 4 terawatt hours of energy. Physics-informed AI is accelerating these complex workflows 200x on average, saving time, cost and energy.

For more insights, listen to a talk from GTC describing Wistron’s work and a panel about industries using generative AI.

Learn more about the impact accelerated computing is having on sustainability.

Into the Omniverse: Groundbreaking OpenUSD Advancements Put NVIDIA GTC Spotlight on Developers

Editor’s note: This post is part of Into the Omniverse, a series focused on how artists, developers and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse.

The Universal Scene Description framework, aka OpenUSD, has emerged as a game-changer for building virtual worlds and accelerating creative workflows. It can ease the handling of complex datasets, facilitate collaboration and enable seamless interoperability between 3D applications.

The latest news and demos from NVIDIA GTC, a global AI conference that ran last week, put on display the power developers gain from NVIDIA Omniverse — a platform of application programming interfaces (APIs) and software development kits (SDKs) that enable them to build 3D pipelines, tools, applications and services.

Newly announced NVIDIA Omniverse Cloud APIs, coming first to Microsoft Azure, allow developers to send their OpenUSD industrial scenes from content-creation applications to the NVIDIA Graphics Delivery Network.

Such a workflow was showcased in a demo featuring an interactive car configurator application, developed by computer-generated-imagery studio Katana using Omniverse, streamed in full fidelity to an Apple Vision Pro’s high-resolution display. A designer wearing the Vision Pro toggled through paint and trim options, and even entered the vehicle.

In a separate demo, Dassault Systèmes showcased, using its 3DEXCITE portfolio, a powerful web-based application for 3D data preparation supercharged with NVIDIA AI and Omniverse Cloud APIs to deliver new generative storytelling capabilities.

OpenUSD also played a part in the announcement of NVIDIA’s latest AI supercomputer, a powerful cluster based on the NVIDIA GB200 NVL72 liquid-cooled system, which was showcased as a digital twin in Omniverse.

Engineers unified and visualized multiple computer-aided design datasets with full physical accuracy and photorealism using OpenUSD through the Cadence Reality digital twin platform, powered by Omniverse APIs. The technologies together provided a powerful computing platform for developing OpenUSD-based 3D tools, workflows and applications.

Siemens announced it has integrated OpenUSD into its Xcelerator platform applications via Omniverse Cloud APIs, enabling its customers to unify their 3D data and services in digital twins with physically based rendering.

A demo showcased how ship manufacturer HD Hyundai used Siemens’ Teamcenter X, which is part of Xcelerator, to design digital twins of complex engineering projects, delivering accelerated collaboration, minimized workflow waste, time and cost savings, and reduced manufacturing defects.

OpenUSD Ecosystem Updates on Replay

The latest OpenUSD ecosystem updates shared at GTC include:

  • Ansys is adopting OpenUSD and Omniverse Cloud APIs to enable data interoperability and NVIDIA RTX visualization in technologies such as Ansys AVxcelerate for autonomous vehicles, Ansys Perceive EM for 6G simulation, and NVIDIA-accelerated solvers such as Ansys Fluent.
  • Dassault Systèmes is using OpenUSD, Omniverse Cloud APIs and Shutterstock 3D AI Services for generative storytelling in 3DEXCITE applications.
  • Continental is developing an OpenUSD-based digital twin platform to optimize factory operations and speed time to market.
  • Hexagon is integrating reality-capture sensors and digital-reality platforms with OpenUSD and Omniverse Cloud APIs for hyperrealistic simulation and visualization.
  • Media.Monks is adopting Omniverse for a generative AI- and OpenUSD-enabled content-creation pipeline for scalable hyper-personalization.
  • Microsoft is integrating Omniverse Cloud APIs with Microsoft Power BI, so factory operators can see real-time factory data overlaid on a 3D digital twin to speed up production.
  • Rockwell Automation is using OpenUSD and Omniverse Cloud APIs for RTX-enabled visualization in industrial automation and digital transformation.
  • Trimble is enabling interactive NVIDIA Omniverse RTX viewers with Trimble model data using OpenUSD and Omniverse Cloud APIs.
  • Wistron is building OpenUSD-based digital twins of NVIDIA DGX and HGX factories using custom software developed with Omniverse SDKs and APIs.
  • WPP is expanding its Omniverse Cloud-based OpenUSD and generative AI content-generation engine to the retail and consumer packaged goods sector.

Get Plugged In to the World of OpenUSD

Several GTC sessions expanded on the latest OpenUSD advancements. Register free to watch them on demand.

Get started with NVIDIA Omniverse by downloading the standard license free, access OpenUSD resources, and learn how Omniverse Enterprise can connect your team. Stay up to date on Instagram, Medium and X. For more, join the Omniverse community on the forums, Discord server, Twitch and YouTube channels. 

Featured image courtesy of Siemens, HD Hyundai.

NVIDIA Blackwell and Automotive Industry Innovators Dazzle at NVIDIA GTC

Generative AI, in the data center and in the car, is making vehicle experiences safer and more enjoyable.

The latest advancements in automotive technology were on display last week at NVIDIA GTC, a global AI conference that drew tens of thousands of business leaders, developers and researchers from around the world.

The event kicked off with NVIDIA founder and CEO Jensen Huang’s keynote, which included the announcement of the NVIDIA Blackwell platform — purpose-built to power a new era of AI computing.

The NVIDIA Blackwell GPU architecture will be integrated into the NVIDIA DRIVE Thor centralized car computer to enable generative AI applications and immersive in-vehicle experiences. Large language models will be able to run in the car, enabling an intelligent copilot that understands and speaks in natural language.

BYD, the world’s largest electric-vehicle maker, announced it will adopt DRIVE Thor as the AI brain of its future fleets. In addition, the company will use NVIDIA’s AI infrastructure for cloud-based AI development and training, and the NVIDIA Isaac and NVIDIA Omniverse platforms to develop tools and applications for virtual factory planning and retail configurators. Hyper, Nuro, Plus, Waabi, WeRide and XPENG are also adopting DRIVE Thor.

Learn more about the automotive ecosystem’s announcements at GTC.

Some of the latest NVIDIA-powered vehicles displayed on the exhibition floor included:

  • Aurora self-driving truck, already on the highways of Texas
  • Lucid Air long-range electric sedan
  • Mercedes-Benz Concept CLA Class, showcasing what’s to come
  • Nuro R3, a fully autonomous robotic delivery model
  • Polestar 3, the SUV for the electric age
  • Volvo Cars EX90, its new fully electric, flagship SUV
  • And WeRide’s Robobus, a new form of urban mobility.
Mercedes-Benz Concept CLA Class.

The NVIDIA auto booth highlighted the wide adoption of the NVIDIA DRIVE platform, with displays featuring electronic control units from a variety of partners, including Bosch, Lenovo and ZEEKR.

A wide range of NVIDIA automotive partners, including Ansys, Foretellix, Lenovo, MediaTek, NODAR, OMNIVISION, Plus, Seyond, SoundHound, Voxel51 and Waabi, all made next-generation product announcements at GTC.

In addition, the automotive pavilion buzzed with interest in the latest lidar advancements from Luminar and Robosense, as well as Helm.ai’s software offerings for the level 2 to level 4 autonomous driving stack.

And other partners, such as Ford, Geely, General Motors, Jaguar Land Rover and Zoox, participated in dozens of sessions and panels covering topics such as building data center applications and developing safe autonomous vehicles. Watch the sessions on demand.

Learn more about the latest advancements in generative AI and automotive technology by watching Huang’s GTC keynote in replay.

Best practices for building secure applications with Amazon Transcribe

Amazon Transcribe is an AWS service that allows customers to convert speech to text in either batch or streaming mode. It uses machine learning–powered automatic speech recognition (ASR), automatic language identification, and post-processing technologies. Amazon Transcribe can be used for transcription of customer care calls, multiparty conference calls, and voicemail messages, as well as subtitle generation for recorded and live videos, to name just a few examples. In this blog post, you will learn how to power your applications with Amazon Transcribe capabilities in a way that meets your security requirements.

Some customers entrust Amazon Transcribe with data that is confidential and proprietary to their business. In other cases, audio content processed by Amazon Transcribe may contain sensitive data that needs to be protected to comply with local laws and regulations. Examples of such information are personally identifiable information (PII), personal health information (PHI), and payment card industry (PCI) data. In the following sections of the blog, we cover different mechanisms Amazon Transcribe has to protect customer data both in transit and at rest. We share the following seven security best practices to build applications with Amazon Transcribe that meet your security and compliance requirements:

  1. Use data protection with Amazon Transcribe
  2. Communicate over a private network path
  3. Redact sensitive data if needed
  4. Use IAM roles for applications and AWS services that require Amazon Transcribe access
  5. Use tag-based access control
  6. Use AWS monitoring tools
  7. Enable AWS Config

The following best practices are general guidelines and don’t represent a complete security solution. Because these best practices might not be appropriate or sufficient for your environment, use them as helpful considerations rather than prescriptions.

Best practice 1 – Use data protection with Amazon Transcribe

Amazon Transcribe conforms to the AWS shared responsibility model, which differentiates AWS responsibility for security of the cloud from customer responsibility for security in the cloud.

AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. As the customer, you are responsible for maintaining control over your content that is hosted on this infrastructure. This content includes the security configuration and management tasks for the AWS services that you use. For more information about data privacy, see the Data Privacy FAQ.

Protecting data in transit

Data encryption is used to make sure that data communication between your application and Amazon Transcribe remains confidential. The use of strong cryptographic algorithms protects data while it is being transmitted.

Amazon Transcribe can operate in one of two modes:

  • Streaming transcriptions allow media stream transcription in real time.
  • Batch transcription jobs allow transcription of audio files using asynchronous jobs.

In streaming transcription mode, client applications open a bidirectional streaming connection over HTTP/2 or WebSockets. An application sends an audio stream to Amazon Transcribe, and the service responds with a stream of text in real time. Both HTTP/2 and WebSockets streaming connections are established over Transport Layer Security (TLS), which is a widely accepted cryptographic protocol. TLS provides authentication and encryption of data in transit using AWS certificates. We recommend using TLS 1.2 or later.

In batch transcription mode, an audio file first needs to be put in an Amazon Simple Storage Service (Amazon S3) bucket. Then a batch transcription job referencing the S3 URI of this file is created in Amazon Transcribe. Both Amazon Transcribe in batch mode and Amazon S3 use HTTP/1.1 over TLS to protect data in transit.

All requests to Amazon Transcribe over HTTP and WebSockets must be authenticated using AWS Signature Version 4. It is recommended to use Signature Version 4 to authenticate HTTP requests to Amazon S3 as well, although authentication with older Signature Version 2 is also possible in some AWS Regions. Applications must have valid credentials to sign API requests to AWS services.

Protecting data at rest

Amazon Transcribe in batch mode uses S3 buckets to store both the input audio file and the output transcription file. Customers use an S3 bucket to store the input audio file, and it is highly recommended to enable encryption on this bucket. Amazon Transcribe supports the following S3 encryption methods:

  • Server-side encryption with Amazon S3 managed keys (SSE-S3)
  • Server-side encryption with AWS KMS keys (SSE-KMS)

Both methods encrypt customer data as it is written to disks and decrypt it when you access it using one of the strongest block ciphers available: 256-bit Advanced Encryption Standard (AES-256) GCM. When using SSE-S3, encryption keys are managed and regularly rotated by the Amazon S3 service. For additional security and compliance, SSE-KMS provides customers with control over encryption keys via AWS Key Management Service (AWS KMS). AWS KMS gives additional access controls because you have to have permissions to use the appropriate KMS keys in order to encrypt and decrypt objects in S3 buckets configured with SSE-KMS. Also, SSE-KMS provides customers with an audit trail capability that keeps records of who used your KMS keys and when.

The output transcription can be stored in the same or a different customer-owned S3 bucket. In this case, the same SSE-S3 and SSE-KMS encryption options apply. Another option for Amazon Transcribe output in batch mode is using a service-managed S3 bucket. Then output data is put in a secure S3 bucket managed by the Amazon Transcribe service, and you are provided with a temporary URI that can be used to download your transcript.
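
As an example, default SSE-KMS encryption can be enabled on a customer-owned bucket with a call like the following sketch (the bucket name and KMS key ARN are placeholders).

import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and KMS key ARN
s3.put_bucket_encryption(
    Bucket="my-transcribe-input-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
                },
                # Reduces the number of KMS requests for high-volume buckets
                "BucketKeyEnabled": True,
            }
        ]
    },
)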

Amazon Transcribe uses encrypted Amazon Elastic Block Store (Amazon EBS) volumes to temporarily store customer data during media processing. The customer data is cleaned up for both complete and failure cases.

Best practice 2 – Communicate over a private network path

Many customers rely on encryption in transit to securely communicate with Amazon Transcribe over the internet. However, for some applications, data encryption in transit may not be sufficient to meet security requirements. In some cases, data is required to not traverse public networks such as the internet. Also, there may be a requirement for the application to be deployed in a private environment not connected to the internet. To meet these requirements, use interface VPC endpoints powered by AWS PrivateLink.

The following architectural diagram demonstrates a use case where an application is deployed on Amazon EC2. The EC2 instance that is running the application does not have access to the internet and is communicating with Amazon Transcribe and Amazon S3 via interface VPC endpoints.

An EC2 instance inside a VPC is communicating with Amazon Transcribe and Amazon S3 services in the same region via interface VPC endpoints.
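
A sketch of creating such an interface VPC endpoint for Amazon Transcribe with boto3 follows; the VPC, subnet, and security group IDs are placeholders, and the Region in the service name must match your own.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder VPC, subnet, and security group IDs
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    # Use com.amazonaws.<region>.transcribestreaming for streaming transcriptions
    ServiceName="com.amazonaws.us-east-1.transcribe",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
# Amazon S3 access from the VPC requires its own (gateway or interface) endpoint.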

In some scenarios, the application that is communicating with Amazon Transcribe may be deployed in an on-premises data center. There may be additional security or compliance requirements that mandate that data exchanged with Amazon Transcribe must not transit public networks such as the internet. In this case, private connectivity via AWS Direct Connect can be used. The following diagram shows an architecture that allows an on-premises application to communicate with Amazon Transcribe without any connectivity to the internet.

A corporate data center with an application server is connected to the AWS Cloud via AWS Direct Connect. The on-premises application server communicates with Amazon Transcribe and Amazon S3 via AWS Direct Connect and then interface VPC endpoints.

Best practice 3 – Redact sensitive data if needed

Some use cases and regulatory environments may require the removal of sensitive data from transcripts and audio files. Amazon Transcribe supports identifying and redacting personally identifiable information (PII) such as names, addresses, Social Security numbers, and so on. This capability can be used to enable customers to achieve payment card industry (PCI) compliance by redacting PII such as credit or debit card number, expiration date, and three-digit card verification code (CVV). Transcripts with redacted information will have PII replaced with placeholders in square brackets indicating what type of PII was redacted. Streaming transcriptions support the additional capability to only identify PII and label it without redaction. The types of PII redacted by Amazon Transcribe vary between batch and streaming transcriptions. Refer to Redacting PII in your batch job and Redacting or identifying PII in a real-time stream for more details.
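
For illustration, the following sketch starts a batch transcription job with PII redaction enabled; the job name and S3 locations are placeholders.

import boto3

transcribe = boto3.client("transcribe")

# Placeholder job name and S3 locations
transcribe.start_transcription_job(
    TranscriptionJobName="customer-call-redacted-001",
    Media={"MediaFileUri": "s3://my-transcribe-input-bucket/calls/call-001.wav"},
    LanguageCode="en-US",
    OutputBucketName="my-transcribe-output-bucket",
    ContentRedaction={
        "RedactionType": "PII",
        # Produce only the redacted transcript; use "redacted_and_unredacted"
        # to keep both versions.
        "RedactionOutput": "redacted",
    },
)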

The specialized Amazon Transcribe Call Analytics APIs have a built-in capability to redact PII in both text transcripts and audio files. This API uses specialized speech-to-text and natural language processing (NLP) models trained specifically to understand customer service and sales calls. For other use cases, you can use this solution to redact PII from audio files with Amazon Transcribe.

Additional Amazon Transcribe security best practices

Best practice 4 – Use IAM roles for applications and AWS services that require Amazon Transcribe access. When you use a role, you don’t have to distribute long-term credentials, such as passwords or access keys, to an EC2 instance or AWS service. IAM roles can supply temporary permissions that applications can use when they make requests to AWS resources.

Best practice 5 – Use tag-based access control. You can use tags to control access within your AWS accounts. In Amazon Transcribe, tags can be added to transcription jobs, custom vocabularies, custom vocabulary filters, and custom language models.

Best practice 6 – Use AWS monitoring tools. Monitoring is an important part of maintaining the reliability, security, availability, and performance of Amazon Transcribe and your AWS solutions. You can monitor Amazon Transcribe using AWS CloudTrail and Amazon CloudWatch.

Best practice 7 – Enable AWS Config. AWS Config enables you to assess, audit, and evaluate the configurations of your AWS resources. Using AWS Config, you can review changes in configurations and relationships between AWS resources, investigate detailed resource configuration histories, and determine your overall compliance against the configurations specified in your internal guidelines. This can help you simplify compliance auditing, security analysis, change management, and operational troubleshooting.

Compliance validation for Amazon Transcribe

Applications that you build on AWS may be subject to compliance programs, such as SOC, PCI, FedRAMP, and HIPAA. AWS uses third-party auditors to evaluate its services for compliance with various programs. AWS Artifact allows you to download third-party audit reports.

To find out if an AWS service is within the scope of specific compliance programs, refer to AWS Services in Scope by Compliance Program. For additional information and resources that AWS provides to help customers with compliance, refer to Compliance validation for Amazon Transcribe and AWS compliance resources.

Conclusion

In this post, you have learned about various security mechanisms, best practices, and architectural patterns available for you to build secure applications with Amazon Transcribe. You can protect your sensitive data both in transit and at rest with strong encryption. PII redaction can be used to enable removal of personal information from your transcripts if you do not want to process and store it. VPC endpoints and Direct Connect allow you to establish private connectivity between your application and the Amazon Transcribe service. We also provided references that will help you validate compliance of your application using Amazon Transcribe with programs such as SOC, PCI, FedRAMP, and HIPAA.

As next steps, check out Getting started with Amazon Transcribe to quickly start using the service. Refer to Amazon Transcribe documentation to dive deeper into the service details. And follow Amazon Transcribe on the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Transcribe.


About the Author

Alex Bulatkin is a Solutions Architect at AWS. He enjoys helping communication service providers build innovative solutions in AWS that are redefining the telecom industry. He is passionate about working with customers on bringing the power of AWS AI services into their applications. Alex is based in the Denver metropolitan area and likes to hike, ski, and snowboard.

Beyond the Mud: Datasets, Benchmarks, and Methods for Computer Vision in Off-Road Racing

TL;DR: Off-the-shelf text spotting and re-identification models fail in basic off-road racing settings, even more so during muddy events. Making matters worse, there aren’t any public datasets to tune or improve models in this domain. To this end, we introduce datasets, benchmarks, and methods for the challenging off-road racing setting.


In the dynamic world of sports analytics, machine learning (ML) systems play a pivotal role, transforming vast arrays of visual data into actionable insights. These systems are adept at navigating through thousands of photos to tag athletes, enabling fans and participants alike to swiftly locate images of specific racers or moments from events. This technology has seamlessly integrated into various sports, significantly enhancing the spectator experience and operational efficiency. Yet, not all sports environments cater equally to the capabilities of current ML models. Off-road motorcycle racing, characterized by its unpredictable and untamed wilderness settings, poses unique challenges that push the boundaries of what existing computer vision systems can handle.

Imagine the conditions under which off-road races are conducted: racers blitz through waist-deep mud holes, endure torrential rains, navigate through blinding dust clouds, and much more. Such extreme environmental factors introduce variables like mud occlusion, complex poses (racers frequently crash), glare, motion blur, and variable lighting conditions, which significantly degrade the performance of conventional text spotting and person re-identification (ReID) models. Typical models, trained on more ‘sterile’ conditions, falter when faced with the task of identifying racers and their numbers in the chaotic and mud-splattered scenes typical of off-road racing events. Take, for example, these images of the same racer, taken only minutes apart:

Figure 1: Four images of a single racer taken during the same event. Accurately matching a rider throughout the event is extremely difficult due to the very high variation in appearance caused by mud, odd poses, and much more.

The lack of public datasets tailored to these rugged conditions exacerbates the problem, leaving researchers and practitioners without the resources needed to tune and enhance models for better performance in off-road racing, or equally unconstrained, scenarios. Recognizing this gap, our work aims to bridge it by introducing new datasets and benchmarks specifically designed for the challenging setting of off-road motorcycle racing. This blog post will delve into the unique challenges presented by off-road racing environments, describe our efforts in creating datasets that capture these conditions, and discuss methods and benchmarks for improving computer vision models to robustly handle the extreme variability inherent in off-road racing. I’ll even give a brief overview of some new weakly supervised methods for improving models in these challenging areas, with very little labeled data. Join in as we explore the uncharted territories of machine learning applications in off-road motorcycle racing, pushing the limits of what’s possible in sports analytics and beyond.

The Challenges

Figure 2: More examples of the challenging conditions presented by off-road racing, causing the performance of existing models and methods to fall below an acceptable threshold.

Off-road motorcycle racing is an adrenaline-pumping sport that takes athletes and their machines through some of the most challenging terrain nature has to offer. Unlike the relatively predictable environments of track racing or urban marathons, off-road racing is fraught with unpredictability and extreme conditions. The very essence of what makes it thrilling for participants and spectators alike—mud, dust, water, uneven terrain—presents a formidable challenge for computer vision systems. Here, we delve into the specific hurdles that these conditions pose for text spotting and re-identification models in off-road racing scenarios.

Dirt is pervasive in off-road racing, manifesting as mud or dust. As races progress, vehicles and riders become increasingly coated in dirt, which can obscure critical identifying features such as racer numbers or distinguishing gear colors. The dynamic nature of off-road racing means that athletes are rarely in simple, upright poses. Instead, they navigate the course through jumps, sharp turns, and even crashes. The outdoor settings of off-road races often shift rapidly from deep, dark forests to bright, glaring fields, introducing variable lighting conditions. Similarly, the high speeds at which racers move, combined with the stylistic choices of some photographers, can lead to motion blur. In each of these cases, traditional optical character recognition (OCR) and re-identification (ReID) models, trained primarily on clean, unobstructed images, struggle to recognize text or identify individuals.

The Datasets and Initial Benchmarks

To tackle the formidable challenges presented by off-road motorcycle racing, we embarked on a mission to create and introduce datasets that accurately capture the essence and extremities of this sport. Recognizing the gap in existing computer vision resources, our datasets—the off-road Racer Number Dataset (RND) and the MUddy Racer re-iDentification Dataset (MUDD)—are meticulously curated to serve as a robust foundation for developing and benchmarking models capable of operating in the harsh, unpredictable conditions of off-road racing. Both datasets and their benchmarking code are publicly available: you can find RND here and MUDD here.

Figure 3 details the text spotting results on the RND dataset. Results are broken down by the various types of occlusion in the dataset. Even on the cleanest data (i.e. the data with no occlusion), the best fine-tuned models reach a maximum end-to-end (E2E) F1 score of 0.6, leaving a lot to be desired. Introducing any of the aforementioned challenges (mud, unusual poses, glare, blur) reduces this even further, down to a worst-case end-to-end F1 score of 0.29. The models tested were Yet Another Mask Text Spotter (YAMTS) and Swin Text Spotter, with YAMTS consistently performing best. Fine-tuning reduces the negative effect of the various occlusion types (i.e. the blue bar changes less as a percentage of performance than the orange across the various occlusions), yet occlusion still causes significant performance degradation.

Figure 3: Text Detection and Recognition results on the RND dataset (higher is better). Results are broken down by the types of occlusion present in the data. The left plot details the detection performance, whereas the right plot details the end-to-end text recognition F1 score, where a prediction is correct only if the detection and predicted text match exactly. While vanilla fine-tuning helps a lot and reduces the negative effects of occlusion, mud still remains an unsolved challenge. More advancements are needed for high performing off-road racing OCR techniques.

Figure 4 breaks down the performance of our best ReID models. In the standard ReID evaluation setting, a sample from a query set is used to return a ranking over a gallery set. We report the Rank-1 accuracy along with the mean average precision (mAP). Figure 4 looks at two variations of the query and gallery sets: one query set of all the muddy images and one without, and the same for the gallery set. In the simplest setting (No Mud -> No Mud), model performance is reasonably good, around 0.9 mAP. However, mud drops this performance by as much as 30%. The models tested were the Omni-Scale Network (OSNet) and ResNet-50. Figure 4 reports results from OSNet as it was the most performant.

Figure 4: Rank@1 accuracy and mean average precision (mAP) on the MUDD dataset. The query and gallery sets are broken into two groups based on the presence of mud. While the ReID model performs well among clean data, mud causes a performance drop of as much as 30%.

In summary, the off-road racing setting is difficult, even in the best case. Once dirt and mud enter the equation, models require advancement before they reach the threshold of usability in a real-world application.

Improving These Systems

A “Mud-Like” Data Augmentation

Figure 5: Examples of a new data augmentation strategy, referred to as speckling. The idea behind this is to imitate the “splotchy” nature of mud splatters. This data augmentation improves re-id and text spotting performance by 4% and 7% respectively.

The first step in building robustness to mud is to introduce a data augmentation strategy: speckling. As shown in previous examples, mud often accumulates in small chunks. To emulate this, we introduce speckling, where we randomly change many small patches of the input imagery into the pixel mean. This is similar to random erasing but at a much smaller scale with a large number of patches being erased in each image. This technique leads to a 4% improvement in Rank-1 accuracy for person re-identification on the MUDD dataset, and while it does not meaningfully affect the detection F1 score of text spotting on RND, it does improve the end-to-end F1 score by 7%. While we also use the standard color jitter data augmentation to help robustness to the color changes induced as a racer gets dirty, more research is needed to determine if a more specific color augmentation can prove useful. 
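
The exact implementation is not reproduced here, but a minimal sketch of the idea, assuming images are H x W x 3 NumPy arrays and using illustrative patch counts and sizes, might look like the following.

import numpy as np

def speckle(image, num_patches=200, patch_size=6, rng=None):
    """Erase many small patches to the per-image mean pixel to mimic mud splatter.

    image is an H x W x 3 uint8 array; patch count and size are illustrative defaults.
    """
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = out.shape[:2]
    mean_pixel = out.reshape(-1, out.shape[-1]).mean(axis=0).astype(out.dtype)
    for _ in range(num_patches):
        y = rng.integers(0, max(1, h - patch_size))
        x = rng.integers(0, max(1, w - patch_size))
        out[y:y + patch_size, x:x + patch_size] = mean_pixel
    return out

# Example: apply the augmentation to a random "image"
augmented = speckle(np.random.randint(0, 256, (256, 128, 3), dtype=np.uint8))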

Learning from Weak Labels

Another intricacy of sports imagery that we can take advantage of is the natural groupings that often exist. For example, prior marathon imagery has been manually grouped by humans, such that each group (which we will refer to as a bag) consists of images that all contain a specific individual. However, which individual in each image is the one of interest is unknown. In motorcycle racing, we have the same kind of data, as well as customer purchase history. Most customers purchase photos of a single racer, so the list of purchased photos again becomes a bag for a specific individual, although which person in each image is that individual is unknown. This type of label is visualized in Figure 6.

Figure 6: Weakly labeled person re-identification. Each bag represents all crops of individuals in a group of photos, where each photo group is known to contain photos containing one specific individual. However, each photo also contains multiple people, and there is no way to tell which person in each image is the one of interest.

We introduce Contrastive Multiple Instance Learning (CMIL) to address this challenge. This method works by generating bag representations from all of the instance representations that comprise each bag. The bag representations are then used to optimize a model via a triplet loss or classification loss. In other words, we optimize the model to accurately classify bags, not individuals. This does not align with our test-time goal of identifying individuals; surprisingly, however, our bag classification models naturally produce useful individual representations. Figure 7 gives an overview of the CMIL model. On the MUDD dataset, CMIL improves over the next-best weakly labeled person re-identification method by 4% Rank-1 accuracy, and over a model that trusts the bag-level labels as accurate person-level labels by over 20%.

Figure 7: The CMIL method which enables learning effective person (racer) re-identification models, even when only given data weakly labeled in bags as described in Figure 6. The key idea in making this possible is to compare bag embeddings instead of instance embeddings.
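
To make the idea concrete, here is a highly simplified PyTorch sketch of the bag-level objective. The mean pooling, cosine distance, and margin value are illustrative assumptions, not the paper’s exact configuration.

import torch
import torch.nn.functional as F

def bag_embedding(instance_embeddings):
    # Aggregate instance (crop) embeddings into one bag representation;
    # mean pooling here, attention pooling is another option.
    return instance_embeddings.mean(dim=0)

def cmil_triplet_loss(anchor_bag, positive_bag, negative_bag, margin=0.3):
    """Triplet loss computed on bag embeddings rather than instance embeddings."""
    a = F.normalize(bag_embedding(anchor_bag), dim=0)
    p = F.normalize(bag_embedding(positive_bag), dim=0)
    n = F.normalize(bag_embedding(negative_bag), dim=0)
    d_ap = 1.0 - torch.dot(a, p)  # distance to a bag of the same racer
    d_an = 1.0 - torch.dot(a, n)  # distance to a bag of a different racer
    return F.relu(d_ap - d_an + margin)

# Example with random stand-in embeddings: bags of 5, 4, and 6 crops, 128-dim each
loss = cmil_triplet_loss(torch.randn(5, 128), torch.randn(4, 128), torch.randn(6, 128))
print(float(loss))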

Conclusion

Off-road racing poses major challenges to existing text spotting and person re-identification methods and models, rendering them unfit for practical application. Our first steps toward improving computer vision performance in these areas include introducing two datasets for the corresponding problems, introducing a new data augmentation technique, and bringing contrastive learning to the multiple instance learning framework. We hope these initial works spur more innovation in off-road applications.

For more information, you can find the papers and code this blog post is based on here:
Beyond the Mud: Datasets and Benchmarks for Computer Vision in Off-Road Racing (code)
Contrastive Multiple Instance Learning for Weakly Supervised Person ReID (code)

Boost your content editing with Contentful and Amazon Bedrock

This post is co-written with Matt Middleton from Contentful.

Today, jointly with Contentful, we are announcing the launch of the AI Content Generator powered by Amazon Bedrock.

The AI Content Generator powered by Amazon Bedrock is an app available on the Contentful Marketplace that allows users to create, rewrite, summarize, and translate content using cutting-edge generative artificial intelligence (AI) models available and accessible through Amazon Bedrock in a simple and secure manner. This app helps content producers reduce their lead time to publishing content while enhancing the quality and consistency of the content produced.

Contentful is an intelligent composable content platform that allows the creation and management of content to build digital experiences. Contentful is an AWS customer and partner.

Amazon Bedrock is a fully managed service that offers quick access to a choice of industry-leading large language models and other foundation models from AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon, along with a broad set of capabilities to build generative AI applications, simplifying development while supporting privacy and security.

With this newly launched app, Contentful customers can take advantage of the range of models provided by Amazon Bedrock directly from within Contentful. They can pick the model that best suits their desired language style, creativity, speed, and budget.

Use case

Contentful customers use the platform to scale content across global experiences, brands, and channels. A frequent task for a content editor is to rewrite existing content to make it fit another channel, for example shortening it so it fits a smaller screen. This is a task that the AI Content Generator powered by Amazon Bedrock can now do automatically:

  1. First, open an existing content entry. The app is available in the sidebar.
  2. When you choose Rewrite, a dialog asks you to choose the fields for input and output, in this case Body and Body Short, respectively.
  3. You can describe the modifications that should be done to the existing content; in this case, we choose the pre-provided option Shorter.
  4. Choose Generate to invoke Amazon Bedrock to perform the desired modification (a sketch of the underlying API call follows these steps).
    Content rewrite modal
  5. Finally, you can modify the generated text and then choose Apply to apply the text to the specified output field.
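
Under the hood, a rewrite like this comes down to a single model invocation through Amazon Bedrock. The following sketch uses the Bedrock Converse API with a placeholder model ID and prompt to shorten a piece of text; it illustrates the kind of call involved, not the app’s actual implementation.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body_text = "Our new trail shoe pairs a breathable knit upper with a grippy outsole built for wet rock and loose gravel."

# Placeholder model ID; use any text model you have enabled in Amazon Bedrock
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": f"Rewrite the following text so it is shorter:\n\n{body_text}"}],
    }],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])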

Getting started

To get started, you need to sign up for Contentful, which is free. Next, install the AI Content Generator powered by Amazon Bedrock app by visiting the Contentful Marketplace and choosing Install now.

The installation dialog asks you for an AWS Region where you want to use Amazon Bedrock, as well as an AWS access key ID and secret access key for authentication. To help you meet your organization’s security rules, we have published an AWS CloudFormation template that creates an AWS Identity and Access Management (IAM) user with the minimum privileges you need for this app.

Using generative AI at scale can become expensive. To help prevent surprises in your AWS bill, we have published another CloudFormation template that deploys a budgeting application into your AWS account. You’re able to define a soft limit, which invokes an email notification, and a hard limit, which disables access to Amazon Bedrock entirely for the IAM user you created.

During app installation, you’re able to provide company-specific information such as branding guidelines, which will always be applied when interacting with the app. At the end of the installation dialog, make sure you assign the app to all content types where you may need it. This will enable the app in the sidebar of your content editing experience.

Conclusion

With the AI Content Generator powered by Amazon Bedrock, content teams can unlock powerful tools to save time, reduce feedback loops, and increase creativity. Contentful customers can use generative AI to generate content on demand, transform content to shift voice and tone, translate and localize content for global markets, and even make sure that content remains on brand. Generative AI also plays a critical role in eliminating the “blank page” problem, where a digital team spends more time figuring out where to start than actually creating great content.

Bringing Amazon Bedrock to Contentful means that digital teams can now use a range of leading large language models to unlock their creativity, create more efficiently, and reach their customers in the most impactful way.


About the Authors

Ulrich Hinze is a Solutions Architect at AWS. He partners with software companies to architect and implement cloud-based solutions on AWS. Before joining AWS, he worked for AWS customers and partners in software engineering, consulting, and architecture roles for 8+ years.

Aris Tsakpinis is a Specialist Solutions Architect for AI & Machine Learning with a special focus on natural language processing (NLP), large language models (LLMs), and generative AI. In his free time he is pursuing a PhD in ML Engineering at the University of Regensburg, focusing on applied NLP in the science domain.

Matt Middleton is the Senior Product Partner Ecosystem Manager at Contentful. He runs the strategy and operations of Contentful’s Technology Partner Program. Matt’s background includes eCommerce, enterprise search, personalization, and marketing automation.

A Multi-signal Large Language Model for Device-directed Speech Detection

We present an architecture for device-directed speech detection that treats the task as a text-generation problem. We use a multi-modal fusion approach that combines acoustic information from the recorded audio waveform with text and confidence information obtained from an automatic speech recognition system. The audio waveform is represented as a sequence of continuous embeddings by an audio encoder and presented as a prefix token to a pretrained large language model (LLM). We demonstrate that using multi-modal information within LLMs yields equal error rate improvements over text-only and…

Apple Machine Learning Research