Build well-architected IDP solutions with a custom lens – Part 4: Performance efficiency


When a customer has a production-ready intelligent document processing (IDP) workload, we often receive requests for a Well-Architected review. To build an enterprise solution, developer resources, cost, time, and user experience must be balanced to achieve the desired business outcome. The AWS Well-Architected Framework provides a systematic way for organizations to learn operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

The IDP Well-Architected Custom Lens follows the AWS Well-Architected Framework, reviewing the solution across the six pillars at the granularity of a specific AI or machine learning (ML) use case, and providing guidance to tackle common challenges. The IDP Well-Architected Custom Lens in the Well-Architected Tool contains questions regarding each of the pillars. By answering these questions, you can identify potential risks and resolve them by following your improvement plan.

This post focuses on the Performance Efficiency pillar of the IDP workload. We dive deep into designing and implementing the solution to optimize for throughput, latency, and overall performance. We start by discussing common indicators that you should conduct a Well-Architected review, introduce the fundamental approaches and design principles, and then go through each focus area from a technical perspective.

To follow along with this post, you should be familiar with the previous posts in this series (Part 1 and Part 2) and the guidelines in Guidance for Intelligent Document Processing on AWS. These resources introduce common AWS services for IDP workloads and suggested workflows. With this knowledge, you’re now ready to learn more about productionizing your workload.

Common indicators

The following are common indicators that you should conduct a Well-Architected Framework review for the Performance Efficiency pillar:

  • High latency – When the latency of optical character recognition (OCR), entity recognition, or the end-to-end workflow takes longer than your previous benchmark, this may be an indicator that the architecture design doesn’t cover load testing or error handling.
  • Frequent throttling – You may experience throttling by AWS services like Amazon Textract due to request limits. This means that the architecture needs to be adjusted by reviewing the architecture workflow, synchronous and asynchronous implementation, transactions per second (TPS) calculation, and more.
  • Debugging difficulties – When there’s a document process failure, you may not have an effective way to identify where the error is located in the workflow, which service it’s related to, and why the failure occurred. This means the system lacks visibility into logs and failures. Consider revisiting the logging design of the telemetry data and adding infrastructure as code (IaC), such as document processing pipelines, to the solution.

Design principles

In this post, we discuss three design principles: delegating complex AI tasks, IaC architectures, and serverless architectures. When you encounter a trade-off between two implementations, you can revisit the design principles with the business priorities of your organization so that you can make decisions effectively.

  • Delegating complex AI tasks – You can enable faster AI adoption in your organization by offloading the ML model development lifecycle to managed services and taking advantage of the model development and infrastructure provided by AWS. Rather than requiring your data science and IT teams to build and maintain AI models, you can use pre-trained AI services that can automate tasks for you. This allows your teams to focus on higher-value work that differentiates your business, while the cloud provider handles the complexity of training, deploying, and scaling the AI models.
  • IaC architectures – When running an IDP solution, the solution includes multiple AI services to perform the end-to-end workflow chronologically. You can architect the solution with workflow pipelines using AWS Step Functions to enhance fault tolerance, parallel processing, visibility, and scalability. These advantages can enable you to optimize the usage and cost of underlying AI services.
  • Serverless architectures – IDP is often an event-driven solution, initiated by user uploads or scheduled jobs. The solution can be horizontally scaled out by increasing the call rates for the AI services, AWS Lambda, and other services involved. A serverless approach provides scalability without over-provisioning resources, preventing unnecessary expenses. The monitoring behind the serverless design assists in detecting performance issues.
Figure 1. The benefit of applying the design principles. By author.


With these three design principles in mind, organizations can establish an effective foundation for AI/ML adoption on cloud platforms. By delegating complexity, implementing resilient infrastructure, and designing for scale, organizations can optimize their AI/ML solutions.

In the following sections, we discuss how to address common challenges in each technical focus area.

Focus areas

When reviewing performance efficiency, we review the solution from five focus areas: architecture design, data management, error handling, system monitoring, and model monitoring. With these focus areas, you can conduct an architecture review from different aspects to enhance the effectiveness, observability, and scalability of the three components of an AI/ML project: data, model, and business goal.

Architecture design

By going through the questions in this focus area, you will review the existing workflow to see if it follows best practices. The suggested workflow provides a common pattern that organizations can follow and prevents trial-and-error costs.

Based on the proposed architecture, the workflow follows the six stages of data capture, classification, extraction, enrichment, review and validation, and consumption. Of the common indicators we discussed earlier, two out of three stem from architecture design problems. This is because when you start a project with an improvised approach, you may run into project constraints when trying to align your infrastructure to your solution. With the architecture design review, the improvised design can be decoupled into stages, and each stage can be reevaluated and reordered.

You can save time, money, and labor by implementing classification in your workflow so that documents are routed to downstream applications and APIs based on document type. This enhances the observability of the document process and makes the solution straightforward to maintain when adding new document types.
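As an illustrative sketch of type-based routing (the document types and queue names here are hypothetical, not from this post), a simple lookup table that falls back to human review for unknown types is often enough:

```python
# Hypothetical document types and downstream queue names, for illustration only.
DOWNSTREAM_ROUTES = {
    "INVOICE": "invoice-extraction-queue",
    "RECEIPT": "receipt-extraction-queue",
    "ID_DOCUMENT": "identity-verification-queue",
}

def route_document(doc_type, default_route="manual-review-queue"):
    """Return the downstream destination for a classified document type.

    Unknown types fall back to human review instead of failing silently.
    """
    return DOWNSTREAM_ROUTES.get(doc_type.upper(), default_route)
```

Adding a new document type then means adding one route entry, which keeps the workflow straightforward to maintain.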

Data management

The performance of an IDP solution covers latency, throughput, and the end-to-end user experience. How you manage documents and their extracted information within the solution is key to data consistency, security, and privacy. Additionally, the solution must handle high data volumes with low latency and high throughput.

When going through the questions of this focus area, you will review the document workflow. This includes data ingestion, data preprocessing, converting documents to document types accepted by Amazon Textract, handling incoming document streams, routing documents by type, and implementing access control and retention policies.

For example, by storing a document at each processed phase, you can roll processing back to a previous step if needed. This data lifecycle ensures reliability and compliance for the workload. By using the Amazon Textract Service Quotas Calculator (see the following screenshot), asynchronous features of Amazon Textract, Lambda, Step Functions, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Service (Amazon SNS), organizations can automate and scale document processing tasks to meet specific workload needs.
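One way to make each processed phase reversible is a stage-prefixed Amazon S3 key convention. The following is a hypothetical naming scheme (the stage names loosely mirror the workflow stages discussed earlier and are assumptions, not a prescribed layout):

```python
# Hypothetical stage prefixes; adapt them to your own pipeline.
STAGES = ("raw", "classified", "extracted", "enriched", "validated")

def stage_key(doc_id, stage, filename):
    """Build an S3 key that records which processing phase an object belongs to.

    Storing one copy per stage lets you roll back to any previous step.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{stage}/{doc_id}/{filename}"
```

With this layout, S3 Lifecycle rules and retention policies can also be applied per stage prefix.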

 Figure 2. Amazon Textract Service Quota Calculator. By author.


Error handling

Robust error handling is critical for tracking the document process status, and it provides the operation team time to react to any abnormal behaviors, such as unexpected document volumes, new document types, or other unplanned issues from third-party services. From the organization’s perspective, proper error handling can enhance system uptime and performance.

You can break down error handling into two key aspects:

  • AWS service configuration – You can implement retry logic with exponential backoff to handle transient errors like throttling. When you start processing by calling an asynchronous Start* operation, such as StartDocumentTextDetection, you can specify that the completion status of the request is published to an SNS topic in the NotificationChannel configuration. This helps you avoid throttling limits on API calls due to polling the Get* APIs. You can also implement alarms in Amazon CloudWatch and triggers to alert when unusual error spikes occur.
  • Error report enhancement – This includes detailed messages with an appropriate level of detail by error type and descriptions of error handling responses. With the proper error handling setup, systems can be more resilient by implementing common patterns like automatically retrying intermittent errors, using circuit breakers to handle cascading failures, and monitoring services to gain insight into errors. This allows the solution to balance between retry limits and prevents never-ending circuit loops.
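The retry logic described above can be sketched as a generic exponential-backoff wrapper. The helper name and parameters are illustrative; in practice you would catch the AWS SDK's specific throttling exceptions (for example, botocore's ClientError with a ThrottlingException code) rather than a broad exception class:

```python
import random
import time

def call_with_backoff(fn, *, retries=5, base_delay=0.5, retriable=(Exception,)):
    """Retry fn with exponential backoff and full jitter on retriable errors."""
    for attempt in range(retries):
        try:
            return fn()
        except retriable:
            if attempt == retries - 1:
                raise  # out of retries: surface the error for alarming
            # sleep a random amount up to the exponentially growing cap
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Hypothetical usage with the asynchronous Textract API (requires AWS
# credentials; bucket name, topic ARN, and role ARN are placeholders):
# import boto3
# textract = boto3.client("textract")
# job = call_with_backoff(lambda: textract.start_document_text_detection(
#     DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "doc.pdf"}},
#     NotificationChannel={
#         "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:textract-jobs",
#         "RoleArn": "arn:aws:iam::111122223333:role/TextractSNSRole",
#     },
# ))
```

Because completion is published to the SNS topic, no Get* polling is needed, which avoids a second source of throttling.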

Model monitoring

The performance of ML models is monitored for degradation over time. As data and system conditions change, the model performance and efficiency metrics are tracked to ensure retraining is performed when needed.

The ML model in an IDP workflow can be an OCR model, entity recognition model, or classification model. The model can come from an AWS AI service, an open source model on Amazon SageMaker, Amazon Bedrock, or other third-party services. You must understand the limitations and use cases of each service in order to identify ways to improve the model with human feedback and enhance service performance over time.

A common approach is using service logs to understand different levels of accuracy. These logs can help the data science team identify and understand any need for model retraining. Your organization can choose the retraining mechanism—it can be quarterly, monthly, or based on science metrics, such as when accuracy drops below a given threshold.
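A threshold-based retraining trigger of the kind described can be sketched as follows; the window size and accuracy threshold are illustrative assumptions, and your science metrics may differ:

```python
def needs_retraining(accuracy_history, threshold=0.90, window=3):
    """Flag retraining when the rolling mean accuracy over the last
    `window` evaluations drops below the threshold."""
    if len(accuracy_history) < window:
        return False  # not enough evaluations yet to decide
    recent = accuracy_history[-window:]
    return sum(recent) / window < threshold
```

Using a rolling window rather than a single data point avoids triggering retraining on one noisy evaluation.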

The goal of monitoring is not just detecting issues, but closing the loop to continuously refine models and keep the IDP solution performing as the external environment evolves.

System monitoring

After you deploy the IDP solution in production, it’s important to monitor key metrics and automation performance to identify areas for improvement. The metrics should include both business metrics and technical metrics. This allows the company to evaluate the system’s performance, identify issues, and make improvements to models, rules, and workflows over time, increasing the automation rate and clarifying the operational impact.

On the business side, metrics like extraction accuracy for important fields, overall automation rate indicating the percentage of documents processed without human intervention, and average processing time per document are paramount. These business metrics help quantify the end-user experience and operational efficiency gains.

Technical metrics, including error and exception rates occurring throughout the workflow, are essential to track from an engineering perspective. Technical metrics can also be monitored at each level from end to end, providing a comprehensive view of a complex workload. You can break the metrics down into different levels, such as solution level, end-to-end workflow level, document type level, document level, entity recognition level, and OCR level.
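To record metrics at these levels, you can attach the level as a CloudWatch dimension. The following sketch builds MetricData entries for put_metric_data; the namespace, metric names, and dimension keys are hypothetical:

```python
def build_metric(name, value, *, level, doc_type=None, unit="Count"):
    """Build one CloudWatch MetricData entry tagged with a processing level."""
    dimensions = [{"Name": "Level", "Value": level}]
    if doc_type is not None:
        dimensions.append({"Name": "DocumentType", "Value": doc_type})
    return {"MetricName": name, "Value": value, "Unit": unit,
            "Dimensions": dimensions}

# Hypothetical usage (requires AWS credentials; the namespace is a placeholder):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="IDP/Workflow",
#     MetricData=[build_metric("ExtractionErrors", 2,
#                              level="OCR", doc_type="INVOICE")],
# )
```

Dimensions let you slice the same error-rate metric by level and document type in CloudWatch dashboards and alarms.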

Now that you have reviewed all the questions in this pillar, you can assess the other pillars and develop an improvement plan for your IDP workload.

Conclusion

In this post, we discussed common indicators that you may need to perform a Well-Architected Framework review for the Performance Efficiency pillar for your IDP workload. We then walked through design principles to provide a high-level overview and discuss the solution goal. By following these suggestions in reference to the IDP Well-Architected Custom Lens and by reviewing the questions by focus area, you should now have a project improvement plan.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:


About the Authors

Mia Chang is a ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and support them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Read More

Build well-architected IDP solutions with a custom lens – Part 5: Cost optimization


Building a production-ready solution in the cloud involves a series of trade-offs between resources, time, customer expectations, and business outcomes. The AWS Well-Architected Framework helps you understand the benefits and risks of decisions you make while building workloads on AWS.

An intelligent document processing (IDP) project usually combines optical character recognition (OCR) and natural language processing (NLP) to read and understand a document and extract specific terms or words. The IDP Well-Architected Custom Lens outlines the steps for performing an AWS Well-Architected review, and helps you assess and identify the risks in your IDP workloads. It also provides guidance to tackle common challenges, enabling you to architect your IDP workloads according to best practices.

This post focuses on the Cost Optimization pillar of the IDP solution. A cost-optimized workload fully utilizes all resources, achieves an outcome at the lowest possible price point, and meets your functional requirements. We start with an introduction to the Cost Optimization pillar and its design principles, and then dive deep into the four focus areas: financial management, resource provisioning, data management, and cost monitoring. By reading this post, you will learn about the Cost Optimization pillar in the Well-Architected Framework through the IDP case study.

Design principles

Cost optimization is a continual process of refinement and improvement over the span of a workload’s lifecycle. The practices in this post can help you build and operate cost-aware IDP workloads that achieve business outcomes while minimizing costs and allowing your organization to maximize its return on investment.

Several principles can help you to improve cost optimization. Let’s consider different project phases. For example, during the project planning phase, you should invest in cloud financial management skills and tools, and align finance and tech teams to incorporate both business and technology perspectives. In the project development phase, we recommend adopting a consumption model and adjusting usage dynamically. When you’re ready for production, always monitor and analyze the spending.

Keep the following in mind as we discuss best practices:

  • Implement cloud financial management – To achieve financial success and accelerate business value realization with your IDP solution, you must invest in cloud financial management. Your organization must dedicate the necessary time and resources for building capability in this new domain of technology and usage management.
  • Cultivate a partnership between technology and finance – Involve finance and technology teams in cost and usage discussions while building your IDP solution and at all stages of your cloud journey. Teams should regularly meet and discuss topics such as organizational goals and targets with your IDP solution, current state of cost and usage, and financial and accounting practices.
  • Adopt a consumption model and adjust dynamically – Provision resources and manage data with cost awareness, and manage your project stages and environments with cost optimization over time. Pay only for the resources you consume, and increase or decrease usage depending on business requirements. For example, development and test environments for your IDP solution are typically only used for 8 hours a day during the work week. By stopping development and test environment resources when not in use, such as outside the 40 working hours per week, you can reduce costs by roughly 75% compared to running them continuously for all 168 hours in the week.
  • Monitor, attribute, and analyze expenditure – Measure the business output of the workload and the costs associated with delivery. Use this data to understand the gains you make from increasing output, increasing functionality, and reducing cost with your IDP workflow. AWS provides tools such as Amazon CloudWatch, tags, and AWS CloudTrail to make it straightforward to accurately identify the cost and usage of workloads, make sure you utilize resources to measure return on investment (ROI), and enable workload owners to optimize their resources and reduce costs.
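The savings figure in the consumption-model example above is simple arithmetic. A small helper (the name is illustrative) makes the calculation explicit: stopping resources outside a 40-hour work week saves about 76% of the always-on cost, in line with the roughly 75% cited above.

```python
def idle_stop_savings(active_hours_per_week, hours_per_week=168.0):
    """Fraction of cost saved by stopping pay-per-hour resources
    outside their active hours, versus running them continuously."""
    return 1.0 - active_hours_per_week / hours_per_week
```

For example, `idle_stop_savings(40)` evaluates to 1 - 40/168 ≈ 0.76.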

Focus areas

The design principles and best practices of the Cost Optimization pillar are based on insights gathered from our customers and our IDP technical specialist communities. Use them as guidance and support for your design decisions, and align these with the business requirements of your IDP solution. Applying the IDP Well-Architected Custom Lens helps you validate the resilience and efficiency of your IDP solution, and provides recommendations to address any gaps you might identify.

You might have encountered cases where the finance team independently performs financial planning for your cloud usage, only to have it disrupted by technical complexity. It’s also possible to ignore resource and data management while provisioning services, thereby creating unexpected items on your bill. In this post, we help you navigate these situations and provide guidelines for cost optimization of your IDP solution, so you don’t have to learn these lessons the costly way. The following are the four best practice areas for cost optimization of an IDP solution in the cloud: financial management, resource provisioning, data management, and cost monitoring.

Financial management

Establishing a team that can take responsibility for cost optimization is critical for successful adoption of cloud technology, and this is true for building an IDP solution as well. Relevant teams in both technology and finance within your organization must be involved in cost and usage discussions at all stages when building your IDP solution and along your cloud journey. The following are some key implementation steps to establish a dedicated cloud financial management team:

  • Define key members – Make sure that all relevant parts of your organization contribute and have a stake in cost management. Most importantly, you need to establish collaboration between finance and technology. Consider the following general groups, and include members with domain expertise in financial and business areas, as well as in technology, to integrate the knowledge for better financial management:
    • Financial leads – CFOs, financial controllers, financial planners, business analysts, procurement, sourcing, and accounts payable must understand the cloud model of consumption, purchasing options, and the monthly invoicing process. Finance needs to partner with technology teams to create and socialize an IT value story, helping business teams understand how technology spend is linked to business outcomes.
    • Technology leads – Technology leads (including product and application owners) must be aware of financial requirements (for example, budget constraints) as well as business requirements (for example, service level agreements). This allows the workload to be implemented to achieve the desired goals of the organization.
  • Define goals and metrics – The function needs to deliver value to the organization in different ways. Define these goals up front and let them continually evolve as the organization evolves. The function also needs to regularly report to the organization on its cost optimization capability.
  • Establish regular cadence – The group should come together regularly to review their goals and metrics. A typical cadence involves reviewing the state of the organization, any programs or services currently running, and overall financial and optimization metrics.

Resource provisioning

Given the various configurations and pricing models of AWS services as part of the IDP solution, you should only provision resources based on what you need and adjust your provisioning over time to align with your business requirement or development stage. Additionally, make sure you take advantage of free services offered by AWS to lower your overall cost. When provisioning resources for your IDP solution, consider the following best practices:

  • Decide between asynchronous inference or synchronous inference – You should adopt synchronous inference for real-time processing of a single document. Choose asynchronous jobs to analyze large documents or multiple documents in one batch, because asynchronous jobs handle large batches more cost-effectively.
  • Manage Amazon Comprehend endpoint inference units – Depending on your needs, you can adjust the throughput of your Amazon Comprehend endpoint after creating it. This can be achieved by updating the endpoint’s inference units (IUs). If you’re not actively using the endpoint for an extended period, you should set up an auto scaling policy to reduce your costs. If you’re no longer using an endpoint, you can delete the endpoint to avoid incurring additional cost.
  • Manage Amazon SageMaker endpoints – Similarly, for organizations that aim for inference type selection and endpoints running time management, you can deploy open source models on Amazon SageMaker. SageMaker provides different options for model inferences, and you can delete endpoints that aren’t being used or set up an auto scaling policy to reduce your costs on model endpoints.
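For Amazon Comprehend endpoints, each inference unit (IU) provides a throughput of 100 characters per second, so you can size IUs from observed demand. The following sketch estimates the desired IU count; the helper name and the update call shown in comments are illustrative:

```python
import math

# One Comprehend inference unit supports a throughput of
# 100 characters per second.
CHARS_PER_SECOND_PER_IU = 100.0

def desired_inference_units(chars_per_second):
    """Estimate the IUs needed to sustain an observed character throughput,
    with a floor of one IU for an active endpoint."""
    return max(1, math.ceil(chars_per_second / CHARS_PER_SECOND_PER_IU))

# Hypothetical usage (requires AWS credentials; the endpoint ARN is a placeholder):
# import boto3
# boto3.client("comprehend").update_endpoint(
#     EndpointArn="arn:aws:comprehend:us-east-1:111122223333:"
#                 "document-classifier-endpoint/my-endpoint",
#     DesiredInferenceUnits=desired_inference_units(250),
# )
```

Running this adjustment on a schedule, or deleting idle endpoints entirely, keeps provisioned throughput aligned with demand.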

Data management

Data plays a key role throughout your IDP solution, from building to delivery. Starting with the initial ingestion, data is pushed through different stages of processing, and is eventually returned as output to end-users. It’s important to understand how your data management choices impact the overall IDP solution cost. Consider the following best practices:

  • Adopt Amazon S3 Intelligent-Tiering – The Amazon S3 Intelligent-Tiering storage class is designed to optimize storage costs in Amazon Simple Storage Service (Amazon S3) by automatically moving data to the most cost-effective access tier when access patterns change, without operational overhead or impact on performance. There are two ways to move data into S3 Intelligent-Tiering:
    • Directly PUT data into S3 Intelligent-Tiering by specifying INTELLIGENT_TIERING in the x-amz-storage-class header.
    • Define S3 Lifecycle configurations to transition objects from S3 Standard or S3 Standard-Infrequent Access to S3 Intelligent-Tiering.
  • Enforce data retention policies throughout the IDP workflow – Use S3 Lifecycle configurations on an S3 bucket to define actions for Amazon S3 to take during an object’s lifecycle, as well as deletion at the end of the object’s lifecycle, based on your business requirements.
  • Split documents into single pages for specific FeatureType processing – FeatureType is a parameter for the Document Analysis API calls (both synchronous and asynchronous) in Amazon Textract. As of this writing, it includes the following values: TABLES, FORMS, QUERIES, SIGNATURES, and LAYOUT. Amazon Textract charges based on the number of pages and images processed, and not every page includes the information you need to extract. Splitting documents into single pages and processing only the pages with the FeatureType you need helps you avoid unnecessary processing, thereby reducing your overall cost.
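The direct PUT into S3 Intelligent-Tiering described above sets the x-amz-storage-class header, which boto3 exposes as the StorageClass parameter. A minimal sketch (bucket and key names are placeholders):

```python
def intelligent_tiering_put(bucket, key, body):
    """Build put_object kwargs that write directly into S3 Intelligent-Tiering."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # boto3 sends this as the x-amz-storage-class header
        "StorageClass": "INTELLIGENT_TIERING",
    }

# Hypothetical usage (requires AWS credentials):
# import boto3
# boto3.client("s3").put_object(
#     **intelligent_tiering_put("my-idp-bucket", "raw/doc.pdf", pdf_bytes))
```

Objects written this way are tiered automatically as access patterns change, with no lifecycle transition needed.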

So far, we’ve discussed best practices on the implementation and deployment of your IDP solution. When your IDP solution is deployed and ready for production, cost monitoring is an important area for you to observe and control the cost directly. In the following section, we discuss how to best perform cost monitoring with your IDP solution.

Cost monitoring

Cost optimization begins with a granular understanding of the breakdown in cost and usage; the ability to model and forecast future spend, usage, and features; and the implementation of sufficient mechanisms to align cost and usage to your organization’s objectives. To improve the cost optimization of your IDP solution, follow these best practices.

Design cost monitoring for the lifetime of IDP workflow

Define and implement a method to track resources and their associations with the IDP system over their lifetime. You can use tagging to identify the workload or function of the resource:

  • Implement a tagging scheme – Implement a tagging scheme that identifies the workload the resource belongs to, verifying that all resources within the workload are tagged accordingly. Tagging helps you categorize resources by purpose, team, environment, or other criteria relevant to your business. For more detail on tagging use cases, strategies, and techniques, see Best Practices for Tagging AWS Resources.
    • Tagging at the service level allows for more granular monitoring and control of your cost. For example, with Amazon Comprehend in an IDP workflow, you can use tags on Amazon Comprehend analysis jobs, custom classification models, custom entity recognition models, and endpoints to organize your Amazon Comprehend resources and provide tag-based cost monitoring and control.
    • When tagging at the service level isn’t applicable, you can turn to other resources for cost allocation reporting. For example, because Amazon Textract charges on a per-page basis, you can track the number of synchronous API calls to Amazon Textract for cost calculations (each synchronous API call maps to one page of the document). If you have large documents and want to utilize asynchronous APIs, you can use open source libraries to count the number of pages, or use Amazon Athena to query your CloudTrail logs and extract the page information for cost tracking.
  • Implement workload throughput or output monitoring – Implement workload throughput monitoring or alarming, initiating on either input requests or output completions. Configure it to provide notifications when workload requests or outputs drop to zero, indicating the workload resources are no longer used. Incorporate a time factor if the workload periodically drops to zero under normal conditions.
  • Group AWS resources – Create groups for AWS resources. You can use AWS resource groups to organize and manage your AWS resources that are in the same Region. You can add tags to most of your resources to help identify and sort your resources within your organization. Use Tag Editor to add tags to supported resources in bulk. Consider using AWS Service Catalog to create, manage, and distribute portfolios of approved products to end-users and manage the product lifecycle.
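A tagging scheme like the one described can be enforced in code before resources are created. This sketch uses hypothetical required tag keys; substitute the categories your organization defines:

```python
# Hypothetical required keys; replace with your organization's scheme.
REQUIRED_TAG_KEYS = ("project", "environment", "cost-center")

def build_tags(**tags):
    """Build an AWS-style Tags list, enforcing the minimal tagging scheme."""
    missing = [k for k in REQUIRED_TAG_KEYS if k not in tags]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    return [{"Key": k, "Value": v} for k, v in sorted(tags.items())]

# Hypothetical usage tagging a Comprehend classification job (requires AWS
# credentials; names and values are placeholders):
# import boto3
# boto3.client("comprehend").start_document_classification_job(
#     # ... job configuration ...
#     Tags=build_tags(**{"project": "idp", "environment": "prod",
#                        "cost-center": "doc-ops"}),
# )
```

Failing fast on missing tags keeps every resource attributable in cost allocation reports.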

Use monitoring tools

AWS offers a variety of tools and resources to monitor the cost and usage of your IDP solution. The following is a list of AWS tools that help with cost monitoring and control:

  • AWS Budgets – Configure AWS Budgets on all accounts for your workload. Set budgets for the overall account spend and budgets for the workloads by using tags. Configure notifications in AWS Budgets to receive alerts for when you exceed your budgeted amounts or when your estimated costs exceed your budgets.
  • AWS Cost Explorer – Configure AWS Cost Explorer for your workload and accounts to visualize your cost data for further analysis. Create a dashboard for the workload that tracks overall spend, key usage metrics for the workload, and forecasts of future costs based on your historical cost data.
  • AWS Cost Anomaly Detection – Use AWS Cost Anomaly Detection for your accounts, core services, or the cost categories you created to monitor your cost and usage and detect unusual spend. You can receive alerts individually or in aggregated reports, delivered by email or through an Amazon Simple Notification Service (Amazon SNS) topic, which allows you to analyze and determine the root cause of the anomaly and identify the factor driving the cost increase.
  • Advanced tools – Optionally, you can create custom tools for your organization that provide additional detail and granularity. You can implement advanced analysis capabilities using Athena and dashboards using Amazon QuickSight. Consider using Cloud Intelligence Dashboards for preconfigured, advanced dashboards. You can also work with AWS Partners and adopt their cloud management solutions to activate cloud bill monitoring and optimization in one convenient location.
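Programmatically, the Cost Explorer API can break spend down by a cost-allocation tag. This sketch builds the request shape for get_cost_and_usage; the tag key and dates are placeholders:

```python
def cost_by_tag_request(start, end, tag_key):
    """Build a get_cost_and_usage request grouped by a cost-allocation tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # ISO dates, end exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

# Hypothetical usage (requires AWS credentials and activated
# cost-allocation tags):
# import boto3
# ce = boto3.client("ce")
# resp = ce.get_cost_and_usage(
#     **cost_by_tag_request("2024-01-01", "2024-02-01", "project"))
```

Such a query is also a natural data source for the custom QuickSight dashboards mentioned above.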

Cost attribution and analysis

The process of categorizing costs is crucial in budgeting, accounting, financial reporting, decision-making, benchmarking, and project management. By classifying and categorizing expenses, teams can gain a better understanding of the types of costs they will incur throughout their cloud journey, helping them make informed decisions and manage budgets effectively. To improve the cost attribution and analysis of your IDP solution, follow these best practices:

  • Define your organization’s categories – Meet with stakeholders to define categories that reflect your organization’s structure and requirements. These will directly map to the structure of existing financial categories, such as business unit, budget, cost center, or department.
  • Define your functional categories – Meet with stakeholders to define categories that reflect the functions within your business. This may be your IDP workload or application names and the type of environment, such as production, testing, or development.
  • Define AWS cost categories – You can create cost categories to organize your cost and usage information. Use AWS Cost Categories to map your AWS costs and usage into meaningful categories. With cost categories, you can organize your costs using a rule-based engine.
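
As a sketch of the rule-based engine mentioned above, a cost category that groups spend by environment might look like the following payload for the Cost Explorer `create_cost_category_definition` API. The tag key, tag values, and category name are illustrative assumptions, and the call itself is shown commented out.

```python
# Rules map resources to category values based on a cost allocation tag.
rules = [
    {
        "Value": "production",
        "Rule": {
            "Tags": {
                "Key": "environment",  # hypothetical tag key
                "Values": ["prod"],
                "MatchOptions": ["EQUALS"],
            }
        },
    },
    {
        "Value": "development",
        "Rule": {
            "Tags": {
                "Key": "environment",
                "Values": ["dev", "test"],
                "MatchOptions": ["EQUALS"],
            }
        },
    },
]

# With AWS credentials configured:
# import boto3
# boto3.client("ce").create_cost_category_definition(
#     Name="idp-environment",
#     RuleVersion="CostCategoryExpression.v1",
#     Rules=rules,
# )
```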

Conclusion

In this post, we shared design principles, focus areas, and best practices for cost optimization in your IDP workflow.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

AWS is committed to the IDP Well-Architected Lens as a living tool. As IDP solutions and related AWS AI services evolve, and as new AWS services become available, we will update the IDP Well-Architected Lens accordingly.

To get started with IDP on AWS, refer to Guidance for Intelligent Document Processing on AWS to design and build your IDP application. For a deeper dive into end-to-end solutions that cover data ingestion, classification, extraction, enrichment, verification and validation, and consumption, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. Additionally, Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain covers how to extend a new or existing IDP architecture with large language models (LLMs). You’ll learn how you can integrate Amazon Textract with LangChain as a document loader, use Amazon Bedrock to extract data from documents, and use generative AI capabilities within the various IDP phases.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.


About the Authors

Suyin Wang is an AI/ML Specialist Solutions Architect at AWS. She has an interdisciplinary education background in Machine Learning, Financial Information Service and Economics, along with years of experience in building Data Science and Machine Learning applications that solved real-world business problems. She enjoys helping customers identify the right business questions and building the right AI/ML solutions. In her spare time, she loves singing and cooking.

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.

Mia Chang is a ML Specialist Solutions Architect for Amazon Web Services. She works with customers in EMEA and shares best practices for running AI/ML workloads on the cloud with her background in applied mathematics, computer science, and AI/ML. She focuses on NLP-specific workloads, and shares her experience as a conference speaker and a book author. In her free time, she enjoys hiking, board games, and brewing coffee.

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and support them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.

Tim Condello is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). His focus is natural language processing and computer vision. Tim enjoys taking customer ideas and turning them into scalable solutions.

Sherry Ding is a senior artificial intelligence (AI) and machine learning (ML) specialist solutions architect at Amazon Web Services (AWS). She has extensive experience in machine learning with a PhD degree in computer science. She mainly works with public sector customers on various AI/ML related business challenges, helping them accelerate their machine learning journey on the AWS Cloud. When not helping customers, she enjoys outdoor activities.


Build well-architected IDP solutions with a custom lens – Part 6: Sustainability


An intelligent document processing (IDP) project typically combines optical character recognition (OCR) and natural language processing (NLP) to automatically read and understand documents. Customers across all industries run IDP workloads on AWS to deliver business value by automating use cases such as KYC forms, tax documents, invoices, insurance claims, delivery reports, inventory reports, and more. IDP workflows on AWS can help you extract business insights from your documents, reduce manual effort, and process documents faster and with higher accuracy.

Building a production-ready IDP solution in the cloud requires a series of trade-offs between cost, availability, processing speed, and sustainability. This post provides guidance and best practices on how to improve the sustainability of your IDP workflow using Amazon Textract, Amazon Comprehend, and the IDP Well-Architected Custom Lens.

The AWS Well-Architected Framework helps you understand the benefits and risks of decisions made while building workloads on AWS. The AWS Well-Architected Custom Lenses complement the Well-Architected Framework with more industry-, domain-, or workflow-specific content. By using the Well-Architected Framework and the IDP Well-Architected Custom Lens, you will learn about operational and architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable workloads in the cloud.

The IDP Well-Architected Custom Lens provides you with guidance on how to address common challenges in IDP workflows that we see in the field. By answering a series of questions in the Well-Architected Tool, you will be able to identify the potential risks and address them by following the improvement plan.

This post focuses on the Sustainability pillar of the IDP custom lens. The Sustainability pillar focuses on designing and implementing the solution to minimize the environmental impact of your workload and minimize waste by adhering to the following design principles: understand your impact, maximize resource utilization and use managed services, and anticipate change and prepare for improvements. These principles help you stay focused as you dive into the focus areas: achieving business results with sustainability in mind, effectively managing your data and its lifecycle, and being ready for and driving continuous improvement.

Design principles

The Sustainability pillar focuses on designing and implementing the solution through the following design principles:

  • Understand your impact – Measure the sustainability impact of your IDP workload and model the future impact of your workload. Include all sources of impact, including the impact of customer use of your products. This also includes the impact of IDP that enables digitization and allows your company or customers to complete paperless processes. Establish key performance indicators (KPIs) for your IDP workload to evaluate ways to improve productivity and efficiency while reducing environmental impact.
  • Maximize resource utilization and use managed services – Minimize idle resources, processing, and storage to reduce the total energy required to run your IDP workload. AWS operates at scale, so sharing services across a broad customer base helps maximize resource utilization, which maximizes energy efficiency and reduces the amount of infrastructure needed to support IDP workloads. With AWS managed services, you can minimize the impact of your IDP workload on compute, networking, and storage.
  • Anticipate change and prepare for improvements – Anticipate change and support the upstream improvements your partners and suppliers make to help you reduce the impact of your IDP workloads. Continuously monitor and evaluate new, more efficient hardware and software offerings. Design for flexibility to lower barriers for introducing changes and allow for the rapid adoption of new efficient technologies.

Focus areas

The design principles and best practices of the Sustainability pillar are based on insights gathered from our customers and our IDP technical specialist communities. You can use them as guidance to support your design decisions and align your IDP solution with your business and sustainability requirements.

The following are the focus areas for sustainability of IDP solutions in the cloud: achieve business results with sustainability in mind, effectively manage your data and its lifecycle, and be ready for and drive continuous improvement.

Achieve business results with sustainability in mind

To determine the best Regions for your business needs and sustainability goals, we recommend the following steps:

  • Evaluate and shortlist potential Regions – Start by shortlisting potential Regions for your workload based on your business requirements, including compliance, cost, and latency. Newer services and features are deployed to Regions gradually. Refer to List of AWS Services Available by Region to check which Regions have the services and features you need to run your IDP workload.
  • Choose a Region powered by renewable energy – From your shortlist, identify Regions close to Amazon’s renewable energy projects. Based on the Greenhouse Gas (GHG) Protocol, there are two methods for tracking emissions from electricity production: market-based and location-based. Companies can choose one of these methods based on their sustainability policies to track and compare their emissions from year to year. Amazon uses the market-based model to report its emissions. To reduce your carbon footprint, select a Region where, in 2022, the electricity consumed was attributable to 100% renewable energy.

Effectively manage your data and its lifecycle

Data plays a key role throughout your IDP solution. Starting with the initial data ingestion, data is pushed through various stages of processing, and finally returned as output to end-users. It’s important to understand how data management choices will affect the overall IDP solution and its sustainability. Storing and accessing data efficiently, in addition to reducing idle storage resources, results in a more efficient and sustainable architecture. When considering different storage mechanisms, remember that you’re making tradeoffs between resource efficiency, access latency, and reliability. This means you’ll need to select your management pattern accordingly. In this section, we discuss some best practices for data management.

Create and ingest only relevant data

To optimize your storage footprint for sustainability, evaluate what data is needed to meet your business objectives and create and ingest only relevant data along your IDP workflow.

Store only relevant data

When designing your IDP workflow, consider for each step in your workflow which intermediate data outputs need to be stored. In most IDP workflows, it’s not necessary to store the data used or created in each intermediate step because it can be easily reproduced. To improve sustainability, only store data that is not easily reproducible. If you need to store intermediate results, consider whether they qualify for a lifecycle rule that archives and deletes them more quickly than data with stricter retention requirements.

Avoid duplicating data across computing environments such as development and staging. Implement mechanisms to enforce a data lifecycle management process, including archiving and deletion, and continuously identify unused data and delete it.

To optimize your data ingestion and storage, consider the optimal data resolution that satisfies your use case. Amazon Textract works best with images of at least 150 DPI. If your document isn’t in a format supported by Amazon Textract (PDF, TIFF, JPEG, or PNG) and you need to convert it, experiment to find the optimal resolution for good results rather than defaulting to the maximum resolution.
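
To make the resolution trade-off concrete, here is a small helper that computes target pixel dimensions for a page while keeping the effective DPI within a bounded range. The 150–300 DPI bounds are an illustrative assumption for this sketch, not a fixed requirement.

```python
def target_pixels(width_in: float, height_in: float, dpi: int = 150,
                  min_dpi: int = 150, max_dpi: int = 300) -> tuple[int, int]:
    """Return pixel dimensions for a page, clamping DPI to a sensible range.

    Rendering far above max_dpi inflates storage and transfer costs without
    improving extraction quality, so the DPI is capped.
    """
    dpi = max(min_dpi, min(dpi, max_dpi))
    return round(width_in * dpi), round(height_in * dpi)

# A US Letter page (8.5 x 11 inches) at the 150 DPI minimum:
print(target_pixels(8.5, 11))  # (1275, 1650)
```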

Use the right technology to store data

For IDP workflows, most of the data is likely to be documents. Amazon Simple Storage Service (Amazon S3) is an object storage built to store and retrieve any amount of data from anywhere, making it well suited for IDP workflows. Using different Amazon S3 storage tiers is a key component of optimizing storage for sustainability.

By storing less volatile data on technologies designed for efficient long-term storage, you can optimize your storage footprint. For archiving data or storing data that changes slowly, Amazon S3 Glacier and Amazon S3 Glacier Deep Archive are available. Depending on your data classification and workflow, you can also choose Amazon S3 One Zone-IA, which reduces power and server capacity by storing data within a single Availability Zone.

Actively manage your data lifecycle according to your sustainability goals

Managing your data lifecycle means optimizing your storage footprint. For IDP workflows, first identify your data retention requirements. Based on your retention requirements, create Amazon S3 Lifecycle configurations that automatically transition objects to a different storage class according to your predefined rules. For data with no retention requirements and unknown or changing access patterns, use Amazon S3 Intelligent-Tiering to monitor access patterns and automatically move objects between tiers.
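
As a sketch of such a lifecycle configuration, the following payload (for the S3 `put_bucket_lifecycle_configuration` API) archives one hypothetical prefix after 90 days and expires another after 30 days. The prefixes, day counts, and bucket name are assumptions, and the API call is shown commented out.

```python
lifecycle = {
    "Rules": [
        {
            "ID": "archive-processed-documents",
            "Filter": {"Prefix": "processed/"},  # hypothetical prefix
            "Status": "Enabled",
            # Move long-term documents to an archive tier after 90 days
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
        {
            "ID": "expire-intermediate-outputs",
            "Filter": {"Prefix": "intermediate/"},
            "Status": "Enabled",
            # Delete easily reproducible intermediate results after 30 days
            "Expiration": {"Days": 30},
        },
    ]
}

# With AWS credentials configured:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-idp-bucket", LifecycleConfiguration=lifecycle
# )
```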

Continuously optimize your storage footprint by using the right tools

Over time, the data usage and access pattern in your IDP workflow may change. Tools like Amazon S3 Storage Lens deliver visibility into storage usage and activity trends, and even make recommendations for improvements. You can use this information to further lower the environmental impact of storing data.

Enable data and compute proximity

As you make your IDP workflow available to more customers, the amount of data traveling over the network will increase. Similarly, the larger the size of the data and the greater the distance a packet must travel, the more resources are required to transmit it.

Reducing the amount of data sent over the network and optimizing the path a packet takes will result in more efficient data transfer. Setting up data storage close to data processing helps optimize sustainability at the network layer. Ensure that the Region used to store the data is the same Region where you have deployed your IDP workflow. This approach helps minimize the time and cost of transferring data to the computing environment.

Be ready for and drive continuous improvement

Improving sustainability for your IDP workflow is a continuous process that requires flexible architectures and automation to support smaller, frequent improvements. When your architecture is loosely coupled and uses serverless and managed services, you can enable new features without difficulty and replace components to improve sustainability and gain performance efficiencies. In this section, we share some best practices.

Improve safely and continuously through automation

Using automation to deploy all changes reduces the potential for human error and enables you to test before making production changes to ensure your plans are complete. Automate your software delivery process using continuous integration and continuous delivery (CI/CD) pipelines to test and deploy potential improvements to reduce effort and limit errors caused by manual processes. Define changes using infrastructure as code (IaC): all configurations should be defined declaratively and stored in a source control system like AWS CodeCommit, just like application code. Infrastructure provisioning, orchestration, and deployment should also support IaC.

Use serverless services for workflow orchestration

IDP workflows are typically characterized by high peaks and periods of inactivity (such as outside of business hours), and are mostly driven by events (for example, when a new document is uploaded). This makes them a good fit for serverless solutions. AWS serverless services can help you build a scalable solution for IDP workflows quickly and sustainably. Services such as AWS Lambda, AWS Step Functions, and Amazon EventBridge help orchestrate your workflow driven by events and minimize idle resources to improve sustainability.

Use an event-driven architecture

Using AWS serverless services to implement an event-driven approach will allow you to build scalable, fault-tolerant IDP workflows and minimize idle resources.

For example, you can configure Amazon S3 to start a new workflow when a new document is uploaded. Amazon S3 can trigger EventBridge or call a Lambda function to start an Amazon Textract detection job. You can use Amazon Simple Notification Service (Amazon SNS) topics for event fanout or to send job completion messages. You can use Amazon Simple Queue Service (Amazon SQS) for reliable and durable communication between microservices, such as invoking a Lambda function to read Amazon Textract output and then calling a custom Amazon Comprehend classifier to classify a document.
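
The S3-to-Textract hand-off described above can be sketched as a minimal Lambda handler. The Textract client is injected as a parameter so the event-parsing logic can be exercised locally with a stub; the event shape follows the standard S3 notification format, and the stub job ID is purely illustrative.

```python
def handler(event, context=None, textract=None):
    """Start an asynchronous Textract text-detection job for the uploaded object."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    if textract is None:
        import boto3
        textract = boto3.client("textract")
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return {"bucket": bucket, "key": key, "job_id": response["JobId"]}


# Local exercise with a stub client and a minimal S3 event:
class StubTextract:
    def start_document_text_detection(self, DocumentLocation):
        return {"JobId": "stub-job-1"}

event = {"Records": [{"s3": {"bucket": {"name": "docs"}, "object": {"key": "a.pdf"}}}]}
print(handler(event, textract=StubTextract()))
```

In production, the job completion message would arrive on an SNS topic configured via the `NotificationChannel` parameter of the same API, rather than being polled.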

Use managed services like Amazon Textract and Amazon Comprehend

You can perform IDP using a self-hosted custom model or managed services such as Amazon Textract and Amazon Comprehend. By using managed services instead of your custom model, you can reduce the effort required to develop, train, and retrain your custom model. Managed services use shared resources, reducing the energy required to build and maintain an IDP solution and improving sustainability.

Review AWS blog posts to stay informed about feature updates

There are several blog posts and resources available to help you stay on top of AWS announcements and learn about new features that may improve your IDP workload.

AWS re:Post is a community-driven Q&A service designed to help AWS customers remove technical roadblocks, accelerate innovation, and enhance operations. AWS re:Post has over 40 topics, including a community dedicated to AWS Well-Architected. AWS also has service-specific blogs to help you stay up to date on Amazon Textract and Amazon Comprehend.

Conclusion

In this post, we shared design principles, focus areas, and best practices for optimizing sustainability in your IDP workflow. To learn more about sustainability in the cloud, refer to the following series on Optimizing your AWS Infrastructure for Sustainability, Part I: Compute, Part II: Storage, and Part III: Networking.

To learn more about the IDP Well-Architected Custom Lens, explore the following posts in this series:

AWS is committed to the IDP Well-Architected Lens as a living tool. As IDP solutions and related AWS AI services evolve, and as new AWS services become available, we will update the IDP Well-Architected Lens accordingly.

To get started with IDP on AWS, refer to Guidance for Intelligent Document Processing on AWS to design and build your IDP application. For a deeper dive into end-to-end solutions that cover data ingestion, classification, extraction, enrichment, verification and validation, and consumption, refer to Intelligent document processing with AWS AI services: Part 1 and Part 2. Additionally, Intelligent document processing with Amazon Textract, Amazon Bedrock, and LangChain covers how to extend a new or existing IDP architecture with large language models (LLMs). You’ll learn how you can integrate Amazon Textract with LangChain as a document loader, use Amazon Bedrock to extract data from documents, and use generative AI capabilities within the various IDP phases.

If you require additional expert guidance, contact your AWS account team to engage an IDP Specialist Solutions Architect.


About the Author

Christian Denich is a Global Customer Solutions Manager at AWS. He is passionate about automotive, AI/ML, and developer productivity. He supports some of the world’s largest automotive brands on their cloud journey, encompassing cloud and business strategy as well as technology. Before joining AWS, Christian worked at BMW Group in both hardware and software development on various projects, including connected navigation.


How Amazon Search M5 saved 30% for LLM training cost by using AWS Trainium


For decades, Amazon has pioneered and innovated machine learning (ML), bringing delightful experiences to its customers. From the earliest days, Amazon has used ML for various use cases such as book recommendations, search, and fraud detection. Similar to the rest of the industry, the advancements of accelerated hardware have allowed Amazon teams to pursue model architectures using neural networks and deep learning (DL).

The M5 program within Amazon Search owns the discovery learning strategy for Amazon and builds large-scale models that are multilingual, multi-locale, multi-entity, multitask, and multimodal, spanning text, image, and video. The M5 program has been serving universal embeddings and large-scale foundation models to hundreds of ML teams across Amazon while maintaining strict controls over cost optimization. To achieve this, the M5 team regularly evaluates new techniques to reduce cost.

Like many ML organizations, M5 makes extensive use of accelerators to speed up DL training and inference. When AWS launched purpose-built accelerators with the first release of AWS Inferentia in 2020, the M5 team quickly began using them to deploy production workloads more efficiently, reducing both cost and latency. Last year, AWS launched AWS Trainium accelerators, which optimize performance per cost for developing and building next-generation DL models. In this post, we discuss how M5 reduced the cost to train its models by 30%, and share some of the best practices we learned along the way.

Trainium instances

With the advances in purpose-built accelerators, Amazon also provides compelling options in the form of AWS Inferentia and Trainium. As their names imply, these chips are optimized for inference and training workloads, respectively. For large-scale training of foundation models that reach billions of parameters in size, Trainium Trn1 and Trn1n instances are ideal choices due to their characteristics. Trn1 instances are powered by the state-of-the-art NeuronCore-v2 and offer ample accelerator compute and memory. Trn1n instances can also be chosen for greater networking bandwidth (1,600 Gbps), making them ideally suited for performant training with cost optimization in mind.

To use accelerators, you need a software layer to support them. With Trn and Inf chips, the AWS Neuron SDK unlocks Amazon’s purpose-built accelerators with the help of PyTorch XLA. PyTorch XLA converts PyTorch’s eager-mode execution to a lazy-mode, graph-based implementation. These graphs are then compiled for the accelerator. PyTorch Neuron (part of the Neuron SDK) enables PyTorch users to train their models on Trainium NeuronCores with a few lines of code.

Model and workload

The M5 team trains and deploys foundation models and universal representations to assist various teams across Amazon in bringing delight to Amazon.com customers. One such model is a text encoder followed by a multi-layer perceptron (MLP), with explicit or implicit feature interactions defined by the neural network architecture and hundreds of millions of trainable parameters. This model is trained on billions of tokens, and is used to generate millions of embeddings in an offline batch inference setting. These embeddings are inputs to a customer-facing tier-1 Amazon service.

The infrastructure for the production pipeline uses AWS Batch with fair share queuing strategies, using an EFA-enabled multi-node trn1.32xlarge cluster as the compute for model training. Functionally, the production pipeline performs incremental model training, evaluation of trained model, and offline batch inference on the trained model, all using PyTorch as the underlying DL library.

Goals

Delighting our customers is a foremost tenet. Given the customer-facing nature of the pipeline, it’s critical that all service-level agreements (SLAs) be met without regressions. We identified two critical acceptance criteria to adapt our existing GPU production pipeline and transition it to Trainium:

  • Model quality – The quality of our models directly impacts customer experience. We require that there should be less than 0.1% difference in model quality between GPU and Trainium.
  • Training throughput – We retrain our models periodically to provide the freshest experience to our customers. We require that model convergence be achieved within a predefined period of time (such as 1 week) to meet our production SLAs.

In the following sections, we share our journey of working backward from these criteria, and our learnings to support Amazon-scale production workloads.

Training script

Before starting with model training, we need to make changes to the training script to make it XLA compliant. Given the size of the model, we use distributed data parallel (DDP) to train the model. DDP allows us to increase the throughput of model training by scaling up the number of machines used to run model training, without any code changes. We followed the instructions provided in the Neuron PyTorch MLP training tutorial to add XLA-specific constructs in our training scripts. These code changes are straightforward to implement. The following are some significant technical learnings from the exercise that greatly improved our model throughput:

  • Placement of xm.mark_step() – xm.mark_step() compiles and runs the lazily collected computation graphs. Invoking mark_step too often leads to a larger number of small graphs, whereas invoking it too rarely leads to few, but large, graphs. Depending on your application, the throughput and implementation of your model training will vary based on your placement of xm.mark_step(). Our implementation places one xm.mark_step() after a forward and backward pass, and one after the optimizer step.
  • Data loader wrapping with XLA multiprocessing device loader – This is a critical step that can be easily missed. The multiprocessing device loader torch_xla.distributed.parallel_loader.MpDeviceLoader loads training data on each XLA device, with options to preload and overlap data loading with device runs to improve throughput. The device loader also invokes xm.mark_step() and is therefore able to build graphs for loading data from host to device.
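
The mark_step placement described above can be illustrated with a stand-in callable, so the control flow is visible without torch_xla installed. In a real training script, mark_step would be torch_xla.core.xla_model's mark_step, and the two other callables would be the actual forward/backward and optimizer logic; here they are stubs that only record the call order.

```python
def train_steps(batches, forward_backward, optimizer_step, mark_step):
    """Run one epoch, cutting the XLA graph at the two points described above."""
    for batch in batches:
        forward_backward(batch)
        mark_step()       # cut the graph after the forward and backward pass
        optimizer_step()
        mark_step()       # cut the graph after the optimizer step

# Record the call order with stubs standing in for the real components:
calls = []
train_steps(
    batches=[1, 2, 3],
    forward_backward=lambda b: calls.append("fb"),
    optimizer_step=lambda: calls.append("opt"),
    mark_step=lambda: calls.append("mark"),
)
print(calls.count("mark"))  # 6: two graph cuts per training step
```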

Compilation for Trainium

Traditionally, the model development cycle with GPUs involves making changes to the model or training script and directly running it on the GPU device. Accelerators such as Trainium that use XLA require an additional step before model training can be run on the accelerator. XLA computation graphs can only be run after they have been compiled. Generally, there are two ways to perform this compilation: Ahead of Time (AOT), where you trace and compile all graphs first and then run them, or Just In Time (JIT), where graphs are traced, compiled, and run as they are encountered. The Neuron SDK provides both of these out of the box. Typically, AOT compilation is performed first. Graphs are then run after this compilation. If new graphs are encountered, the Neuron runtime invokes a JIT compilation before running them. To perform AOT compilation, the Neuron SDK provides neuron_parallel_compile, a compilation utility that extracts graphs from a trial run of the training script and performs parallel AOT compilation.

An important aspect of AOT compilation is to ensure that no new computation graphs are created over the course of training. One source of new computation graphs (and therefore recompilations) is dynamic shapes of the training batches during model training. We found that using static shapes and fixed-size batches eliminates training-time compilations and greatly improves training throughput without any effect on model accuracy. By enforcing such constraints on training, we observed that only 4–5 steps of model training, one step of model validation, and checkpointing the model once are required to trace all the graphs during AOT compilation. It’s important to note that the Neuron SDK is constantly evolving, and in the future will support dynamic shapes as well.

Furthermore, the compiled graphs are stored in the Neuron Persistent Cache on disk or in an Amazon Simple Storage Service (Amazon S3) bucket. This is especially useful for production workloads where model architecture and training configuration doesn’t change. Therefore, the overhead of compilation is incurred just one time. Using the cache is as simple as setting an environment flag:

export NEURON_COMPILE_CACHE_URL="s3://BUCKET/KEY"

The Neuron compiler also provides three compiler-level optimization options (O1, O2, O3) to balance compilation time and model run throughput. O1 enables core optimizations on the compute graph and minimizes compilation time, O3 provides improved model run throughput at the cost of higher compilation time, and O2 (default option) is a balance between the two. For our use case, we used the O1 optimization and observed an 86% reduction in compilation time with no change to model accuracy metrics, while observing approximately a 5–7% reduction in throughput compared to the default optimization (O2). Depending on the use case, you can choose different levels of optimization.

To summarize, we used the following flags for compilation:

NEURON_CC_FLAGS="--target trn1 --auto-cast all --auto-cast-type bf16 --model-type transformer --optlevel O1"

Checkpoint compatibility

When compilation is successfully complete, we can proceed to train our models on Trainium. As mentioned earlier, we incrementally train our models, meaning we load a previously trained model checkpoint and continue training with new data. PyTorch and PyTorch XLA allow seamless transitioning between accelerators through checkpoint interoperability. Having the flexibility of moving between GPU and Trainium enabled us to seamlessly load the previous GPU model and train on Trainium machines. This was critical to ensure that we can initialize our model with the best previously trained model without any production downtime or loss in model accuracy.

Because the GPU model was saved using standard PyTorch model saving utilities, we were able to use the PyTorch checkpoint loading utility to load the GPU model on Trainium devices.

For example, on GPU/CPU, you can save the model with the following code:

torch.save(model.state_dict(), PATH)

Then you load the model back on Trainium:

import torch_xla.core.xla_model as xm
xla_device = xm.xla_device()
model = MyModel(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.to(xla_device)

Similarly, you can save the model on Trainium with the following code:

import torch_xla.core.xla_model as xm
# automatically moves the data to CPU for the master device
xm.save(model.state_dict(), PATH) 

And load the model back on GPU/CPU:

model = MyModel(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.to(device) # can be any device

In fact, because we use DDP for model training, the model loading is agnostic of the number of machines used to train the previous checkpoint. This allows us to horizontally scale the Trn1 fleet with no code changes or adverse effects to model training. These PyTorch-based checkpoints can be directly used or even torch-scripted for inference use cases on AWS Inferentia2 or other accelerators.
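One practical detail when moving checkpoints between a DDP-wrapped model and a bare model (not specific to Trainium) is that `torch.nn.parallel.DistributedDataParallel` prefixes parameter names with `module.`. A minimal sketch of normalizing such a checkpoint, operating on a plain dict standing in for a real PyTorch state_dict:

```python
def strip_ddp_prefix(state_dict):
    """Remove the 'module.' prefix that a DDP wrapper adds to parameter names,
    so a checkpoint saved from a wrapped model loads into an unwrapped one.
    Keys without the prefix pass through unchanged."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

The same dict then loads cleanly with `model.load_state_dict(...)` regardless of whether the target model is wrapped.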

Operational stability

It cannot be emphasized enough that running workloads in production requires multiple SLAs to be met. For our use case, apart from the model quality and training throughput SLAs, it’s imperative that the production pipeline be operationally stable, meaning minimal downtime and disruptions during model training, evaluation, and inference.

As with the existing GPU-based pipeline, we added numerous mechanisms to make the pipeline operationally stable. Before starting model training, we run multiple sanity tests to assess the health of the machines. These tests generally include simple tensor operations to verify the health of the accelerator devices. We have observed that for distributed training, it’s important to run tests that verify collective communication between instances as well. We used the NCCOM test suite from the Neuron SDK to achieve this, running a variety of operations such as all-gather, all-reduce, and reduce-scatter.
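The exact test invocations are environment specific, but the pre-flight pattern itself is simple: run each health check, collect failures, and refuse to start training on an unhealthy fleet. A hypothetical harness shape (the check callables here are placeholders for real device tensor ops and NCCOM-style collective tests):

```python
def run_preflight_checks(checks):
    """Run named health checks (e.g., a tensor op per accelerator device,
    a collective op across instances); return the names of any that fail,
    treating exceptions as failures."""
    failures = []
    for name, check in checks:
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        if not healthy:
            failures.append(name)
    return failures

# Placeholder checks standing in for real accelerator and collective tests
checks = [
    ("tensor_op", lambda: 2 + 2 == 4),
    ("all_reduce", lambda: True),
]
```

A training job would call `run_preflight_checks(checks)` first and abort (or replace the node) when the returned list is non-empty.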

Even after following the suggestions we’ve mentioned, we have observed that transient issues are inevitable in any pipeline, irrespective of the underlying accelerator. To build resiliency in any training pipeline, we recommend building in retry mechanisms to resolve these potential issues. We use AWS Batch automated retries to retry jobs that encounter a transient failure during model training. These restarts can be costly if a failure is encountered towards the end of training. To counter this problem, we have adapted our training scripts to load a previously trained model checkpoint and continue training from that point. With this functionality, we are able to aggressively restart failed training jobs with minimal overhead.

With these resiliency mechanisms in place, we were able to achieve 98.5% success rates for our workloads on Trn1, comparable to our existing GPU pipeline success rates.

Results

To validate the accuracy of our models, we initialized two models from the same GPU checkpoint, and trained one on Trainium and the other on a comparable GPU. Both models were trained with the same training hyperparameters. The dataset used for metrics calculation is a holdout dataset, and we evaluate the model’s accuracy on this dataset every N global steps. In the following graph, the x-axis is the global step and the y-axis is the model accuracy; we observed less than a 0.1% difference in model accuracy at each point.

Furthermore, to evaluate the cost-effectiveness of the model training, we prefer to compare the wall clock time taken to reach model convergence. We believe this provides a more practical view of cost savings compared to measures such as cost per token, achieved FLOPS/dollar, and other factors. Considering the training time of trn1.32xl and comparable Amazon Elastic Compute Cloud (Amazon EC2) instances, we have observed that Trainium offers up to 30% cheaper cost to model convergence.

Conclusion

There are many factors to consider when evaluating different accelerators for your DL workloads. Some of the most important are model quality, throughput, cost, and availability. It is paramount to ensure that your model quality and throughput are not sacrificed based on the accelerator you choose.

Thanks to our partnership and collaboration with the Annapurna Neuron team, the Amazon Search M5 team has been able to save up to 30% in cost by moving to Trainium. The team is able to use Trainium and achieve model quality and throughput parity with comparable accelerators in the market. Checkpoint interoperability and minimal code changes with support for XLA have allowed M5 to choose between multiple accelerators for their workloads. This has enabled the M5 team to take advantage of the large compute power of Trainium, and build accelerator agnostic solutions to delight Amazon.com customers. From an operational standpoint, Trainium has been proven capable of supporting tier-1 services at Amazon scale. The M5 team continues to move more workloads to Trainium to provide the best models for Amazon at the lowest costs.

In summary, the M5 team has been able to perform cost-effective, production-grade ML training by adding Trainium to the fleet of accelerators. We encourage you to take a look at Trainium and other Neuron devices like AWS Inferentia to reap the benefits of purpose-built Amazon silicon for ML workloads. Get started easily with one of the many tutorials featuring different models, like Llama 2, available on Trainium.


About the Authors

Jerry Mannil is a software engineer at Amazon Search. He works on improving the efficiency, robustness, and scalability of the distributed training infrastructure.

Ken Su is a software engineer at Amazon Search. He works on improving training efficiency and scalable distributed training workflow. Outside work, he likes hiking and tennis.

RJ is an Engineer within Amazon. He builds and optimizes distributed systems for training and works on reducing latency for ML inference. Outside work, he is exploring using generative AI for building food recipes.

Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real time inference.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Geospatial generative AI with Amazon Bedrock and Amazon Location Service

Today, geospatial workflows typically consist of loading data, transforming it, and then producing visual insights like maps, text, or charts. Generative AI can automate these tasks through autonomous agents. In this post, we discuss how to use foundation models from Amazon Bedrock to power agents to complete geospatial tasks. These agents can perform various tasks and answer questions using location-based services like geocoding available through Amazon Location Service. We also share sample code that uses an agent to bridge the capabilities of Amazon Bedrock with Amazon Location. Additionally, we discuss the design considerations that went into building it.

Amazon Bedrock is a fully managed service that offers an easy-to-use API for accessing foundation models for text, image, and embedding. Amazon Location offers an API for maps, places, and routing with data provided by trusted third parties such as Esri, HERE, Grab, and OpenStreetMap. If you need full control of your infrastructure, you can use Amazon SageMaker JumpStart, which gives you the ability to deploy foundation models and has access to hundreds of models.

Solution overview

In the realm of large language models (LLMs), an agent is an entity that can autonomously reason and complete tasks with an LLM’s help. This allows LLMs to go beyond text generation to conduct conversations and complete domain-specific tasks. To guide this behavior, we employ reasoning patterns. According to the research paper Large Language Models are Zero-Shot Reasoners, LLMs excel at high-level reasoning, despite having a knowledge cutoff.

We selected Claude 2 as our foundation model from Amazon Bedrock with the aim of creating a geospatial agent capable of handling geospatial tasks. The overarching concept was straightforward: think like a geospatial data scientist. The task involved writing Python code to read data, transform it, and then visualize it in an interesting map. We utilized a prompting pattern known as Plan-and-Solve Prompting for this purpose.

Using a Plan-and-Solve strategy allows for multi-step reasoning and developing a high-level plan as the first task. This works well for our load, transform, and visualize workflow, and is the high-level plan our agent will use. Each of these subtasks is sent to Claude 2 to solve separately.

We devised an example task to create a price heatmap of Airbnb listings in New York. To plan a path to complete the task, the agent needs to understand the dataset. The agent needs to know the columns in the dataset and the type of data in those columns. We generate a summary from the dataset so the agent can plan for the task provided by the user, in this case, generating a heatmap.

Prerequisites

There are a few prerequisites to deploy the demo. You’ll need access to an AWS account with an access key or AWS Identity and Access Management (IAM) role with permissions to Amazon Bedrock and Amazon Location. You will need to create a map, a place index, and an Amazon Location API key using the Amazon Location console. You will also need access to either a local or virtual environment where Docker is installed. In our demonstration, we use an Amazon Elastic Compute Cloud (Amazon EC2) instance running Amazon Linux with Docker installed.

Read and summarize the data

To give the agent context about the dataset, we prompt Claude 2 to write Python code that reads the data and provides a summary relevant to our task. The following are a few of the prompts we included. The full list is available in the prompts.py file in the project.

Your Role: You are a geospatial data analyzer designed to analyze data schema from arbitrary geospatial datasets.
Your Task: You are provided a set of file URLs. You need to generate a Python function that meets the following:
    Read each file using geopandas. Each file could be CSV, shape file, or GeoJSON.
    Use built-in function resolved_file_url= get_data_file_url(file_url, session_id) to get downloadable URLs.

From these prompts, Claude 2 generated the following Python code:

import geopandas as gpd
import pandas as pd
import numpy as np
 
def analyze_geospatial_data(file_urls):
    dataframes = []
    for file_url in file_urls:
        resolved_file_url = get_data_file_url(file_url, 'a5b188fe-91f7-4f92-92fe-c4030721a427', 'local')
        if file_url.endswith('.csv'):
            df = pd.read_csv(resolved_file_url)
        elif file_url.endswith('.shp'):
            df = gpd.read_file(resolved_file_url)
        elif file_url.endswith('.parquet'):
            df = pd.read_parquet(resolved_file_url)
        elif file_url.endswith('.geojson'):
            df = gpd.read_file(resolved_file_url)
        else:
            raise ValueError('Unsupported file format')
 
        df = df.sample(n=3, random_state=42)
        dataframes.append({
            'file_url': file_url, 
            'resolved_file_url': resolved_file_url,
            'data_frame': df,
            'column_names': list(df.columns)
        })
    return dataframes
 
file_urls = ['agent://airbnb_listings_price.csv']
dataframes = analyze_geospatial_data(file_urls)

The agent:// prefix tells our CLI that this file is stored inside the session storage folder. When you run the CLI, it creates a geospatial-agent-session-storage folder to store local data. The agent now recognizes the uploaded Airbnb data and its column names. Next, let’s ask Claude 2 for some hints to generate a heatmap using these columns. We pass in the following prompts.
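The mapping behind that prefix can be pictured as a small path resolver. The following is a hypothetical re-implementation for illustration; the project’s actual helper is `get_data_file_url`:

```python
from pathlib import Path

# Folder the CLI creates for per-session local data
SESSION_ROOT = Path("geospatial-agent-session-storage")

def resolve_agent_url(file_url, session_id):
    """Map agent:// URLs into the per-session storage folder;
    pass anything else (http(s) URLs, local paths) through unchanged."""
    prefix = "agent://"
    if not file_url.startswith(prefix):
        return file_url
    return str(SESSION_ROOT / session_id / file_url[len(prefix):])
```

So `agent://airbnb_listings_price.csv` for session `abc` would resolve to `geospatial-agent-session-storage/abc/airbnb_listings_price.csv`.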

Your Role: You are a geospatial data analyzer designed to analyze data schema from arbitrary geospatial data sets.
Your Task: You are provided with a table with some rows data. Your task is to generate a summary that describes the data in the table following the requirements below.
    The first sentence should be summary of the data in the table from the aspect of the user action.
    If there is no geometry column in the table, the second sentence should note column names that can be used to generate a geometry column in geopandas.

Claude 2 replies with a hint:

Summary: The table contains columns for listing id, name, host id, host name, location details like neighborhood, latitude, and longitude that can be used to generate a geometry column for plotting. Price column contains listing price that can be used to generate a heatmap of listing prices.

Plan a solution

Now we can ask Claude 2 to draft a high-level plan. We can use a graph library for Python called NetworkX to map out the steps for solving our problem.

Similar to our last prompt, we pass in a few requirements to guide Claude 2:

Your Role: You are a geospatial data scientist and an expert Python developer.
Your Task: A series of consecutive steps.
Your Reply needs to meet these requirements:
    Create a single NetworkX graph.
    No disconnected components allowed.
    First operations are data loading.
    Use geopandas for spatial data.

From these prompts, Claude 2 generated the following Python code:

import networkx as nx
G = nx.DiGraph()
# Load the data
G.add_node("airbnb_data_url", node_type="data", data_path="agent://airbnb_listings_price.csv", 
           description="URL for Airbnb listing price data")
G.add_node("load_data", node_type="operation", operation_type="load",
           description="Load the Airbnb listing price data")
G.add_edge("airbnb_data_url", "load_data")
G.add_node("airbnb_df", node_type="data", 
           description="Pandas DataFrame containing Airbnb listing prices")
G.add_edge("load_data", "airbnb_df")
# Transform the data
G.add_node("add_geometry", node_type="operation", operation_type="transform",
           description="Create a geometry column using latitude and longitude")
G.add_edge("airbnb_df", "add_geometry")
G.add_node("airbnb_gdf", node_type="data",
           description="GeoDataFrame with geometry column added")
G.add_edge("add_geometry", "airbnb_gdf")
# Visualize the data
G.add_node("plot_heatmap", node_type="operation", operation_type="visualization",
           description="Generate a heatmap of listing prices")
G.add_edge("airbnb_gdf", "plot_heatmap")
G.add_node("price_heatmap", node_type="data", 
           description="The heatmap plot showing Airbnb listing prices")
G.add_edge("plot_heatmap", "price_heatmap")

In this NetworkX graph, Claude 2 breaks down the process into three main segments:

  • Loading data – Importing the Airbnb listing prices from a given URL into a Pandas DataFrame
  • Transforming data – Creating a geometry column based on the latitude and longitude coordinates
  • Visualizing data – Generating a heatmap to display the pricing of Airbnb listings

This approach allows for a clear and straightforward implementation of the geospatial task at hand. We can use GraphViz to visualize the following workflow.

Graph
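NetworkX can hand the graph to Graphviz directly (for example through its pydot interface), but the DOT output for a workflow this small is simple enough to sketch by hand. A hypothetical helper, with the edge list mirroring the graph built above:

```python
def to_dot(edges):
    """Render a workflow edge list as Graphviz DOT text;
    pure-string sketch, no graphviz installation needed."""
    lines = ["digraph workflow {"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# Edges from the NetworkX graph Claude 2 generated
edges = [
    ("airbnb_data_url", "load_data"),
    ("load_data", "airbnb_df"),
    ("airbnb_df", "add_geometry"),
    ("add_geometry", "airbnb_gdf"),
    ("airbnb_gdf", "plot_heatmap"),
    ("plot_heatmap", "price_heatmap"),
]
```

Feeding `to_dot(edges)` to the `dot` command produces the load → transform → visualize diagram shown here.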

Implement the plan

Now that Claude 2 has provided us with a plan, it’s time to bring it to life. For each step, we prompt Claude 2 to write the corresponding code. To keep Claude 2 focused, we supply high-level requirements for each task. Let’s dive into the code that Claude 2 generated for each individual phase.

Load the data

To load the Airbnb listing price data into a Pandas DataFrame, we create a prompt and pass in some parameters. The “Load Airbnb data” operation task references the load_data node in the graph we created earlier.

Your Role: You are a geospatial data scientist and an expert Python developer.
Operation_task: You need to generate a Python function to do: Load Airbnb data.
Your Reply needs to meet these requirements:
    The function definition is: 
    load_data(airbnb_data_url='agent://airbnb_listings_price.csv').
    When doing spatial joins, retain at least 1 geometry column.
    Point function requires importing shapely library.

From these prompts, Claude 2 generated the following Python code:

import pandas as pd
def load_data(airbnb_data_url='agent://airbnb_listings_price.csv'):
    """
    Load Airbnb listing price data
    """
    file_url = get_data_file_url(airbnb_data_url, session_id, storage_mode) 
    airbnb_df = pd.read_csv(file_url)
    return airbnb_df

Transform the data

Next, Claude 2 generates the code to add a geometry column to our DataFrame using latitude and longitude. For this prompt, we pass in the following requirements:

Your Role: You are a geospatial data scientist and an expert Python developer.
Operation_task: You need to generate a Python function to do: 
    Add geometry column from latitude/longitude.
    Your Reply needs to meet these requirements:
    The function definition is: add_geometry(filtered_df).
    The function return line is: return airbnb_gdf.

From these prompts, Claude 2 generated the following Python code:

import geopandas as gpd
from shapely.geometry import Point
def add_geometry(airbnb_df):
    """
    Add geometry column from latitude and longitude
    """
    airbnb_df['geometry'] = airbnb_df.apply(lambda row: Point(row['longitude'], row['latitude']), axis=1)
    airbnb_gdf = gpd.GeoDataFrame(airbnb_df, geometry='geometry')
 
    return airbnb_gdf

Visualize the data

Finally, Claude 2 builds a heatmap visualization using pydeck, which is a Python library for spatial rendering. For this prompt, we pass in the following requirements:

Your Role: You are a geospatial data scientist and an expert Python developer.
Operation_task: You need to generate a Python function to do: Generate heatmap from price.
Your Reply needs to meet these requirements:
    The function definition is: generate_heatmap(airbnb_gdf).
    The function return line is: return airbnb_heatmap.

From these prompts, Claude 2 generated the following Python code:

import pandas as pd
import geopandas as gpd
import pydeck as pdk
 
def plot_heatmap(airbnb_gdf):
    """
    Plot heatmap of listing prices
    """
    price_heatmap = pdk.Deck(
        map_style=location_map_style(),
        initial_view_state=pdk.ViewState(
            latitude=airbnb_gdf['latitude'].mean(),
            longitude=airbnb_gdf['longitude'].mean(),
            zoom=11,
            pitch=50,
        ),
        layers=[
            pdk.Layer(
                'HexagonLayer',
                data=airbnb_gdf,
                get_position=['longitude', 'latitude'],
                radius=100,
                elevation_scale=4,
                elevation_range=[0, 1000],
                pickable=True,
                extruded=True,
            ),
            pdk.Layer(
                'ScatterplotLayer',
                data=airbnb_gdf,
                get_position=['longitude', 'latitude'],
                get_color='[200, 30, 0, 160]',
                get_radius=200,
            ),
        ],
    )
 
    # Save heatmap HTML
    price_heatmap.to_html(get_local_file_path('airbnb_heatmap.html', session_id, task_name))
 
    return price_heatmap

When Claude 2 returns a response, it also includes some helpful notes explaining how each function meets the provided requirements. For example, for the heatmap visualization, Claude 2 noted the following:

"This function generates a heatmap of Airbnb listing prices using pydeck and saves the resulting HTML locally. It fulfills the requirements specified in the prompt."

Assemble the generated code

Now that Claude 2 has created the individual building blocks, it’s time to put it all together. The agent automatically assembles all these snippets into a single Python file. This script calls each of our functions in sequence, streamlining the entire process.

The final step looks like the following code:

session_id = "a5b188fe-91f7-4f92-92fe-c4030721a427"
task_name = "1694813661_airbnb_listings_price_heatmap"
storage_mode = "local"
# Sequentially invoke the functions
airbnb_df = load_data(airbnb_data_url='agent://airbnb_listings_price.csv')
airbnb_gdf = add_geometry(airbnb_df)
price_heatmap = plot_heatmap(airbnb_gdf)
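One way such an assembly step can work is to hoist and deduplicate the snippets’ import lines before concatenating the function bodies and the sequential driver code. This is a sketch of that idea; the project’s actual assembler may differ:

```python
def assemble_script(snippets, driver):
    """Combine generated code snippets into one script: collect import lines
    (deduplicated, first-seen order) at the top, then the function bodies,
    then the driver code that calls them in sequence."""
    imports, bodies = [], []
    for snippet in snippets:
        body_lines = []
        for line in snippet.strip().splitlines():
            if line.startswith(("import ", "from ")):
                if line not in imports:
                    imports.append(line)
            else:
                body_lines.append(line)
        bodies.append("\n".join(body_lines).strip())
    return "\n".join(imports) + "\n\n" + "\n\n".join(bodies) + "\n\n" + driver.strip()
```

Running the three generated functions through this yields a single file with one copy of each import, the three function definitions, and the driver shown above at the bottom.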

After the script is complete, we can see that Claude 2 has created an HTML file with the code to visualize our heatmap. The following image shows New York on an Amazon Location basemap with a heatmap visualizing Airbnb listing prices.

Heat Map Visualization

Use Amazon Location with Amazon Bedrock

Although our Plan-and-Solve agent can handle this geospatial task, we need to take a slightly different approach for tasks like geocoding an address. For this, we can use a strategy called ReAct, where we combine reasoning and acting with our LLM.

In the ReAct pattern, the agent reasons and acts based on customer input and the tools at its disposal. To equip this Claude 2-powered agent with the capability to geocode, we developed a geocoding tool. This tool uses the Amazon Location Places API, specifically the SearchPlaceIndexForText method, to convert an address into its geographic coordinates.
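The tool itself reduces to one SDK call plus response parsing. A sketch using boto3 (the place index name is the one created during setup; error handling and retries omitted):

```python
def geocode(address, index_name="AgentPlaceIndex", region="us-east-1"):
    """Geocode an address with Amazon Location's SearchPlaceIndexForText API."""
    import boto3  # deferred so the parsing helper below works without AWS credentials
    client = boto3.client("location", region_name=region)
    response = client.search_place_index_for_text(
        IndexName=index_name, Text=address, MaxResults=1
    )
    return extract_point(response)

def extract_point(response):
    """Pull (longitude, latitude) from a SearchPlaceIndexForText response;
    Amazon Location returns Point coordinates as [longitude, latitude]."""
    point = response["Results"][0]["Place"]["Geometry"]["Point"]
    return point[0], point[1]
```

The agent exposes `geocode` as a tool, and the ReAct loop decides when to call it based on the user’s request.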

Agent: Hi! I'm Agent Smith, your conversational geospatial assistant. How can I assist you today?
You: >? Hello, can you give me the coordinates for 112 E 11th St, Austin, TX 78701?
Agent: The coordinates for 112 E 11th St, Austin, TX 78701 are longitude -97.740590981087 and latitude 30.274118017533.

Within this brief exchange, the agent deciphers your intent to geocode an address, activates the geocoding tool, and returns the latitude and longitude.

Whether it’s plotting a heatmap or geocoding an address, Claude 2 combined with agent strategies like ReAct and Plan-and-Solve can simplify geospatial workflows.

Deploy the demo

To get started, complete the following steps:

  1. Clone the following repository either to your local machine or to an EC2 instance. You may need to run aws configure --profile <profilename> and set a default Region; this application was tested using us-east-1.
git clone https://github.com/aws-samples/amazon-location-geospatial-agent/

Now that we have the repository cloned, we configure our environment variables.

  2. Change directories into the cloned project folder:
cd amazon-location-geospatial-agent
  3. Edit the .env file using your preferred text editor:
vim .env
  4. Add your map name, place index name, and API key:
API_KEY_NAME=AgentAPIKey
MAP_NAME=AgentMap
PLACE_INDEX_NAME=AgentPlaceIndex
  5. Run the following command to build your container:
docker build -t agent .
  6. Run the following command to run and connect to your Docker container:
docker run --rm -it -v ~/.aws:/root/.aws --entrypoint bash agent
  7. Grab the Airbnb dataset:
apt install -y wget
wget http://data.insideairbnb.com/united-states/ny/new-york-city/2023-10-01/visualisations/listings.csv
cp listings.csv data/listings.csv
  8. Run the following command to create a session. We use sessions to isolate unique chat environments.
SESSION_ID="3c18d48c-9c9b-488f-8229-e2e8016fa851" FILE_NAME="listings.csv" make create-session

Now you’re ready to start the application.

  9. Run the following command to begin the chat application:
poetry run agent --session-id 3c18d48c-9c9b-488f-8229-e2e8016fa851 --profile <profilename>

You will be greeted with a chat prompt.

  10. You can begin by asking the following question:
I've uploaded the file listings.csv. Draw a heatmap of Airbnb listing price.

The agent grabs the Airbnb_listings_price.csv file we have downloaded to the /data folder and parses it into a geospatial DataFrame. Then it generates the code to transform the data as well as the code for the visualization. Finally, it creates an HTML file that will be written in the /data folder, which you can open to visualize the heatmap in a browser.

Another example uses the Amazon Location Places API to geocode an address. If we ask the agent to geocode the address 112 E 11th St, Austin, TX 78701, we will get a response as shown in the following image.

Example Interaction

Conclusion

In this post, we provided a brief overview of Amazon Bedrock and Amazon Location, and how you can use them together to analyze and visualize geospatial data. We also walked through Plan-and-Solve and ReAct and how we used them in our agent.

Our example only scratches the surface. Try downloading our sample code and adding your own agents and tools for your geospatial tasks.


About the authors

Jeff Demuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys traveling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.

Swagata Prateek is a Senior Software Engineer working in Amazon Location Service at Amazon Web Services (AWS) where he focuses on Generative AI and geospatial.

How Amazon Music uses SageMaker with NVIDIA to optimize ML training and inference performance and cost

In the dynamic world of streaming on Amazon Music, every search for a song, podcast, or playlist holds a story, a mood, or a flood of emotions waiting to be unveiled. These searches serve as a gateway to new discoveries, cherished experiences, and lasting memories. The search bar is not just about finding a song; it’s about the millions of active users starting their personal journey into the rich and diverse world that Amazon Music has to offer.

Delivering a superior customer experience to instantly find the music that users search for requires a platform that is both smart and responsive. Amazon Music uses the power of AI to accomplish this. However, optimizing the customer experience while managing cost of training and inference of AI models that power the search bar’s capabilities, like real-time spellcheck and vector search, is difficult during peak traffic times.

Amazon SageMaker provides an end-to-end set of services that allow Amazon Music to build, train, and deploy on the AWS Cloud with minimal effort. By taking care of the undifferentiated heavy lifting, SageMaker allows you to focus on working on your machine learning (ML) models, and not worry about things such as infrastructure. As part of the shared responsibility model, SageMaker makes sure that the services it provides are reliable, performant, and scalable, while you make sure the application of the ML models makes the best use of the capabilities that SageMaker provides.

In this post, we walk through the journey Amazon Music took to optimize performance and cost using SageMaker and NVIDIA Triton Inference Server and TensorRT. We dive deep into showing how that seemingly simple, yet intricate, search bar works, ensuring an unbroken journey into the universe of Amazon Music with little-to-zero frustrating typo delays and relevant real-time search results.

Amazon SageMaker and NVIDIA: Delivering fast and accurate vector search and spellcheck capabilities

Amazon Music offers a vast library of over 100 million songs and millions of podcast episodes. However, finding the right song or podcast can be challenging, especially if you don’t know the exact title, artist, or album name, or the searched query is very broad, such as “news podcasts.”

Amazon Music has taken a two-pronged approach to improve the search and retrieval process. The first step is to introduce vector search (also known as embedding-based retrieval), an ML technique that can help users find the most relevant content they’re looking for by using semantics of the content. The second step involves introducing a Transformer-based Spell Correction model in the search stack. This can be especially helpful when searching for music, because users may not always know the exact spelling of a song title or artist name. Spell correction can help users find the music they’re looking for even if they make a spelling mistake in their search query.
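The intuition behind embedding-based retrieval can be shown with toy vectors: embed the query and catalog items in the same space, then rank by cosine similarity. (A production system uses learned sentence embeddings and an approximate nearest-neighbor index rather than this brute-force scan; the item names and two-dimensional vectors below are illustrative only.)

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_vec, catalog, k=2):
    """Rank catalog items (id -> embedding) by semantic similarity to the query."""
    scored = sorted(catalog.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [item_id for item_id, _ in scored[:k]]
```

A query whose embedding points in nearly the same direction as an item’s embedding surfaces that item first, even when no title keywords match, which is what lets broad queries like “news podcasts” find relevant content.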

Introducing Transformer models in a search and retrieval pipeline (for the query embedding generation needed in vector search, and the generative Seq2Seq Transformer model in Spell Correction) may lead to a significant increase in overall latency, negatively affecting the customer experience. Therefore, optimizing the real-time inference latency for the vector search and spell correction models became a top priority for us.

Amazon Music and NVIDIA have come together to bring the best possible customer experience to the search bar, using SageMaker to implement both fast and accurate spellcheck capabilities and real-time semantic search suggestions using vector search-based techniques. The solution includes using SageMaker hosting powered by G5 instances that uses NVIDIA A10G Tensor Core GPUs, SageMaker-supported NVIDIA Triton Inference Server Container, and the NVIDIA TensorRT model format. By reducing the inference latency of the spellcheck model to 25 milliseconds at peak traffic, and reducing search query embedding generation latency by 63% on average and cost by 73% compared to CPU based inference, Amazon Music has elevated the search bar’s performance.

Additionally, when training the AI model to deliver accurate results, Amazon Music achieved a whopping 12-fold acceleration in training time for their BART sequence-to-sequence spell corrector transformer model, saving them both time and money by optimizing their GPU utilization.

Amazon Music partnered with NVIDIA to prioritize the customer search experience and craft a search bar with well-optimized spellcheck and vector search functionalities. In the following sections, we share more about how these optimizations were orchestrated.

Optimizing training with NVIDIA Tensor Core GPUs

Gaining access to an NVIDIA Tensor Core GPU for large language model training is not enough to capture its true potential. There are key optimization steps that must happen during training in order to fully maximize the GPU’s utilization. However, an underutilized GPU will undoubtedly lead to inefficient use of resources, prolonged training durations, and increased operational costs.

During the initial phases of training the spell corrector BART (bart-base) transformer model on a SageMaker ml.p3.24xlarge instance (8 NVIDIA V100 Tensor Core GPUs), Amazon Music’s GPU utilization was around 35%. To maximize the benefits of NVIDIA GPU-accelerated training, AWS and NVIDIA solution architects supported Amazon Music in identifying areas for optimizations, particularly around the batch size and precision parameters. These two crucial parameters influence the efficiency, speed, and accuracy of training deep learning models.

The resulting optimizations yielded a new and improved V100 GPU utilization, steady at around 89%, drastically reducing Amazon Music’s training time from 3 days to 5–6 hours. By switching the batch size from 32 to 256 and using optimization techniques like running automatic mixed precision training instead of only using FP32 precision, Amazon Music was able to save both time and money.

The following chart illustrates the 54-percentage-point increase in GPU utilization after optimizations.

The following figure illustrates the acceleration in training time.

This increase in batch size enabled the NVIDIA GPU to process significantly more data concurrently across multiple Tensor Cores, resulting in accelerated training time. However, it’s important to maintain a delicate balance with memory, because larger batch sizes demand more memory. Both increasing batch size and employing mixed precision can be critical in unlocking the power of NVIDIA Tensor Core GPUs.

After the model was trained to convergence, it was time to optimize for inference deployment on Amazon Music’s search bar.

Spell Correction: BART model inferencing

With the help of SageMaker G5 instances and NVIDIA Triton Inference Server (an open source inference serving software), as well as NVIDIA TensorRT, an SDK for high-performance deep learning inference that includes an inference optimizer and runtime, Amazon Music limits their spellcheck BART (bart-base) model server inference latency to just 25 milliseconds at peak traffic. This includes overheads like load balancing, preprocessing, model inferencing, and postprocessing times.

NVIDIA Triton Inference Server provides two different kinds of backends: one for hosting models on GPU, and a Python backend where you can bring your own custom code to be used in preprocessing and postprocessing steps. The following figure illustrates the model ensemble scheme.

Amazon Music built its BART inference pipeline by running both preprocessing (text tokenization) and postprocessing (tokens to text) steps on CPUs, whereas the model execution step runs on NVIDIA A10G Tensor Core GPUs. A Python backend sits in the middle of the preprocessing and postprocessing steps, and is responsible for communicating with the TensorRT-converted BART models as well as the encoder/decoder networks. TensorRT boosts inference performance with precision calibration, layer and tensor fusion, kernel auto-tuning, dynamic tensor memory, multi-stream execution, and time fusion.
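For readers unfamiliar with Triton ensembles, a pipeline like the one described above (CPU preprocessing, GPU model execution, CPU postprocessing) is typically wired together in an ensemble model's config.pbtxt. The following is a minimal, hypothetical sketch; the model names, tensor names, and shapes are illustrative placeholders, not Amazon Music's actual configuration:

```protobuf
name: "bart_ensemble"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "CORRECTED_TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
ensemble_scheduling {
  step [
    { # Python backend: tokenization on CPU
      model_name: "preprocess"
      model_version: -1
      input_map { key: "RAW_TEXT", value: "RAW_TEXT" }
      output_map { key: "INPUT_IDS", value: "tokens" }
    },
    { # TensorRT-converted BART on GPU
      model_name: "bart_trt"
      model_version: -1
      input_map { key: "input_ids", value: "tokens" }
      output_map { key: "output_ids", value: "out_tokens" }
    },
    { # Python backend: detokenization on CPU
      model_name: "postprocess"
      model_version: -1
      input_map { key: "OUTPUT_IDS", value: "out_tokens" }
      output_map { key: "TEXT_OUT", value: "CORRECTED_TEXT" }
    }
  ]
}
```

Triton resolves the `input_map`/`output_map` entries to pass tensors between the steps, so a single inference request flows through all three models without a network round trip between stages.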

The following figure illustrates the high-level design of the key modules that make up the spell corrector BART model inferencing pipeline.

Vector search: Query embedding generation sentence BERT model inferencing

The following chart illustrates the 60% improvement in latency (serving p90 800–900 TPS) when using the NVIDIA AI Inference Platform compared to a CPU-based baseline.

The following chart shows a 70% improvement in cost when using the NVIDIA AI Inference Platform compared to a CPU-based baseline.

The following figure illustrates TensorRT, an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.

To achieve these results, Amazon Music experimented with several different Triton deployment parameters using Triton Model Analyzer, a tool that helps find the best NVIDIA Triton model configuration for efficient inference. To optimize model inference, Triton offers features like dynamic batching and concurrent model execution, along with framework-specific capabilities for additional flexibility. Dynamic batching gathers inference requests and seamlessly groups them into cohorts to maximize throughput, all while ensuring real-time responses for Amazon Music users. Concurrent model execution further enhances inference performance by hosting multiple copies of the model on the same GPU. Finally, using Triton Model Analyzer, Amazon Music carefully fine-tuned the dynamic batching and model concurrency hosting parameters under simulated traffic to find the settings that maximize inference performance.
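As a rough illustration of the tuning knobs mentioned above, dynamic batching and model concurrency are both set in a model's config.pbtxt. The values below are placeholders for illustration, not the settings Model Analyzer found for Amazon Music:

```protobuf
# Queue requests briefly so they can be grouped into larger batches
dynamic_batching {
  max_queue_delay_microseconds: 100
}

# Host two copies of the model on the same GPU for concurrent execution
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Model Analyzer sweeps combinations of these parameters (queue delay, instance count, batch sizes) against simulated load and reports the configuration with the best throughput/latency trade-off.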

Conclusion

Optimizing configurations with Triton Inference Server and TensorRT on SageMaker allowed Amazon Music to achieve outstanding results for both training and inference pipelines. The SageMaker platform is the end-to-end open platform for production AI, providing quick time to value and the versatility to support all major AI use cases across both hardware and software. By optimizing V100 GPU utilization for training and switching from CPUs to G5 instances using NVIDIA A10G Tensor Core GPUs, as well as by using optimized NVIDIA software like Triton Inference Server and TensorRT, companies like Amazon Music can save time and money while boosting performance in both training and inference, directly translating to a better customer experience and lower operating costs.

SageMaker handles the undifferentiated heavy lifting for ML training and hosting, allowing Amazon Music to deliver reliable, scalable ML operations across both hardware and software.

We encourage you to regularly evaluate the hardware and software choices for your SageMaker workloads to see if there are ways you can achieve better performance at lower cost.

To learn more about NVIDIA AI in AWS, refer to the following:


About the authors

Siddharth Sharma is a Machine Learning Tech Lead on the Science & Modeling team at Amazon Music. He specializes in search, retrieval, ranking, and NLP-related modeling problems. Siddharth has a rich background working on large-scale, latency-sensitive machine learning problems such as ads targeting, multimodal retrieval, and search query understanding. Prior to Amazon Music, Siddharth worked at companies such as Meta, Walmart Labs, and Rakuten on e-commerce-centric ML problems. He spent the early part of his career working with Bay Area ad-tech startups.

Tarun Sharma is a Software Development Manager leading Amazon Music Search Relevance. His team of scientists and ML engineers is responsible for providing contextually relevant and personalized search results to Amazon Music customers.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Tugrul Konuk is a Senior Solution Architect at NVIDIA, specializing in large-scale training, multimodal deep learning, and high-performance scientific computing. Prior to NVIDIA, he worked in the energy industry, focusing on developing algorithms for computational imaging. As part of his PhD, he worked on physics-based deep learning for numerical simulations at scale. In his leisure time, he enjoys reading, playing the guitar, and playing the piano.

Rohil Bhargava is a Product Marketing Manager at NVIDIA, focused on deploying NVIDIA application frameworks and SDKs on specific CSP platforms.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon's AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Read More

Machine Learning with MATLAB and Amazon SageMaker

Machine Learning with MATLAB and Amazon SageMaker

This post is written in collaboration with Brad Duncan, Rachel Johnson and Richard Alcock from MathWorks.

MATLAB is a popular programming tool for a wide range of applications, such as data processing, parallel computing, automation, simulation, machine learning, and artificial intelligence. It's heavily used in many industries such as automotive, aerospace, communication, and manufacturing. In recent years, MathWorks has brought many product offerings into the cloud, especially on Amazon Web Services (AWS). For more details about MathWorks cloud products, see MATLAB and Simulink in the Cloud or email MathWorks.

In this post, we bring MATLAB’s machine learning capabilities into Amazon SageMaker, which has several significant benefits:

  • Compute resources: Using the high-performance computing environment offered by SageMaker can speed up machine learning training.
  • Collaboration: MATLAB and SageMaker together provide a robust platform that teams can use to collaborate effectively on building, testing, and deploying machine learning models.
  • Deployment and accessibility: Models can be deployed as SageMaker real-time endpoints, making them readily accessible for other applications to process live streaming data.

We show you how to train a MATLAB machine learning model as a SageMaker training job and then deploy the model as a SageMaker real-time endpoint so it can process live, streaming data.

To do this, we’ll use a predictive maintenance example where we classify faults in an operational pump that’s streaming live sensor data. We have access to a large repository of labeled data generated from a Simulink simulation that has three possible fault types in various combinations (that is, one healthy state and seven faulty states). Because we have a model of the system and faults are rare in operation, we can take advantage of simulated data to train our algorithm. The model can be tuned to match operational data from our real pump using parameter estimation techniques in MATLAB and Simulink.

Our objective is to demonstrate the combined power of MATLAB and Amazon SageMaker using this fault classification example.

We start by training a classifier model on our desktop with MATLAB. First, we extract features from a subset of the full dataset using the Diagnostic Feature Designer app, and then run the model training locally with a MATLAB decision tree model. Once we’re satisfied with the parameter settings, we can generate a MATLAB function and send the job along with the dataset to SageMaker. This allows us to scale up the training process to accommodate much larger datasets. After training our model, we deploy it as a live endpoint which can be integrated into a downstream app or dashboard, such as a MATLAB Web App.

This example will summarize each step, providing a practical understanding of how to leverage MATLAB and Amazon SageMaker for machine learning tasks. The full code and description for the example is available in this repository.

Prerequisites

  1. A working environment of MATLAB R2023a or later with MATLAB Compiler and the Statistics and Machine Learning Toolbox on Linux. Here is a quick guide on how to run MATLAB on AWS.
  2. Docker set up in an Amazon Elastic Compute Cloud (Amazon EC2) instance where MATLAB is running, on Ubuntu or another Linux distribution.
  3. Installation of the AWS Command Line Interface (AWS CLI), aws configure, and Python3.
    1. The AWS CLI should already be installed if you followed the installation guide from step 1.
    2. Run aws configure to set up access to AWS resources.
    3. Verify your Python 3 installation by running python -V or python --version on your terminal. Install Python if necessary.
  4. Copy this repo to a folder in your Linux machine by running:
    git clone https://github.com/mathworks/Machine-Learning-with-MATLAB-and-Amazon-Sagemaker-Demo.git

  5. Check the permission on the repo folder. If it does not have write permission, run the following shell command:
    sudo chmod -R 777

  6. Build the MATLAB training container and push it to the Amazon Elastic Container Registry (Amazon ECR).
    • Navigate to folder docker
    • Create an Amazon ECR repo using the AWS CLI (replace REGION with your preferred AWS region)
      aws ecr create-repository \
          --repository-name sagemaker-matlab-training \
          --image-scanning-configuration scanOnPush=true \
          --region REGION

    • Run the following docker command:
      docker build -t sagemaker-matlab-training-r2023a .

      docker tag sagemaker-matlab-training-r2023a ACCOUNT.dkr.ecr.REGION.amazonaws.com/sagemaker-matlab-training-r2023a:latest

      aws ecr get-login-password --region REGION | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.REGION.amazonaws.com

      docker push ACCOUNT.dkr.ecr.REGION.amazonaws.com/sagemaker-matlab-training-r2023a:latest

  7. Open MATLAB and open the live script called PumpFaultClassificationMATLABSageMaker.mlx in folder examples/PumpFaultClassification. Make this folder your current working folder in MATLAB.

Part 1: Data preparation & feature extraction 

The first step in any machine learning project is to prepare your data. MATLAB provides a wide range of tools for importing, cleaning, and extracting features from your data:

load SensorData.mat

The SensorData.mat dataset contains 240 records. Each record has two timetables: flow and pressure. The target column is faultcode, which is a binary representation of three possible fault combinations in the pump. Each time series table has 1,201 rows, mimicking 1.2 seconds of pump flow and pressure measurements sampled at 0.001-second increments.
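To make the "binary representation" concrete: each of the three fault types can be viewed as one bit of faultcode, giving 2³ = 8 states (one healthy, seven faulty). A small Python sketch, using hypothetical fault names that are not taken from the dataset:

```python
# Hypothetical fault names -- illustrative only, not from the dataset
FAULTS = ["seal_leak", "blocked_inlet", "bearing_wear"]

def decode_faultcode(code):
    """Return the list of active faults encoded in the bits of a fault code."""
    return [name for bit, name in enumerate(FAULTS) if code & (1 << bit)]

healthy = decode_faultcode(0b000)  # no bits set: healthy pump
combo = decode_faultcode(0b101)    # bits 0 and 2 set: two simultaneous faults
```

This bit encoding is why three fault types yield exactly eight distinct fault codes for the classifier to predict.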

Next, the Diagnostic Feature Designer app allows you to extract, visualize, and rank a variety of features from the data. Here, you use Auto Features, which quickly extracts a broad set of time and frequency domain features from the dataset and ranks the top candidates for model training. You can then export a MATLAB function that will recompute the top 15 ranked features from new input data. Let’s call this function extractFeaturesTraining. This function can be configured to take in data all in one batch or as streaming data.

This function produces a table of features with associated fault codes, as shown in the following figure:

Part 2: Organize data for SageMaker 

Next, you need to organize the data in a way that SageMaker can use for machine learning training. Typically, this involves splitting the data into training and validation sets and splitting the predictor data from the target response.

In this stage, other more complex data cleaning and filtering operations might be required. In this example, the data is already clean. Potentially, if the data processing is very complex and time consuming, SageMaker processing jobs can be used to run these jobs apart from SageMaker training so that they can be separated into two steps.

trainPredictors = trainingData(:,2:end);

trainResponse = trainingData(:,1);

Part 3: Train and test a machine learning model in MATLAB 

Before moving to SageMaker, it’s a good idea to build and test the machine learning model locally in MATLAB. This allows you to quickly iterate and debug the model. You can set up and train a simple decision tree classifier locally.

classifierModel = fitctree(...
 trainPredictors,...
 trainResponse,...
 OptimizeHyperparameters='auto');

The training job here should take less than a minute to finish and generates some graphs to indicate the training progress. After the training is finished, a MATLAB machine learning model is produced. The Classification Learner app can be used to try many types of classification models and tune them for best performance, then produce the needed code to replace the model training code above.

After checking the accuracy metrics for the locally-trained model, we can move the training into Amazon SageMaker.

Part 4: Train the model in Amazon SageMaker 

After you’re satisfied with the model, you can train it at scale using SageMaker. To begin calling SageMaker SDKs, you need to initiate a SageMaker session.

session = sagemaker.Session();

Specify a SageMaker execution IAM role that training jobs and endpoint hosting will use.

role = "arn:aws:iam::ACCOUNT:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXXXXXXXXXXX";

From MATLAB, save the training data as a .csv file to an Amazon Simple Storage Service (Amazon S3) bucket.

writetable(trainingData,'pump_training_data.csv');

trainingDataLocation = "s3://"+session.DefaultBucket+"/cooling_system/input/pump_training";

copyfile("pump_training_data.csv", trainingDataLocation);

Create a SageMaker Estimator

Next, you need to create a SageMaker estimator and pass all the necessary parameters to it, such as a training docker image, training function, environment variables, training instance size, and so on. The training image URI should be the Amazon ECR URI you created in the prerequisite step with the format ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/sagemaker-matlab-training-r2023a:latest. The training function should be provided at the bottom of the MATLAB live script.

SageMaker Estimator Console

trainingImage = "ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/sagemaker-matlab-training-r2023a:latest"; 
 
est = sagemaker.MATLABEstimator(... 
    role, ... 
    Image=trainingImage, ... 
    Session=session, ... 
    BaseJobName="PumpDecisionTreeMatlab", ... 
    Environment = loadenv(fullfile(rootFolder, "training.env")), ... 
    TrainingFunction = @trainingFunction, ... 
    HyperParameters = struct(), ... % named args to train_decision_tree 
    InstanceType="ml.m5.large", ... 
    MaxRunTime=minutes(10), ...     
    MaxWaitTime=minutes(20), ... 
    UseSpotInstances=true); 

Submit SageMaker training job

Calling the fit method from the estimator submits the training job into SageMaker.

est.fit(training=struct(Location=trainingDataLocation, ContentType="text/csv"))

You can also check the training job status from the SageMaker console:

SageMaker Training Job Console

After the training job finishes, selecting the job link takes you to the job description page where you can see the MATLAB model saved in the dedicated S3 bucket:

SageMaker Endpoint Output

Part 5: Deploy the model as a real-time SageMaker endpoint 

After training, you can deploy the model as a real-time SageMaker endpoint, which you can use to make predictions in real time. To do this, call the deploy method from the estimator. This is where you can set up the desired instance size for hosting depending on the workload.

predictor = est.deploy(role, "ClassificationTreeInferenceHandler", uint8(1), "ml.m5.large")

Behind the scenes, this step builds an inference docker image and pushes it to the Amazon ECR repository; nothing is required from the user to build the inference container. The image contains all the necessary information to serve the inference request, such as model location, MATLAB authentication information, and algorithms. After that, Amazon SageMaker creates a SageMaker endpoint configuration and finally deploys the real-time endpoint. The endpoint can be monitored in the SageMaker console and can be terminated anytime if it’s no longer used.

SageMaker Endpoint Monitor Console

Part 6: Test the endpoint 

Now that the endpoint is up and running, you can test the endpoint by giving it a few records to predict. Use the following code to select 10 records from the training data and send them to the endpoint for prediction. The prediction result is sent back from the endpoint and shown in the following image.

input = trainPredictors(10:19,:) 
prediction = predictor.predict(input)

Prediction Result

Part 7: Dashboard integration 

The SageMaker endpoint can be called by many native AWS services. It can also be used as a standard REST API if deployed together with an AWS Lambda function and Amazon API Gateway, which can be integrated with any web application. For this particular use case, you can use streaming ingestion with Amazon SageMaker Feature Store and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to make machine learning-backed decisions in near real time. Another possible integration is using a combination of Amazon Kinesis, SageMaker, and Apache Flink to build a managed, reliable, scalable, and highly available application that’s capable of real-time inferencing on a data stream.

After algorithms are deployed to a SageMaker endpoint, you might want to visualize them using a dashboard that displays streaming predictions in real time. In the custom MATLAB web app that follows, you can see pressure and flow data by pump, and live fault predictions from the deployed model.

This dashboard also includes a remaining useful life (RUL) model to predict the time to failure for each pump in question. To learn how to train RUL algorithms, see Predictive Maintenance Toolbox.

Pump Health Status Dashboard

Clean Up

After you run this solution, make sure you clean up any unneeded AWS resources to avoid unexpected costs. You can clean up these resources using the SageMaker Python SDK or the AWS Management Console for the specific services used here (SageMaker, Amazon ECR, and Amazon S3). By deleting these resources, you prevent further charges for resources you’re no longer using.

Conclusion

We’ve demonstrated how you can bring MATLAB to SageMaker for a pump predictive maintenance use case with the entire machine learning lifecycle. SageMaker provides a fully managed environment for running machine learning workloads and deploying models with a great selection of compute instances serving various needs.

Disclaimer: The code used in this post is owned and maintained by MathWorks. Refer to the license terms in the GitHub repo. For any issues with the code or feature requests, please open a GitHub issue in the repository.

References


About the Authors

Brad Duncan is the product manager for machine learning capabilities in the Statistics and Machine Learning Toolbox at MathWorks. He works with customers to apply AI in new areas of engineering such as incorporating virtual sensors in engineered systems, building explainable machine learning models, and standardizing AI workflows using MATLAB and Simulink. Before coming to MathWorks he led teams for 3D simulation and optimization of vehicle aerodynamics, user experience for 3D simulation, and product management for simulation software. Brad is also a guest lecturer at Tufts University in the area of vehicle aerodynamics.

Richard Alcock is the senior development manager for Cloud Platform Integrations at MathWorks. In this role, he is instrumental in seamlessly integrating MathWorks products into cloud and container platforms. He creates solutions that enable engineers and scientists to harness the full potential of MATLAB and Simulink in cloud-based environments. He was previously a software engineer at MathWorks, developing solutions to support parallel and distributed computing workflows.

Rachel Johnson is the product manager for predictive maintenance at MathWorks, and is responsible for overall product strategy and marketing. She was previously an application engineer directly supporting the aerospace industry on predictive maintenance projects. Prior to MathWorks, Rachel was an aerodynamics and propulsion simulation engineer for the US Navy. She also spent several years teaching math, physics, and engineering.

Shun Mao is a Senior AI/ML Partner Solutions Architect in the Emerging Technologies team at Amazon Web Services. He is passionate about working with enterprise customers and partners to design, deploy and scale AI/ML applications to derive their business values. Outside of work, he enjoys fishing, traveling and playing Ping-Pong.

Ramesh Jatiya is a Solutions Architect in the Independent Software Vendor (ISV) team at Amazon Web Services. He is passionate about working with ISV customers to design, deploy and scale their applications in cloud to derive their business values. He is also pursuing an MBA in Machine Learning and Business Analytics from Babson College, Boston. Outside of work, he enjoys running, playing tennis and cooking.

Read More

Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart

Text embedding and sentence similarity retrieval at scale with Amazon SageMaker JumpStart

Text vectors or embeddings are numerical vector representations of text that are generated by large language models (LLMs). After LLMs are fully pre-trained on a large dataset or fine-tuned from different tasks, including text completion, question answering, and translations, text embeddings capture semantic information of the input text. Different downstream applications are made possible by text embeddings, including similarity searching, information retrieval, recommendations and personalization, multilingual translations, and more.

Before intelligent applications can be built from embeddings, enterprises and organizations must first embed their existing documents, which can be expensive and technically complicated. Amazon SageMaker JumpStart is a machine learning (ML) hub that helps accelerate this journey. With SageMaker JumpStart, you can access pre-trained, cutting-edge text embedding models from various model providers, including Hugging Face, AI21 Labs, Cohere, and Meta AI. You can seamlessly deploy these models into production with the SageMaker JumpStart user interface or SDK. In addition, none of your data is used to train the underlying models. Because all data is encrypted and doesn’t leave your VPC, you can trust your data remains private and confidential.

In this post, we demonstrate how to use the SageMaker Python SDK for text embedding and sentence similarity. Sentence similarity involves assessing the likeness between two pieces of text after they are converted into embeddings by the LLM, which is a foundation step for applications like Retrieval Augmented Generation (RAG). We demonstrate how to do the following:

  • Run inference on a text embedding model deployed from SageMaker JumpStart
  • Find the nearest neighbors for an input sentence with your own dataset
  • Run the batch transform on large documents to minimize costs

All the code is available on GitHub.

Deploy a text embedding model via SageMaker JumpStart

To host a model on Amazon SageMaker, the first step is to set up and authenticate the use of AWS services. In Amazon SageMaker Studio, we use the execution role associated with the notebook instance. See the following code:

import sagemaker, boto3, json
from sagemaker.session import Session
sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

On Hugging Face, the Massive Text Embedding Benchmark (MTEB) is provided as a leaderboard for diverse text embedding tasks. It currently provides 129 benchmarking datasets across 8 different tasks on 113 languages. The top text embedding models from the MTEB leaderboard are made available from SageMaker JumpStart, including bge, gte, e5, and more. In this post, we use huggingface-sentencesimilarity-bge-large-en as an example. We can use the SageMaker SDK to deploy this state-of-the-art text embedding model:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-sentencesimilarity-bge-large-en"
text_embedding_model = JumpStartModel(model_id=model_id)
predictor = text_embedding_model.deploy()

Text embedding model query

Let’s look at the text embedding model query in more detail.

Text to embedding

If you have already deployed a SageMaker endpoint before, the predictor can be restored as follows:

from sagemaker.predictor import Predictor
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import IdentitySerializer

predictor = Predictor(
    endpoint_name=<YOUR_ENDPOINT_NAME>,
    deserializer=JSONDeserializer(),
    serializer=IdentitySerializer(),
)
predictor.content_type = "application/x-text"

After the model is successfully deployed, you can query the endpoint with a batch of input texts within a JSON payload:

sentences = [
    # Pets
    "Your dog is so cute.",
    "How cute your dog is!",
    "You have such a cute dog!",
    # Cities
    "Sydney is the place where I work.",
    "I work in Sydney.",
    # Color
    "What colour do you like the most?",
    "What is your favourite colour?",
]

predictor.predict(json.dumps(sentences).encode('utf-8'))

The correlation of the embeddings of these sentences is plotted in the following figure.

correlation_heat_map

As shown in the preceding figure, sentences on the same subject (Pets, Cities, and Color) are highly correlated with one another, while sentences on different subjects are much less similar. This indicates the embeddings generated by the LLM (in this case, bge) represent the semantic information accurately.
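A correlation matrix like the one in the figure can be reproduced from the returned embeddings with a few lines of NumPy. The vectors below are toy stand-ins for the model's actual embedding output:

```python
import numpy as np

def cosine_sim_matrix(embeddings):
    """Pairwise cosine similarity between row vectors."""
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

# Toy 2-D "embeddings": two similar vectors and one near-orthogonal one
S = cosine_sim_matrix([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
```

The diagonal of `S` is 1.0 (each vector is identical to itself), the first two rows score close to 1 against each other, and both score near 0 against the third, which is the same structure the heat map shows for same-subject versus different-subject sentences.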

For this post, we used the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. Latency is the amount of time from the moment that a user sends a request until the time that the application indicates that the request has been completed. The numbers in the following table represent the average latency for a total of 100 requests using the same batch of input texts on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model g5.2xlarge Average Latency (ms) c6i.xlarge Average Latency (ms) Language Support
all-MiniLM-L6-v2 19.5 27.9 English
BGE Base En 21.2 114 English
BGE Small En 28.3 45.6 English
BGE Large En 34.7 337 English
Multilingual E5 Base 22.1 118 Multilingual
Multilingual E5 Large 39.8 360 Multilingual
E5 Base 25.6 117 English
E5 Base V2 25.2 123 English
E5 Large 32.2 339 English
E5 Large V2 32.5 331 English
GTE Base 22.2 112 English
GTE Small 19.7 46 English
GTE Large 39.7 347 English
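Averages like those in the table can be gathered with a simple timing loop. A minimal sketch, where the `invoke` callable stands in for an endpoint call such as `predictor.predict` (not invoked here):

```python
import time

def average_latency_ms(invoke, payload, n=100):
    """Average wall-clock latency of invoke(payload) over n calls, in ms."""
    start = time.perf_counter()
    for _ in range(n):
        invoke(payload)
    return (time.perf_counter() - start) / n * 1000.0

# Example with a stand-in for an endpoint call
latency = average_latency_ms(lambda p: p.upper(), "hello", n=10)
```

In practice you would also discard a few warm-up calls before timing, since the first requests to a freshly deployed endpoint are typically slower.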

Get the nearest neighbors

The deployed model from SageMaker JumpStart can also facilitate the process of identifying the nearest neighbors to queries within the corpus. When provided with queries and a corpus, the model will produce the corpus_id, which denotes the position of the relevant corpus entry in the input corpus list, and a score indicating the degree of proximity to the query. It uses the following parameters:

  • corpus – Provides the list of inputs from which to find the nearest neighbor
  • queries – Provides the list of inputs for which to find the nearest neighbor from the corpus
  • top_k – The number of nearest neighbors to find from the corpus
  • mode – Set as nn_corpus for getting the nearest neighbors to input queries within the corpus

See the following code:

corpus = [
    "Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.",
    "Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest.",
    "Amazon SageMaker provides a full end-to-end workflow, but you can continue to use your existing tools with SageMaker. You can easily transfer the results of each stage in and out of SageMaker as your business requirements dictate."
]
queries = [
    "What is Amazon SageMaker?",
    "How does Amazon SageMaker secure my code?",
    "What if I have my own notebook, training, or hosting environment in my own business environment?"
]

payload_nearest_neighbor = {"corpus": corpus, "queries": queries, "top_k": 3, "mode": "nn_corpus"}
query_response = predictor.predict(payload_nearest_neighbor)

We get the following output:

[
    [
        {'corpus_id': 0, 'score': 0.8992230892181396},
        {'corpus_id': 2, 'score': 0.8664969205856323},
        {'corpus_id': 1, 'score': 0.8456423282623291}
    ],
    [
        {'corpus_id': 1, 'score': 0.8919335603713989},
        {'corpus_id': 0, 'score': 0.840064525604248},
        {'corpus_id': 2, 'score': 0.8145401477813721}
    ],
    [
        {'corpus_id': 2, 'score': 0.7712811231613159},
        {'corpus_id': 1, 'score': 0.7564010620117188},
        {'corpus_id': 0, 'score': 0.7525666356086731}
    ]
]

This result means the first query is most similar to the first corpus entry, the second query is closest to the second entry, and so on. This is a correct match in this example.

We also took the preceding sample and compared the latency across different sentence embedding models currently available from SageMaker JumpStart. The numbers in the following table represent the average latency for a total of 100 requests using the same payload on the ml.g5.2xlarge and ml.c6i.xlarge instances.

Model                  g5.2xlarge Average Latency (ms)   c6i.xlarge Average Latency (ms)   Language Support
all-MiniLM-L6-v2       21.7                              69.1                              English
BGE Base En            29.1                              372                               English
BGE Small En           29.2                              124                               English
BGE Large En           47.2                              1240                              English
Multilingual E5 Base   30                                389                               Multilingual
Multilingual E5 Large  47.1                              1380                              Multilingual
E5 Base                30.4                              373                               English
E5 Base V2             31                                409                               English
E5 Large               45.9                              1230                              English
E5 Large V2            49.6                              1220                              English
GTE Base               30.3                              375                               English
GTE Small              28.5                              129                               English
GTE Large              46.6                              1320                              English
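Averages like these can be reproduced with a simple timing loop. The following sketch uses a hypothetical measure_avg_latency helper and a stub in place of the deployed endpoint; swap in your predictor.predict to benchmark a real model:

```python
import time

def measure_avg_latency(invoke, payload, n_requests=100):
    """Average wall-clock latency in milliseconds over n_requests calls."""
    start = time.perf_counter()
    for _ in range(n_requests):
        invoke(payload)
    return (time.perf_counter() - start) * 1000 / n_requests

# Stub standing in for predictor.predict; replace with the real endpoint call.
def fake_predict(payload):
    return {"embedding": [0.0] * 384}

avg_ms = measure_avg_latency(
    fake_predict,
    {"text_inputs": ["What is Amazon SageMaker?"]},
    n_requests=10,
)
print(f"average latency: {avg_ms:.3f} ms")
```

Note that wall-clock timing from a client also includes network overhead, so run the benchmark from the same Region as the endpoint for comparable numbers.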

Get the nearest neighbors on a large dataset

When making requests to the SageMaker invoke endpoint, payloads are restricted to approximately 5 MB, and the request timeout is set to 1 minute. If your corpus exceeds these limits, you can use a SageMaker training job, which generates embeddings for your large dataset and persists them alongside the model inside the SageMaker endpoint, so they don't have to be passed as part of the invocation payload. The nearest neighbors are found using SentenceTransformer and its utility function, based on the cosine similarity between the input sentence embedding and the sentence embeddings precomputed during the training job.

In the following example, we fetch and prepare the Amazon_SageMaker_FAQs dataset to use it in finding the nearest neighbor to an input question:

!aws s3 cp s3://jumpstart-cache-prod-us-west-2/training-datasets/Amazon_SageMaker_FAQs/Amazon_SageMaker_FAQs.csv Amazon_SageMaker_FAQs.csv

import pandas as pd

data = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Questions", "Answers"])
data["id"] = data.index
data_req = data[["id", "Answers"]]
data_req.to_csv("data.csv", index=False, header=False)

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ss-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
training_dataset_s3_path = f"s3://{output_bucket}/{output_prefix}/data/data.csv"

!aws s3 cp data.csv {training_dataset_s3_path}

Algorithm-specific training hyperparameters can be fetched from the SageMaker SDK or overwritten:

from sagemaker import hyperparameters

hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version = "*")
hyperparameters["batch_size"] = "64"
print(hyperparameters)
>>> {'max_seq_length': 'None', 'batch_size': '64', 'store_text_with_embedding': 'True'}

The SageMaker training consists of two steps: create the estimator object and launch the training job. The output is a model prepackaged with embeddings of your large dataset used as training data, which can be deployed for inference to get the nearest neighbor for any input sentence. See the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters=hyperparameters,
    output_path=s3_output_location
)

estimator.fit(
    {"training": f"s3://{output_bucket}/{output_prefix}/data"}
)
predictor = estimator.deploy()

The query syntax to convert text into embeddings is the same as before. The code to get the nearest neighbor, however, can be simplified as follows:

payload_nearest_neighbour = {
    "queries": ["Is R supported with Amazon SageMaker?"],
    "top_k": 1,
    "mode": "nn_train_data",
}

response = predictor.predict(payload_nearest_neighbour)
>>> [[{'id': '9', 'score': 0.9240573048591614}]]

data["Answers"].iloc[int(response[0][0]["id"])]
>>> "Yes, R is supported with Amazon SageMaker. You can use R within SageMaker notebook instances, which include a preinstalled R kernel and the reticulate library. Reticulate offers an R interface for the Amazon SageMaker Python SDK, enabling ML practitioners to build, train, tune, and deploy R models."

We can also query the endpoint with questions from the Amazon_SageMaker_FAQs dataset and compare how many of the correct corresponding answers are returned. In the following example, we measure top-3 accuracy, given that there could be similar question-answer pairs. This means that if the correct answer is returned among the top 3 results, it's treated as a correct query.

total_correct_answers = 0

for i in range(len(data)):
    question = data["Questions"].iloc[i]
    payload_nearest_neighbor = {
        "queries": [question],
        "top_k": 3,
        "mode": "nn_train_data",
    }
    response = predictor.predict(payload_nearest_neighbor)
    response_ids = [int(res["id"]) for res in response[0]]

    if i in response_ids:
        total_correct_answers += 1
    else:
        pred_answer = [data["Answers"].iloc[response_id] for response_id in response_ids]

print(total_correct_answers*100/len(data))
>>>
81.16883116883118

Run a batch transform to get embeddings on large datasets

For enterprises and organizations with a large volume of historical documents that exceed the memory of a single endpoint instance, you can use SageMaker batch transform to save cost. When you start a batch transform job, SageMaker launches the necessary compute resources to process the data and automatically provisions and manages them for the duration of the job. When the batch transform job is complete, those resources are automatically cleaned up, which minimizes costs. By dividing a large dataset into smaller chunks and using more instances, you can scale out the compute for faster inference at a similar cost, without managing infrastructure. The maximum payload for batch transform is 100 MB, and the timeout is 1 hour.

The input format for our batch transform job is a JSONL file, with entries as a line of JSON, which consists of id and text_inputs. See the following code:

import json

import boto3

test_data_file_name = "test.jsonl"
test_data = []

for i in range(len(data)):
    answer = data.loc[i, "Answers"]
    payload = {"id": i, "text_inputs": answer}
    test_data.append(payload)

with open(test_data_file_name, "w") as outfile:
    for entry in test_data:
        outfile.write(f"{json.dumps(entry)}\n")

s3 = boto3.client("s3")
s3.upload_file(test_data_file_name, output_bucket, f"{output_prefix}/batch_input/test.jsonl")

When the data is ready in Amazon Simple Storage Service (Amazon S3), you can create the batch transform object from the SageMaker JumpStart model, which triggers the transform job:

s3_input_data_path = f"s3://{output_bucket}/{output_prefix}/batch_input/"
s3_output_data_path = f"s3://{output_bucket}/{output_prefix}/batch_output/"

batch_transformer = text_embedding_model.transformer(
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path=s3_output_data_path,
    assemble_with="Line",
    accept="text/csv",
    max_payload=1,
)

batch_transformer.transform(
    s3_input_data_path,
    content_type="application/jsonlines",
    split_type="Line"
)

batch_transformer.wait()

After the batch transform job is complete, you can download the result from Amazon S3:

s3 = boto3.client("s3")
s3.download_file(
    output_bucket, output_prefix + "/batch_output/" + "test.jsonl.out", "predict.jsonl"
)

with open("predict.jsonl", "r") as json_file:
    json_list = list(json_file)
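Each line of the .out file is a JSON record. Assuming each record carries the input id and its embedding vector (the exact field names can vary by model version), the embeddings can be collected into a lookup table, illustrated here with hypothetical lines:

```python
import json

# Hypothetical output lines; real field names can vary by model version.
json_list = [
    '{"id": 0, "embedding": [0.1, 0.2, 0.3]}',
    '{"id": 1, "embedding": [0.4, 0.5, 0.6]}',
]

embeddings = {}
for line in json_list:
    record = json.loads(line)
    embeddings[record["id"]] = record["embedding"]

print(f"parsed {len(embeddings)} embeddings of dimension {len(embeddings[0])}")
```

From here, the embeddings can be bulk-loaded into a vector store for nearest-neighbor search over the full historical corpus.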

Conclusion

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language foundation models for text embedding and semantic search. With the user interface or just a few lines of code, you can deploy a highly accurate text embedding model and find semantic matches across large datasets, at scale and cost-efficiently. SageMaker JumpStart removes the barriers to implement semantic search by providing instant access to cutting-edge models like the ones benchmarked on the MTEB leaderboard. Businesses and developers can build intelligent search and recommendation systems faster.

This post demonstrated how to find semantically similar questions and answers, which could be applied to RAG use cases, recommendations and personalization, multilingual translations, and more. With continued advances in language models and the simplicity of SageMaker JumpStart, more organizations can infuse generative AI capabilities into their products. As the next step, you can try text-embedding models from SageMaker JumpStart on your own dataset to test and benchmark the results for your RAG use cases.


About the Authors

Dr. Baichuan Sun, currently serving as a Sr. AI/ML Solution Architect at AWS, focuses on generative AI and applies his knowledge in data science and machine learning to provide practical, cloud-based business solutions. With experience in management consulting and AI solution architecture, he addresses a range of complex challenges, including robotics computer vision, time series forecasting, and predictive maintenance, among others. His work is grounded in a solid background of project management, software R&D, and academic pursuits. Outside of work, Dr. Sun enjoys the balance of traveling and spending time with family and friends, reflecting a commitment to both his professional growth and personal well-being.

Hemant Singh is an Applied Scientist with experience in Amazon SageMaker JumpStart. He got his masters from Courant Institute of Mathematical Sciences and B.Tech from IIT Delhi. He has experience in working on a diverse range of machine learning problems within the domain of natural language processing, computer vision, and time series analysis.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Read More

Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks

Amazon Textract’s new Layout feature introduces efficiencies in general purpose and generative AI document processing tasks

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. AnalyzeDocument Layout is a new feature that allows customers to automatically extract layout elements such as paragraphs, titles, subtitles, headers, footers, and more from documents. Layout extends Amazon Textract's word and line detection by automatically grouping the text into these layout elements and sequencing them according to human reading patterns (that is, reading order from left to right and top to bottom).

Building document processing and understanding solutions for financial and research reports, medical transcriptions, contracts, media articles, and so on requires extraction of information present in titles, headers, paragraphs, and so on. For example, when cataloging financial reports in a document database, extracting and storing the title as a catalog index enables easy retrieval. Prior to the introduction of this feature, customers had to construct these elements using post-processing code and the words and lines response from Amazon Textract.

The complexity of implementing this code is amplified for documents with multiple columns and complex layouts. With this announcement, extracting commonly occurring layout elements from documents becomes easier and allows customers to build efficient document processing solutions faster with less code.

In September 2023, Amazon Textract launched the Layout feature that automatically extracts layout elements such as paragraphs, titles, lists, headers, and footers and orders the text and elements as a human would read. We also released the updated version of the open source postprocessing toolkit, purpose-built for Amazon Textract, known as Amazon Textract Textractor.

In this post, we discuss how customers can take advantage of this feature for document processing workloads. We also discuss a qualitative study demonstrating how Layout improves generative artificial intelligence (AI) task accuracy for both abstractive and extractive tasks for document processing workloads involving large language models (LLMs).

Layout elements

Central to the Layout feature of Amazon Textract are the new Layout elements. The LAYOUT feature of AnalyzeDocument API can now detect up to ten different layout elements in a document’s page. These layout elements are represented as block type in the response JSON and contain the confidence, geometry (that is, bounding box and polygon information), and Relationships, which is a list of IDs corresponding to the LINE block type.

  • Title – The main title of the document. Returned as LAYOUT_TITLE block type.
  • Header – Text located in the top margin of the document. Returned as LAYOUT_HEADER block type.
  • Footer – Text located in the bottom margin of the document. Returned as LAYOUT_FOOTER block type.
  • Section Title – The titles below the main title that represent sections in the document. Returned as LAYOUT_SECTION_HEADER block type.
  • Page Number – The page number of the documents. Returned as LAYOUT_PAGE_NUMBER block type.
  • List – Any information grouped together in list form. Returned as LAYOUT_LIST block type.
  • Figure – Indicates the location of an image in a document. Returned as LAYOUT_FIGURE block type.
  • Table – Indicates the location of a table in the document. Returned as LAYOUT_TABLE block type.
  • Key Value – Indicates the location of form key-value pairs in a document. Returned as LAYOUT_KEY_VALUE block type.
  • Text – Text that is present typically as a part of paragraphs in documents. It is a catch all for text that is not present in other elements. Returned as LAYOUT_TEXT block type.

Amazon Textract Layout Elements

Each layout element may contain one or more LINE relationships, and these lines constitute the actual textual content of the layout element (for example, LAYOUT_TEXT is typically a paragraph of text containing multiple LINEs). It is important to note that layout elements appear in the API response in the same reading order as in the document, which makes it easy to construct the layout text from the API's JSON response.
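Because layout blocks arrive in reading order and reference their child LINE blocks by ID, a minimal reconstruction of a page's text needs little more than a dictionary lookup. The following sketch runs on a small hand-built response for illustration; the Textractor toolkit described later handles this, and much more, for you:

```python
def linearize_layout(response):
    """Join child LINE text for each layout block, in reading order."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    parts = []
    for block in response["Blocks"]:
        if not block["BlockType"].startswith("LAYOUT_"):
            continue
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                lines = [blocks_by_id[i]["Text"] for i in rel["Ids"]
                         if blocks_by_id[i]["BlockType"] == "LINE"]
                parts.append(" ".join(lines))
    return "\n\n".join(parts)

# Hand-built miniature response for illustration only
sample = {"Blocks": [
    {"Id": "t1", "BlockType": "LAYOUT_TITLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
    {"Id": "p1", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["l2", "l3"]}]},
    {"Id": "l1", "BlockType": "LINE", "Text": "PHOTONICS FOR A BETTER WORLD"},
    {"Id": "l2", "BlockType": "LINE", "Text": "The executive board of UNESCO"},
    {"Id": "l3", "BlockType": "LINE", "Text": "has endorsed the proposal."},
]}
print(linearize_layout(sample))
```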

Use cases of layout-aware extraction

Following are some of the common use cases for the new AnalyzeDocument LAYOUT feature.

  1. Extracting layout elements for search indexing and cataloging purposes. The contents of the LAYOUT_TITLE or LAYOUT_SECTION_HEADER, along with the reading order, can be used to appropriately tag or enrich metadata. This improves the context of a document in a document repository to improve search capabilities or organize documents.
  2. Summarizing the entire document, or parts of it, by extracting text in proper reading order and using the layout elements.
  3. Extracting specific parts of the document. For example, a document may contain a mix of images with embedded text and other plaintext sections or paragraphs. You can now isolate the text sections using the LAYOUT_TEXT element.
  4. Achieving better performance and more accurate answers for in-context document Q&A and entity extraction using an LLM.

There are other possible document automation use cases where Layout can be useful. However, in this post we explain how to extract layout elements to help you understand how to use the feature for traditional document automation solutions. We discuss the benefits of using Layout for a document Q&A use case with LLMs using a common method known as Retrieval Augmented Generation (RAG), and for an entity extraction use case. For the outcomes of both of these use cases, we present comparative scores that help differentiate the benefits of layout-aware text as opposed to plaintext.

To highlight the benefits, we ran tests to compare how plaintext extracted using raster scans with DetectDocumentText and layout-aware linearized text extracted using AnalyzeDocument with the LAYOUT feature impact the outcome of in-context Q&A outputs by an LLM. For this test, we used Anthropic's Claude Instant model with Amazon Bedrock. For complex document layouts, generating text in the proper reading order and subsequently chunking it appropriately can be challenging. In the following sections, we discuss how to extract layout elements and linearize the text to build an LLM-based application. Specifically, we discuss the comparative evaluation of the responses generated by the LLM for a document Q&A application using raster scan-based plaintext and layout-aware linearized text.

Extracting layout elements from a page

The Amazon Textract Textractor toolkit can process a document through the AnalyzeDocument API with the LAYOUT feature and exposes the detected layout elements through the page's PAGE_LAYOUT property and its subproperties TITLES, HEADERS, FOOTERS, TABLES, KEY_VALUES, PAGE_NUMBERS, LISTS, and FIGURES. Each element has its own visualization function, allowing you to see exactly what was detected. To get started, install Textractor using

pip install amazon-textract-textractor

As demonstrated in the following code snippet, the document news_article.pdf is processed with the AnalyzeDocument API with LAYOUT feature. The response results in a variable document that contains each of the detected Layout blocks from the properties.

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(profile_name="default")

input_document = "./news_article.pdf"

document = extractor.analyze_document(
                   file_source=input_document,
                   features=[TextractFeatures.LAYOUT],
                   save_image=True)

document.pages[0].visualize()
document.pages[0].page_layout.titles.visualize()
document.pages[0].page_layout.headers.visualize()

document.pages[0].page_layout.section_headers.visualize()
document.pages[0].page_layout.footers.visualize()
document.pages[0].page_layout.tables.visualize()
document.pages[0].page_layout.key_values.visualize()
document.pages[0].page_layout.page_numbers.visualize()
document.pages[0].page_layout.lists.visualize()
document.pages[0].page_layout.figures.visualize()

Layout visualization with Amazon Textract Textractor

See a more in-depth example in the official Textractor documentation.

Linearizing text from the layout response

To use the layout capabilities, Amazon Textract Textractor was extensively reworked for the 1.4 release to provide linearization with over 40 configuration options, allowing you to tailor the linearized text output to your downstream use case with little effort. The new linearizer supports all currently available AnalyzeDocument APIs, including forms and signatures, which lets you add selection items to the resulting text without making any code changes.

from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

config = TextLinearizationConfig(
                         hide_figure_layout=True,
                         title_prefix="# ",
                         section_header_prefix="## ")

document = extractor.analyze_document(
                                 file_source=input_document,
                                 features=[TextractFeatures.LAYOUT],
                                 save_image=True)

print(document.get_text(config=config))

See this example and more in the official Textractor documentation.

We have also added a layout pretty printer to the library that allows you to call a single function by passing in the layout API response in JSON format and get the linearized text (by page) in return.

python -m pip install -q amazon-textract-prettyprinter

You have the option to format the text in markdown format, exclude text from within figures in the document, and exclude page header, footer, and page number extractions from the linearized output. You can also store the linearized output in plaintext format in your local file system or in an Amazon S3 location by passing the save_txt_path parameter. The following code snippet demonstrates a sample usage:

from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

textract_json = call_textract(input_document=input_document,
                              features=[Textract_Features.LAYOUT,
                                        Textract_Features.TABLES])
layout = get_text_from_layout_json(textract_json=textract_json,
                                   exclude_figure_text=True,  # optional
                                   exclude_page_header=True,  # optional
                                   exclude_page_footer=True,  # optional
                                   exclude_page_number=True,  # optional
                                   save_txt_path="s3://bucket/prefix")  # optional

full_text = layout[1]
print(full_text)

Evaluating LLM performance metrics for abstractive and extractive tasks

Layout-aware text is found to improve the performance and quality of text generated by LLMs. In particular, we evaluate two types of LLM tasks: abstractive and extractive tasks.

Abstractive tasks refer to assignments that require the AI to generate new text that is not directly found in the source material. Some examples of abstractive tasks include summarization and question answering. For these tasks, we use the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric to evaluate the performance of an LLM on question-answering tasks with respect to a set of ground truth data.

Extractive tasks refer to activities where the model identifies and extracts specific portions of the input text to construct a response. In these tasks, the model is focused on selecting relevant segments (such as sentences, phrases, or keywords) from the source material rather than generating new content. Some examples are named entity recognition (NER) and keyword extraction. For these tasks, we use Average Normalized Levenshtein Similarity (ANLS) on named entity recognition tasks based on the layout-linearized text extracted by Amazon Textract.

ROUGE score analysis on abstractive question-answering task

Our test is set up to perform in-context Q&A on a multicolumn document by extracting the text and then performing RAG to get answer responses from the LLM. We perform Q&A on a set of questions using the raster scan–based raw text and layout-aware linearized text. We then evaluate ROUGE metrics for each question by comparing the machine-generated response to the corresponding ground truth answer. In this case, the ground truth is the same set of questions answered by a human, which is considered as a control group.

In-context Q&A with RAG requires extracting text from the document, creating smaller chunks of the text, generating vector embeddings of the chunks, and subsequently storing them in a vector database. This is done so that the system can perform a relevance search with the question on the vector database to return chunks of text that are most relevant to the question being asked. These relevant chunks are then used to build the overall context and provided to the LLM so that it can accurately answer the question.

The following document, taken from the DocUNet: Document Image Unwarping via a Stacked U-Net dataset, is used for the test. This document is a multicolumn document with headers, titles, paragraphs, and images. We also defined a set of 20 questions answered by a human as a control group or ground truth. The same set of 20 questions was then used to generate responses from the LLM.

Sample document from DocUNet dataset

In the next step, we extract the text from this document using DetectDocumentText API and AnalyzeDocument API with LAYOUT feature. Since most LLMs have a limited token context window, we kept the chunk size small, about 250 characters with a chunk overlap of 50 characters, using LangChain’s RecursiveCharacterTextSplitter. This resulted in two separate sets of document chunks—one generated using the raw text and the other using the layout-aware linearized text. Both sets of chunks were stored in a vector database by generating vector embeddings using the Amazon Titan Embeddings G1 Text embedding model.

Chunking and embedding with Amazon Titan Embeddings G1 Text

The following code snippet generates the raw text from the document.

import textractcaller as tc
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import get_lines_string

plain_textract_json = call_textract(input_document = input_document)
plain_text = get_lines_string(textract_json = plain_textract_json)

print(plain_text)

The output (trimmed for brevity) looks like the following. The text reading order is incorrect due to the lack of layout awareness of the API, and the extracted text spans the text columns.

PHOTONICS FOR A BETTER WORLD
UNESCO ENDORSES
INTERNATIONAL DAY OF LIGHT
First celebration in 2018 will become an annual
reminder of photonics-enabled technologies
T he executive board of the United Nations Educational,
in areas such as science, culture, education, sustainable development,
Scientific, and Cultural Organization (UNESCO) has endorsed
medicine, communications, and energy.
a proposal to establish an annual International Day of Light
The final report of IYL 2015 was delivered to UNESCO in Paris
(IDL) as an extension of the highly successful International Year of
during a special meeting in October 2016. At this event, SPIE member
Light and Light-based Technologies (IYL 2015).
...

The visual of the reading order for raw text extracted by DetectDocumentText can be seen in the following image.

Visualization of raster scan reading order

The following code snippet generates the layout-linearized text from the document using the latest version of the Amazon Textract Textractor Python library.

import textractcaller as tc
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import get_text_from_layout_json

layout_textract_json = call_textract(input_document = input_document,
                                     features = [Textract_Features.LAYOUT])
layout_text = get_text_from_layout_json(textract_json = layout_textract_json)[1]
print(layout_text)

The output (trimmed for brevity) looks like the following. The text reading order is preserved since we used the LAYOUT feature, and the text makes more sense.

PHOTONICS FOR A BETTER WORLD

UNESCO ENDORSES INTERNATIONAL DAY OF LIGHT

First celebration in 2018 will become an annual
reminder of photonics-enabled technologies

T he executive board of the United Nations Educational,
Scientific, and Cultural Organization (UNESCO) has endorsed
a proposal to establish an annual International Day of Light
(IDL) as an extension of the highly successful International Year of
Light and Light-based Technologies (IYL 2015).
The endorsement for a Day of Light has been
embraced by SPIE and other founding partners of
IYL 2015.
...

The visual of the reading order for raw text extracted by AnalyzeDocument with LAYOUT feature can be seen in the following image.

Visualization of layout aware reading order

We performed chunking on both extracted texts separately, with a chunk size of 250 and an overlap of 50.
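As an illustration of what these chunk parameters mean, the following sketch implements a plain sliding-window character splitter; LangChain's RecursiveCharacterTextSplitter additionally prefers to split at separator boundaries such as newlines and spaces, so its chunks align better with sentence and paragraph edges:

```python
def chunk_text(text, chunk_size=250, overlap=50):
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# 600 characters with size 250 and overlap 50 yields chunks of 250, 250, 200
chunks = chunk_text("A" * 600)
print([len(c) for c in chunks])
```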

Next, we generate vector embeddings for the chunks and load them into a vector database in two separate collections. We used open source ChromaDB as our in-memory vector database, with a topK value of 3 for the relevance search. This means that for every question, our relevance search query with ChromaDB returns 3 relevant chunks of text of size 250 each. These three chunks are then used to build the context for the LLM. We intentionally chose a smaller chunk size and smaller topK to build the context for the following specific reasons.

  1. Shorten the overall size of our context since research suggests that LLMs tend to perform better with shorter context, even though the model supports longer context (through a larger token context window).
  2. Smaller overall prompt size results in lower overall text generation model latency. The larger the overall prompt size (which includes the context), the longer it may take the model to generate a response.
  3. Comply with the model’s limited token context window, as is the case with most LLMs.
  4. Cost efficiency since using fewer tokens means lower cost per question for input and output tokens combined.

Note that Anthropic Claude Instant v1 does support a 100,000 token context window via Amazon Bedrock. We intentionally limited ourselves to a smaller chunk size since that also makes the test relevant to models with fewer parameters and overall shorter context windows.

We used ROUGE metrics to evaluate machine-generated text against a reference text (or ground truth), measuring various aspects like the overlap of n-grams, word sequences, and word pairs between the two texts. We chose three ROUGE metrics for evaluation.

  1. ROUGE-1: Compares the overlap of unigrams (single words) between the generated text and a reference text.
  2. ROUGE-2: Compares the overlap of bigrams (two-word sequences) between the generated text and a reference text.
  3. ROUGE-L: Measures the longest common subsequence (LCS) between the generated text and a reference text, focusing on the longest sequence of words that appear in both texts, albeit not necessarily consecutively.
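In practice we compute these scores with a ROUGE library, but a minimal ROUGE-1 implementation (unigram precision, recall, and F1) illustrates what the metric measures:

```python
from collections import Counter

def rouge1(generated, reference):
    """Unigram-overlap precision, recall, and F1 between two texts."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge1("the day of light is annual",
                 "an annual international day of light")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence length.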

ROUGE Score calculations

For our 20 sample questions relevant to the document, we ran Q&A with the raw text and linearized text, respectively, and then ran the ROUGE score analysis. We noticed an almost 50 percent average improvement in precision overall, and a significant improvement in F1 scores when layout-linearized text was compared to ground truth, as opposed to raw text.

This suggests that the model became better at generating correct responses with the help of linearized text and smaller chunking: precision increased, and the balance between precision and recall shifted favorably toward precision, raising the F1 score. It's essential to consider the practical implications of these metric changes. For instance, in a scenario where false positives are costly, the increase in precision is highly beneficial.

ROUGE plot on Q&A task result with Layout

ANLS score analysis on extractive tasks over academic datasets

We measure the ANLS or the Average Normalized Levenshtein Similarity, which is an edit distance metric that was introduced by the paper Scene Text Visual Question Answering and aims to softly penalize minor OCR imperfections while considering the model’s reasoning abilities at the same time. This metric is a derivative version of traditional Levenshtein distance, which is a measure of the difference between two sequences (such as strings). It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
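Following that definition, the following sketch computes ANLS for a single prediction: the Levenshtein distance is normalized by the length of the longer string, and answers whose normalized distance exceeds the usual 0.5 threshold score zero:

```python
def levenshtein(a, b):
    """Minimum single-character edits (insert, delete, substitute) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls_single(prediction, answers, threshold=0.5):
    """Best normalized similarity of a prediction against any ground-truth answer."""
    best = 0.0
    for ans in answers:
        a, p = ans.lower().strip(), prediction.lower().strip()
        nl = levenshtein(p, a) / max(len(p), len(a), 1)
        best = max(best, 1 - nl if nl < threshold else 0.0)
    return best

print(anls_single("Internation Day of Light", ["International Day of Light"]))
```

The dataset-level ANLS reported below is this per-question score averaged over all questions.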

For our ANLS tests, we performed an NER task where the LLM was prompted to extract the exact value from the OCR-extracted text. The two academic datasets used for the tests are DocVQA and InfographicVQA. We used zero-shot prompting to attempt extraction of key entities. The prompt used for the LLMs is of the following structure.

template = """You are asked to answer a question using only the provided Document.

The answer to the question should be taken as-is from the document and as short as possible.

Document:\n{document}

Question: {question}

Extract the answer from the document with as few words as possible."""

Accuracy improvements were observed in all document question-answering datasets tested with the open source FlanT5-XL model when using layout-aware linearized text, as opposed to raw text (raster scan), in response to zero-shot prompts. In the InfographicVQA dataset, using layout-aware linearized text enables the smaller 3B parameter FlanT5-XL model to match the performance of the larger FlanT5-XXL model (on raw text), which has nearly four times as many parameters (11B).

Dataset          FlanT5-XL (3B), ANLS*                   FlanT5-XXL (11B), ANLS*
                 Raster    Layout-aware    Δ             Raster    Layout-aware    Δ
DocVQA           66.03%    68.46%          2.43%         70.71%    72.05%          1.34%
InfographicsVQA  29.47%    35.76%          6.29%         37.82%    45.61%          7.79%

* ANLS is measured on text extracted by Amazon Textract, not the provided document transcription

Conclusion

The launch of Layout marks a significant advancement in using Amazon Textract to build document automation solutions. As discussed in this post, Layout uses traditional and generative AI methods to improve efficiencies when building a wide variety of document automation solutions such as document search, contextual Q&A, summarization, key-entities extraction, and more. As we continue to embrace the power of AI in building document processing and understanding systems, these enhancements will no doubt pave the way for more streamlined workflows, higher productivity, and more insightful data analysis.

For more information on the Layout feature and how to take advantage of the feature for document automation solutions, refer to AnalyzeDocument, Layout analysis, and Text linearization for generative AI applications documentation.


About the Authors

Anjan Biswas is a Senior AI Services Solutions Architect who focuses on computer vision, NLP, and generative AI. Anjan is part of the worldwide AI services specialist team and works with customers to help them understand and develop solutions to business problems with AWS AI Services and generative AI.

Lalita Reddi is a Senior Technical Product Manager with the Amazon Textract team. She is focused on building machine learning–based services for AWS customers. In her spare time, Lalita likes to play board games and go on hikes.

Edouard Belval is a Research Engineer in the computer vision team at AWS. He is the main contributor behind the Amazon Textract Textractor library.

Read More