Accelerating data science innovation: How Bayer Crop Science used AWS AI/ML services to build their next-generation MLOps service


The world’s population is expanding at a rapid rate. The growing global population requires innovative solutions to produce food, fiber, and fuel, while restoring natural resources like soil and water and addressing climate change. Bayer Crop Science estimates farmers need to increase crop production by 50% by 2050 to meet these demands. To support their mission, Bayer Crop Science is collaborating with farmers and partners to promote and scale regenerative agriculture—a future where farming can produce more while restoring the environment.

Regenerative agriculture is a sustainable farming philosophy that aims to improve soil health by incorporating nature to create healthy ecosystems. It’s based on the idea that agriculture should restore degraded soils and reverse degradation, rather than sustain current conditions. The Crop Science Division at Bayer believes regenerative agriculture is foundational to the future of farming. Their vision is to produce 50% more food by restoring nature and scaling regenerative agriculture. To make this mission a reality, Bayer Crop Science is driving model training with Amazon SageMaker and accelerating code documentation with Amazon Q.

In this post, we show how Bayer Crop Science manages large-scale data science operations by training models for their data analytics needs and maintaining high-quality code documentation to support developers. Through these solutions, Bayer Crop Science projects up to a 70% reduction in developer onboarding time and up to a 30% improvement in developer productivity.

Challenges

Bayer Crop Science faced the challenge of scaling genomic predictive modeling to increase its speed to market. It also needed data scientists to focus on building high-value foundation models (FMs) rather than on constructing and engineering the solution itself. Before the Decision Science Ecosystem was built, provisioning a data science environment could take days for a data team within Bayer Crop Science.

Solution overview

Bayer Crop Science’s Decision Science Ecosystem (DSE) is a next-generation machine learning operations (MLOps) solution built on AWS to accelerate data-driven decision making for data science teams at scale across the organization. AWS services assist Bayer Crop Science in creating a connected decision-making system accessible to thousands of data scientists. The company is using the solution for generative AI, product pipeline advancements, geospatial imagery analytics of field data, and large-scale genomic predictive modeling that will allow Bayer Crop Science to become more data-driven and increase speed to market. The solution supports data scientists at every step, from ideation to model output, and captures the complete record of business decisions made using DSE. Other divisions within Bayer are also beginning to build a similar solution on AWS based on the success of DSE.

Bayer Crop Science teams’ DSE integrates cohesively with SageMaker, a fully managed service that lets data scientists quickly build, train, and deploy machine learning (ML) models for different use cases and make data-informed decisions faster. This boosts collaboration within Bayer Crop Science across product supply, R&D, and commercial. Their data science strategy no longer depends on self-service data engineering; instead, it provides an effective resource to drive fast data engineering at scale. Bayer Crop Science chose SageMaker because it provides a single cohesive experience where data scientists can focus on building high-value models, without having to worry about constructing and engineering the resource itself. With the help of AWS services, cross-functional teams can align quickly to reduce operational costs by minimizing redundancy, addressing bugs early and often, and quickly identifying issues in automated workflows. The DSE solution uses SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), AWS Lambda, and Amazon Simple Storage Service (Amazon S3) to accelerate innovation at Bayer Crop Science and to create a customized, seamless, end-to-end user experience.

The following diagram illustrates the DSE architecture.

Comprehensive AWS DSE architecture showing data scientist workflow from platform through deployment, with monitoring and security controls

Solution walkthrough

Bayer Crop Science had two key challenges in managing large-scale data science operations: maintaining high-quality code documentation and optimizing existing documentation across multiple repositories. With Amazon Q, Bayer Crop Science tackled both challenges, which empowered them to onboard developers more rapidly and improve developer productivity.

The company’s first use case focused on automatically creating high-quality code documentation. When a developer pushes code to a GitHub repository, a webhook—a lightweight, event-driven communication that automatically sends data between applications using HTTP—triggers a Lambda function through Amazon API Gateway. This function then uses Amazon Q to analyze the code changes and generate comprehensive documentation and change summaries. The updated documentation is then stored in Amazon S3. The same Lambda function also creates a pull request with the AI-generated summary of code changes. To maintain security and flexibility, Bayer Crop Science uses Parameter Store, a capability of AWS Systems Manager, to manage prompts for Amazon Q, allowing for quick updates without redeployment, and AWS Secrets Manager to securely handle repository tokens.
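The following is a minimal sketch of what such a webhook-triggered Lambda handler could look like. The parameter name, secret name, bucket, and the generate_documentation helper are hypothetical placeholders rather than Bayer Crop Science’s actual implementation, which calls Amazon Q to produce the documentation.

import json
import os

import boto3

ssm = boto3.client("ssm")
secrets = boto3.client("secretsmanager")
s3 = boto3.client("s3")

# Hypothetical names used for illustration only
PROMPT_PARAMETER = "/docgen/amazon-q-prompt"
GITHUB_TOKEN_SECRET = "docgen/github-token"
DOCS_BUCKET = os.environ.get("DOCS_BUCKET", "example-docs-bucket")


def generate_documentation(prompt, repo, commits, token):
    """Hypothetical helper: in the real solution this step calls Amazon Q with the
    prompt and the changed files; here it returns a placeholder summary."""
    changed = [path for commit in commits for path in commit.get("modified", [])]
    return f"# Change summary for {repo}\n\n" + "\n".join(changed)


def handler(event, context):
    """Invoked by Amazon API Gateway when the GitHub webhook fires on a push."""
    payload = json.loads(event["body"])
    repo = payload["repository"]["full_name"]

    # The prompt lives in Parameter Store so it can be tuned without redeploying
    prompt = ssm.get_parameter(Name=PROMPT_PARAMETER)["Parameter"]["Value"]

    # The repository token is resolved at runtime from Secrets Manager
    token = secrets.get_secret_value(SecretId=GITHUB_TOKEN_SECRET)["SecretString"]

    documentation = generate_documentation(prompt, repo, payload.get("commits", []), token)

    # Persist the generated documentation for later ingestion and search
    s3.put_object(
        Bucket=DOCS_BUCKET,
        Key=f"{repo}/{payload['after']}.md",
        Body=documentation.encode("utf-8"),
    )
    return {"statusCode": 200, "body": json.dumps({"repository": repo})}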

This automation significantly reduces the time developers spend creating documentation and pull request descriptions. The generated documentation is also ingested into Amazon Q, so developers can quickly answer questions they have about a repository and onboard onto projects.

The second use case addresses the challenge of maintaining and improving existing code documentation quality. An AWS Batch job, triggered by Amazon EventBridge, processes the code repository. Amazon Q generates new documentation for each code file, which is then indexed along with the source code. The system also generates high-level documentation for each module or functionality and compares the AI-generated documentation with existing human-written documentation. This process makes it possible for Bayer Crop Science to systematically evaluate and enhance their documentation quality over time.
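As a rough illustration, a scheduled EventBridge rule targeting the Batch job queue might be wired up as follows. The rule name, ARNs, and job names are assumptions for the sketch, not the actual resources used by Bayer Crop Science.

import boto3

events = boto3.client("events")

# Hypothetical names and ARNs for illustration only
RULE_NAME = "docs-quality-nightly"
JOB_QUEUE_ARN = "arn:aws:batch:us-east-1:111122223333:job-queue/docs-quality-queue"
JOB_DEFINITION = "docs-quality-job:1"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/docs-quality-events-role"

# Run the documentation-quality batch job once a day
events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(1 day)")

events.put_targets(
    Rule=RULE_NAME,
    Targets=[
        {
            "Id": "docs-quality-batch-job",
            "Arn": JOB_QUEUE_ARN,
            "RoleArn": EVENTS_ROLE_ARN,
            "BatchParameters": {
                "JobDefinition": JOB_DEFINITION,
                "JobName": "docs-quality-run",
            },
        }
    ],
)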

To improve search capabilities, Bayer Crop Science added repository names as custom attributes in the Amazon Q index and prefixed them to indexed content. This enhancement improved the accuracy and relevance of documentation searches. The development team also implemented strategies to handle API throttling and variability in AI responses, maintaining robustness in production environments. Bayer Crop Science is considering developing a management plane to streamline the addition of new repositories and centralize the management of settings, tokens, and prompts. This would further enhance the scalability and ease of use of the system.
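A minimal sketch of that indexing step is shown below, assuming an Amazon Q Business application and index are in place. The application ID, index ID, attribute name, and prefixing convention are illustrative assumptions rather than details from Bayer Crop Science’s implementation.

import boto3

qbusiness = boto3.client("qbusiness")

# Hypothetical identifiers for illustration only
APPLICATION_ID = "example-q-application-id"
INDEX_ID = "example-q-index-id"


def index_documentation(repo_name: str, doc_id: str, doc_text: str) -> None:
    """Index a documentation file with the repository name attached both as a
    custom attribute and as a prefix on the content, so repository-scoped
    searches rank the right documents higher."""
    qbusiness.batch_put_document(
        applicationId=APPLICATION_ID,
        indexId=INDEX_ID,
        documents=[
            {
                "id": doc_id,
                "title": f"{repo_name}: {doc_id}",
                "attributes": [
                    {"name": "repository", "value": {"stringValue": repo_name}}
                ],
                "content": {"blob": f"[{repo_name}] {doc_text}".encode("utf-8")},
                "contentType": "PLAIN_TEXT",
            }
        ],
    )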

Organizations looking to replicate Bayer Crop Science’s success can implement similar webhook-triggered documentation generation, use Amazon Q Business for both generating and evaluating documentation quality, and integrate the solution with existing version control and code review processes. By using AWS services like Lambda, Amazon S3, and Systems Manager, companies can create a scalable and manageable architecture for their documentation needs. Amazon Q Developer also helps organizations further accelerate their development timelines by providing real-time code suggestions and a built-in next-generation chat experience.

“One of the lessons we’ve learned over the last 10 years is that we want to write less code. We want to focus our time and investment on only the things that provide differentiated value to Bayer, and we want to leverage everything we can that AWS provides out of the box. Part of our goal is reducing the development cycles required to transition a model from proof-of-concept phase, to production, and ultimately business adoption. That’s where the value is.”

– Will McQueen, VP, Head of CS Global Data Assets and Analytics at Bayer Crop Science.

Summary

Bayer Crop Science’s approach aligns with modern MLOps practices, enabling data science teams to focus more on high-value modeling tasks rather than time-consuming documentation processes and infrastructure management. By adopting these practices, organizations can significantly reduce the time and effort required for code documentation while improving overall code quality and team collaboration.

Learn more about Bayer Crop Science’s generative AI journey, and discover how Bayer Crop Science is redesigning sustainable practices through cutting-edge technology.

About Bayer

Bayer is a global enterprise with core competencies in the life science fields of health care and nutrition. In line with its mission, “Health for all, Hunger for none,” the company’s products and services are designed to help people and the planet thrive by supporting efforts to understand the major challenges presented by a growing and aging global population. Bayer is committed to driving sustainable development and generating a positive impact with its businesses. At the same time, Bayer aims to increase its earning power and create value through innovation and growth. The Bayer brand stands for trust, reliability, and quality throughout the world. In fiscal 2023, the Group employed around 100,000 people and had sales of 47.6 billion euros. R&D expenses before special items amounted to 5.8 billion euros. For more information, go to www.bayer.com.


About the authors

Lance Smith is a Senior Solutions Architect and part of the Global Healthcare and Life Sciences industry division at AWS. He has spent the last 2 decades helping life sciences companies apply technology in pursuit of their missions to help patients. Outside of work, he loves traveling, backpacking, and spending time with his family.

Kenton Blacutt is an AI Consultant within the Amazon Q Customer Success team. He works hands-on with customers, helping them solve real-world business problems with cutting-edge AWS technologies. In his free time, he likes to travel and run an occasional marathon.

Karthik Prabhakar is a Senior Applications Architect within the AWS Professional Services team. In this role, he collaborates with customers to design and implement cutting-edge solutions for their mission-critical business systems, focusing on areas such as scalability, reliability, and cost optimization in digital transformation and modernization projects.

Jake Malmad is a Senior DevOps Consultant within the AWS Professional Services team, specializing in infrastructure as code, security, containers, and orchestration. As a DevOps consultant, he uses this expertise to work collaboratively with customers, architecting and implementing solutions for automation, scalability, reliability, and security across a wide variety of cloud adoption and transformation engagements.

Nicole Brown is a Senior Engagement Manager within the AWS Professional Services team based in Minneapolis, MN. With over 10 years of professional experience, she has led multidisciplinary, global teams across the healthcare and life sciences industries. She is also a supporter of women in tech and currently holds a board position within the Women at Global Services affinity group.


Combat financial fraud with GraphRAG on Amazon Bedrock Knowledge Bases

Financial fraud detection isn’t just important to banks—it’s essential. With global fraud losses surpassing $40 billion annually and sophisticated criminal networks constantly evolving their tactics, financial institutions face an increasingly complex threat landscape. Today’s fraud schemes operate across multiple accounts, institutions, and channels, creating intricate webs designed specifically to evade detection systems.

Financial institutions have invested heavily in detection capabilities, but the core challenge remains: how to connect the dots across fragmented information landscapes where the evidence of fraud exists not within individual documents or transactions, but in the relationships between them.

In this post, we show how to use Amazon Bedrock Knowledge Bases GraphRAG with Amazon Neptune Analytics to build a financial fraud detection solution.

The limitations of traditional RAG systems

In recent years, Retrieval Augmented Generation (RAG) has emerged as a promising approach for building AI systems grounded in organizational knowledge. However, traditional RAG-based systems have limitations when it comes to complex financial fraud detection.

The fundamental limitation lies in how conventional RAG processes information. Standard RAG retrieves and processes document chunks as isolated units, looking for semantic similarities between a query and individual text passages. This approach works well for straightforward information retrieval, but falls critically short in the following scenarios:

  • Evidence is distributed across multiple documents and systems
  • The connections between entities matter more than the entities themselves
  • Complex relationship chains require multi-hop reasoning
  • Structural context (like hierarchical document organization) provides critical clues
  • Entity resolution across disparate references is essential

A fraud analyst intuitively follows connection paths—linking an account to a phone number, that phone number to another customer, and that customer to a known fraud ring. Traditional RAG systems, however, lack this relational reasoning capability, leaving sophisticated fraud networks undetected until losses have already occurred.

Amazon Bedrock Knowledge Bases with GraphRAG for financial fraud detection

Amazon Bedrock Knowledge Bases GraphRAG helps financial institutions implement fraud detection systems without building complex graph infrastructure from scratch. By offering a fully managed service that seamlessly integrates knowledge graph construction, maintenance, and querying with powerful foundation models (FMs), Amazon Bedrock Knowledge Bases dramatically lowers the technical barriers to implementing relationship-aware fraud detection. Financial organizations can now use their existing transaction data, customer profiles, and risk signals within a graph context that preserves the critical connections between entities while benefiting from the natural language understanding of FMs. This powerful combination enables fraud analysts to query complex financial relationships using intuitive natural language to detect suspicious patterns that can result in financial fraud.

Example fraud detection use case

To demonstrate this use case, we use a fictitious bank (AnyCompany Bank) in Australia whose customers hold savings, checking, and credit card accounts with the bank. These customers perform transactions to buy goods and services from merchants across the country using their debit and credit cards. AnyCompany Bank is looking to use the latest advancements in GraphRAG and generative AI technologies to detect subtle patterns in fraudulent behavior that will yield higher accuracy and reduce false positives.

A fraud analyst at AnyCompany Bank wants to use natural language queries to get answers to the following types of questions:

  • Basic queries – For example, “Show me all the transactions processed by ABC Electronics” or “What accounts does Michael Green own?”
  • Relationship exploration queries – For example, “Which devices have accessed account A003?” or “Show all relationships between Jane Smith and her devices.”
  • Temporal pattern detection queries – For example, “Which accounts had transactions and device access on the same day?” or “Which accounts had transactions outside their usual location pattern?”
  • Fraud detection queries – For example, "Find unusual transaction amounts compared to account history" or "Are there any accounts with failed transactions followed by successful ones within 24 hours?"

Solution overview

To help illustrate the core GraphRAG principles, we have simplified the data model to six key tables: accounts, transactions, individuals, devices, merchants, and relationships. Real-world financial fraud detection systems are much more complex, with hundreds of entity types and intricate relationships, but this example demonstrates the essential concepts that scale to enterprise implementations. The following figure is an example of the accounts table.

The following figure is an example of the individuals table.

The following figure is an example of the devices table.

The following figure is an example of the transactions table.

The following figure is an example of the merchants table.

The following figure is an example of the relationships table.

The following diagram shows the relationships among these entities: accounts, individuals, devices, transactions, and merchants. For example, the individual John Doe uses device D001 to access account A001 to execute transaction T001, which is processed by merchant ABC Electronics.

In the following sections, we demonstrate how to upload documents to Amazon Simple Storage Service (Amazon S3), create a knowledge base using Amazon Bedrock Knowledge Bases, and test the knowledge base by running natural language queries.

Prerequisites

To follow along with this post, make sure you have an active AWS account with appropriate permissions to access Amazon Bedrock and create an S3 bucket to be the data source. Additionally, verify that you have enabled access to both Anthropic’s Claude 3.5 Haiku and an embeddings model, such as Amazon Titan Text Embeddings V2.

Upload documents to Amazon S3

In this step, you create an S3 bucket as the data source and upload the six tables (accounts, individuals, devices, transactions, merchants, and relationships) as Excel data sheets. The following screenshot shows our S3 bucket and its contents.
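The upload can also be scripted with the AWS SDK. The following sketch assumes a hypothetical bucket name and that the six sheets are saved locally with the file names shown.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; the six sheets match the tables described earlier
BUCKET = "anycompany-bank-fraud-graphrag-source"
FILES = [
    "accounts.xlsx",
    "individuals.xlsx",
    "devices.xlsx",
    "transactions.xlsx",
    "merchants.xlsx",
    "relationships.xlsx",
]

for file_name in FILES:
    # Upload each sheet to the bucket that the knowledge base will use as its data source
    s3.upload_file(Filename=file_name, Bucket=BUCKET, Key=file_name)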

Create a knowledge base

Complete the following steps to create the knowledge base:

  1. On the Amazon Bedrock console, choose Knowledge Bases under Builder tools in the navigation pane.
  2. Choose Create, and then choose Knowledge Base with vector store.

  3. In the Knowledge Base details section, provide the following information:
    1. Enter a meaningful name for the knowledge base.
    2. For IAM permissions, select Create and use a new service role to create a new AWS Identity and Access Management (IAM) role.
    3. For Choose data source, select Amazon S3.
    4. Choose Next.

  4. In the Configure data source section, provide the following information:
    1. Enter a data source name.
    2. For Data source location, select the location of your data source (for example, we select This AWS account).
    3. For S3 source, choose Browse S3 and choose the location where you uploaded the files.
    4. For Parsing strategy, select Amazon Bedrock default parser.
    5. For Chunking strategy, choose Default chunking.
    6. Choose Next.

  5. In the Configure data storage and processing section, provide the following information:
    1. For Embeddings model, choose Titan Text Embeddings V2.
    2. For Vector store creation method, select Quick create a new vector store.
    3. For Vector store type, select Amazon Neptune Analytics (GraphRAG).
    4. Choose Next.

Amazon Bedrock uses Anthropic’s Claude 3 Haiku v1 as the FM to automatically build the graph for our knowledge base. This automatically enables contextual enrichment.

  6. Choose Create knowledge base.
  7. Choose the knowledge base when it’s in Available status.

  8. Select the data source and choose Sync, then wait for the sync process to complete.

In the sync process, Amazon Bedrock ingests data files from Amazon S3, creates chunks and embeddings, and automatically extracts entities and relationships, creating the graph.

Test the knowledge base and run natural language queries

When the sync is complete, you can test the knowledge base.

  1. In the Test Knowledge Base section, choose Select model.
  2. Set the model as Anthropic’s Claude 3.5 Haiku (or another model of your choice) and then choose Apply.

  3. Enter a sample query and choose Run.

Let’s start with some basic queries, such as “Show me all transactions processed by ABC Electronics” or “What accounts does Michael Green own?” The generated responses are shown in the following screenshot.

We can also run some relationship exploration queries, such as "Which devices have accessed account A003?" or "Show all relationships between Jane Smith and her devices." The generated responses are shown in the following screenshot. To arrive at the response, the model performs multi-hop reasoning, traversing multiple files.

The model can also perform temporal pattern detection queries, such as “Which accounts had transactions and device access on the same day?” or “Which accounts had transactions outside their usual location pattern?” The generated responses are shown in the following screenshot.

Let’s try out some fraud detection queries, such as “Find unusual transaction amounts compared to account history” or “Are there any accounts with failed transactions followed by successful ones within 24 hours?” The generated responses are shown in the following screenshot.

The GraphRAG solution also enables complex relationship queries, such as “Show the complete path from Emma Brown to Pacific Fresh Market” or “Map all connections between the individuals and merchants in the system.” The generated responses are shown in the following screenshot.
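Outside the console, the same natural language queries can be issued programmatically through the RetrieveAndGenerate API. The following sketch assumes a hypothetical knowledge base ID and a Claude 3.5 Haiku model ARN; substitute your own values.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Hypothetical identifiers; replace with your knowledge base ID and preferred model ARN
KNOWLEDGE_BASE_ID = "EXAMPLEKBID"
MODEL_ARN = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-haiku-20241022-v1:0"

query = "Are there any accounts with failed transactions followed by successful ones within 24 hours?"

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": query},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": KNOWLEDGE_BASE_ID,
            "modelArn": MODEL_ARN,
        },
    },
)

# Print the generated answer grounded in the knowledge base
print(response["output"]["text"])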

Clean up

To avoid incurring additional costs, clean up the resources you created. This includes deleting the Amazon Bedrock knowledge base, its associated IAM role, and the S3 bucket used for source documents. Additionally, you must separately delete the Neptune Analytics graph that was automatically created by Amazon Bedrock Knowledge Bases during the setup process.

Conclusion

GraphRAG in Amazon Bedrock emerges as a game-changing feature in the fight against financial fraud. By automatically connecting relationships across transaction data, customer profiles, historical patterns, and fraud reports, it significantly enhances financial institutions’ ability to detect complex fraud schemes that traditional systems might miss. Its unique capability to understand and link information across multiple documents and data sources proves invaluable when investigating sophisticated fraud patterns that span various touchpoints and time periods.

For financial institutions and fraud detection teams, GraphRAG’s intelligent document processing means faster, more accurate fraud investigations. It can quickly piece together related incidents, identify common patterns in fraud reports, and connect seemingly unrelated activities that might indicate organized fraud rings. This deeper level of insight, combined with its ability to provide comprehensive, context-aware responses, enables security teams to stay one step ahead of fraudsters who continuously evolve their tactics.

As financial crimes become increasingly sophisticated, GraphRAG in Amazon Bedrock stands as a powerful tool for fraud prevention, transforming how you can analyze, connect, and act on fraud-related information. The future of fraud detection demands tools that can think and connect like humans—and GraphRAG is leading the way in making this possible.


About the Authors

Senaka Ariyasinghe is a Senior Partner Solutions Architect at AWS. He collaborates with Global Systems Integrators to drive cloud innovation across the Asia-Pacific and Japan region. He specializes in helping AWS partners develop and implement scalable, well-architected solutions, with particular emphasis on generative AI, machine learning, cloud migration strategies, and the modernization of enterprise applications.

Senthil Nathan is a Senior Partner Solutions Architect working with Global Systems Integrators at AWS. In his role, Senthil works closely with global partners to help them maximize the value and potential of the AWS Cloud landscape. He is passionate about using the transformative power of cloud computing and emerging technologies to drive innovation and business impact.

Deependra Shekhawat is a Senior Energy and Utilities Industry Specialist Solutions Architect based in Sydney, Australia. In his role, Deependra helps energy companies across the Asia-Pacific and Japan region use cloud technologies to drive sustainability and operational efficiency. He specializes in creating robust data foundations and advanced workflows that enable organizations to harness the power of big data, analytics, and machine learning for solving critical industry challenges.

Aaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large-scale complex integration and event-driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI-assisted business automation.

Ozan Eken is a Product Manager at AWS, passionate about building cutting-edge generative AI and graph analytics products. With a focus on simplifying complex data challenges, Ozan helps customers unlock deeper insights and accelerate innovation. Outside of work, he enjoys trying new foods, exploring different countries, and watching soccer.

JaiPrakash Dave is a Partner Solutions Architect working with Global Systems Integrators at AWS based in India. In his role, JaiPrakash guides AWS partners in the India region to design and scale well-architected solutions, focusing on generative AI, machine learning, DevOps, and application and data modernization initiatives.


Classify call center conversations with Amazon Bedrock batch inference

In this post, we demonstrate how to build an end-to-end solution for text classification using the Amazon Bedrock batch inference capability with the Anthropic’s Claude Haiku model. Amazon Bedrock batch inference offers a 50% discount compared to the on-demand price, which is an important factor when dealing with a large number of requests. We walk through classifying travel agency call center conversations into categories, showcasing how to generate synthetic training data, process large volumes of text data, and automate the entire workflow using AWS services.

Challenges with high-volume text classification

Organizations across various sectors face a common challenge: the need to efficiently handle high-volume classification tasks. From travel agency call centers categorizing customer inquiries to sales teams analyzing lost opportunities and finance departments classifying invoices, these manual processes are a daily necessity. But these tasks come with significant challenges.

The manual approach to analyzing and categorizing these classification requests is not only time-intensive but also prone to inconsistencies. As teams process the high volume of data, the potential for errors and inefficiencies grows. By implementing automated systems to classify these interactions, multiple departments stand to gain substantial benefits. They can uncover hidden trends in their data, significantly enhance the quality of their customer service, and streamline their operations for greater efficiency.

However, the path to effective automated classification has its own challenges. Organizations must grapple with the complexities of efficiently processing vast amounts of textual information while maintaining consistent accuracy in their classification results. In this post, we demonstrate how to create a fully automated workflow while keeping operational costs under control.

Data

For this solution, we used synthetic call center conversation data. For realistic training data that maintains user privacy, we generated synthetic conversations using Anthropic’s Claude 3.7 Sonnet. We used the following prompt to generate the synthetic data:

Task: Generate <N> synthetic conversations from customer calls to an imaginary travel
company. Come up with 10 most probable categories that calls of this nature can come 
from and treat them as classification categories for these calls. For each generated 
call create a column that indicates the category for that call. 
Conversations should follow the following format:
"User: ...
Agent: ...
User: ...
Agent: ...
...

Class: One of the 10 following categories that is most relevant to the conversation."
Ten acceptable classes:
1. Booking Inquiry - Customer asking about making new reservations
2. Reservation Change - Customer wanting to modify existing bookings
3. Cancellation Request - Customer seeking to cancel their travel plans
4. Refund Issues - Customer inquiring about getting money back
5. Travel Information - Customer seeking details about destinations, documentation, etc.
6. Complaint - Customer expressing dissatisfaction with service
7. Payment Problem - Customer having issues with billing or payments
8. Loyalty Program - Customer asking about rewards points or membership status
9. Special Accommodation - Customer requesting special arrangements
10. Technical Support - Customer having issues with website, app or booking systems

Instructions:
- Keep conversations concise
- Use John Doe for male names and Jane Doe for female names
- Use john.doe@email.com for male email address, jane.doe@email.com for female email 
address and corporate@email.com for corporate email address, whenever you need to 
generate emails. Use " or ' instead of " whenever there is a quote within the
conversation
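Generating the conversations can be scripted against Amazon Bedrock with the Converse API. The following sketch assumes the prompt above is saved to a local file and that a Claude 3.7 Sonnet model ID is enabled in your account; both are assumptions for illustration, not part of the solution’s repository code.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Assumed model ID; use any Anthropic Claude model you have access to
MODEL_ID = "anthropic.claude-3-7-sonnet-20250219-v1:0"

# The prompt shown above, saved locally, with <N> replaced by the desired count
with open("synthetic_data_prompt.txt") as f:
    synthetic_data_prompt = f.read().replace("<N>", "50")

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": synthetic_data_prompt}]}],
    inferenceConfig={"maxTokens": 4096, "temperature": 0.7},
)

# The generated conversations arrive as plain text in the first content block
print(response["output"]["message"]["content"][0]["text"])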

The synthetic dataset includes the following information:

  • Customer inquiries about flight bookings
  • Hotel reservation discussions
  • Travel package negotiations
  • Customer service complaints
  • General travel inquiries

Solution overview

The solution architecture uses a serverless, event-driven, scalable design to effectively handle and classify large quantities of classification requests. Built on AWS, it automatically starts working when new classification request data arrives in an Amazon Simple Storage Service (Amazon S3) bucket. The system then uses Amazon Bedrock batch processing to analyze and categorize the content at scale, minimizing the need for constant manual oversight.

The following diagram illustrates the solution architecture.

Architecture Diagram

The architecture follows a well-structured flow that facilitates reliable processing of classification requests:

  • Data preparation – The process begins when the user or application submits classification requests into the S3 bucket (Step 1). These requests are ingested into an Amazon Simple Queue Service (Amazon SQS) queue, providing a reliable buffer for incoming data and making sure no requests are lost during peak loads. A serverless data processor, implemented using an AWS Lambda function, reads messages from the queue and begins its data processing work (Step 2). It prepares the data for batch inference, crafting it into the JSONL format with schema that Amazon Bedrock requires. It stores files in a separate S3 bucket to maintain a clear separation from the original S3 bucket shared with the customer’s application, enhancing security and data management.
  • Batch inference – When the data arrives in the S3 bucket, it initiates a notification to an SQS queue. This queue activates the Lambda function batch initiator, which starts the batch inference process. The function submits Amazon Bedrock batch inference jobs through the CreateModelInvocationJob API (Step 3). This initiator acts as the bridge between the queued data and the powerful classification capabilities of Amazon Bedrock. Amazon Bedrock then efficiently processes the data in batches. This batch processing approach allows for optimal use of resources while maintaining high throughput. When Amazon Bedrock completes its task, the classification results are stored in an output S3 bucket (Step 4) for postprocessing and analysis.
  • Classification results processing – After classification is complete, the system processes the results through another SQS queue (Step 5) and specialized Lambda function, which organizes the classifications into simple-to-read files, such as CSV, JSON, or XLSX (Step 6). These files are immediately available to both the customer’s applications and support teams who need to access this information (Step 7).
  • Analytics – We built an analytics layer that automatically catalogs and organizes the classification results, transforming raw classification data into actionable insights. An AWS Glue crawler catalogs everything so it can be quickly found later (Step 8). Now your business teams can use Amazon Athena to run SQL queries against the data, uncovering patterns and trends in the classified categories. We also built an Amazon QuickSight dashboard that provides visualization capabilities, so stakeholders can transform datasets into actionable reports ready for decision-making. (Step 9).

We use AWS best practices in this solution, including event-driven and batch processing for optimal resource utilization, batch operations for cost-effectiveness, decoupled components for independent scaling, and least privilege access patterns. We implemented the system using the AWS Cloud Development Kit (AWS CDK) with TypeScript for infrastructure as code (IaC) and Python for application logic, making sure we achieve seamless automation, dynamic scaling, and efficient processing of classification requests, positioning it to effectively address both current requirements and future demands.
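For illustration, the batch initiator’s call to the CreateModelInvocationJob API could look roughly like the following. The job name, role ARN, model ID, and S3 URIs are placeholders, not the values used in the solution’s repository.

import boto3

bedrock = boto3.client("bedrock")

# Hypothetical names and ARNs for illustration only
response = bedrock.create_model_invocation_job(
    jobName="travel-call-classification-batch-001",
    roleArn="arn:aws:iam::111122223333:role/bedrock-batch-inference-role",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    inputDataConfig={
        "s3InputDataConfig": {
            "s3Uri": "s3://example-internal-bucket/batch-input/records.jsonl"
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": "s3://example-internal-bucket/batch-output/"
        }
    },
)

# The job ARN can be used to track batch inference progress
print(response["jobArn"])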

Prerequisites

To perform the solution, you must have the following prerequisites:

  • An active AWS account.
  • An AWS Region from the list of batch inference supported Regions for Amazon Bedrock.
  • Access to your selected models hosted on Amazon Bedrock. Make sure the selected model has been enabled in Amazon Bedrock. The solution is configured to use Anthropic’s Claude 3 Haiku by default.
  • Sign up for QuickSight in the same Region where the main application will be deployed. While subscribing, make sure to configure access to Athena and Amazon S3.
  • In QuickSight, create a group named quicksight-access for managing dashboard access permissions. Make sure to add your own role to this group so you can access the dashboard after it’s deployed. If you use a different group name, modify the corresponding name in the code accordingly.
  • To set up the AWS CDK, install the AWS CDK Command Line Interface (CLI). For instructions, see AWS CDK CLI reference.

Deploy the solution

The solution is accessible in the GitHub repository.

Complete the following steps to set up and deploy the solution:

  1. Clone the Repository: Run the following command: git clone git@github.com:aws-samples/sample-genai-bedrock-batch-classifier.git
  2. Set Up AWS Credentials: Create an AWS Identity and Access Management (IAM) user with appropriate permissions, generate credentials for AWS Command Line Interface (AWS CLI) access, and create a profile. For instructions, see Authenticating using IAM user credentials for the AWS CLI. You can use the Admin Role for testing purposes, although it violates the principle of least privilege and should be avoided in production environments in favor of custom roles with minimal required permissions.
  3. Bootstrap the Application: In the CDK folder, run the command npm install && cdk bootstrap --profile {your_profile_name}, replacing {your_profile_name} with your AWS profile name.
  4. Deploy the Solution: Run the command cdk deploy --all --profile {your_profile_name}, replacing {your_profile_name} with your AWS profile name.

After you complete the deployment process, you will see a total of six stacks created in your AWS account, as illustrated in the following screenshot.

List of stacks

SharedStack acts as a central hub for resources that multiple parts of the system need to access. Within this stack, there are two S3 buckets: one handles internal operations behind the scenes, and the other serves as a bridge between the system and customers, so they can both submit their classification requests and retrieve their results.

DataPreparationStack serves as a data transformation engine. It’s designed to handle incoming files in three specific formats: XLSX, CSV, and JSON, which at the time of writing are the only supported input formats. This stack’s primary role is to convert these inputs into the specialized JSONL format required by Amazon Bedrock. The data processing script is available in the GitHub repo. This transformation makes sure that incoming data, regardless of its original format, is properly structured before being processed by Amazon Bedrock. The format is as follows:

{
  "recordId": ${unique_id},
  "modelInput": {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": ${prompt},
    "messages": [
      {
        "role": "user",
        "content": [{"type": "text", "text": ${initial_text}}]
      }
    ]
  }
}

where:
initial_text - text that you want to classify
prompt       - instructions to the Bedrock model on how to classify
unique_id    - id coming from the upstream service, otherwise it will be 
               automatically generated by the code
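The following sketch shows how a data preparation step might assemble these JSONL records; the helper name and example inputs are hypothetical.

import json
import uuid


def build_record(initial_text: str, prompt: str, unique_id: str | None = None) -> str:
    """Build one JSONL line in the format shown above (hypothetical helper)."""
    record = {
        "recordId": unique_id or str(uuid.uuid4()),
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "system": prompt,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": initial_text}]}
            ],
        },
    }
    return json.dumps(record)


# Example: write a small batch of conversations to a JSONL file
conversations = ["User: I want to cancel my trip...", "User: My payment failed..."]
classification_prompt = "Classify the conversation into one of the 10 categories..."

with open("records.jsonl", "w") as f:
    for text in conversations:
        f.write(build_record(text, classification_prompt) + "\n")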

BatchClassifierStack handles the classification operations. Although currently powered by Anthropic’s Claude Haiku, the system maintains flexibility by allowing straightforward switches to alternative models as needed. This adaptability is made possible through a comprehensive constants file that serves as the system’s control center. The following configurations are available:

  • PREFIX – Resource naming convention (genai by default).
  • BEDROCK_AGENT_MODEL – Model selection.
  • BATCH_SIZE – Number of classifications per output file (enables parallel processing); the minimum should be 100.
  • CLASSIFICATION_INPUT_FOLDER – Input folder name in the S3 bucket that will be used for uploading incoming classification requests.
  • CLASSIFICATION_OUTPUT_FOLDER – Output folder name in the S3 bucket where the output files will be available after the classification completes.
  • OUTPUT_FORMAT – Supported formats (CSV, JSON, XLSX).
  • INPUT_MAPPING – A flexible data integration approach that adapts to your existing file structures rather than requiring you to adapt to ours. It consists of two key fields:
    • record_id – Optional unique identifier (auto-generated if not provided).
    • record_text – Text content for classification.
  • PROMPT – Template for guiding the model’s classification behavior. A sample prompt template is available in the GitHub repo. Pay attention to the structure of the template that guides the AI model through its decision-making process. The template not only combines a set of possible categories, but also contains instructions, requiring the model to select a single category and present it within <class> tags. These instructions help maintain consistency in how the model processes incoming requests and saves the output.

BatchResultsProcessingStack functions as the data postprocessing stage, transforming the Amazon Bedrock JSONL output into user-friendly formats. At the time of writing, the system supports CSV, JSON, and XLSX. These processed files are then stored in a designated output folder in the S3 bucket, organized by date for quick retrieval and management. The conversion scripts are available in the GitHub repo. The output files have the following schema:

  • ID – Resource naming convention
  • INPUT_TEXT – Initial text that was used for classification
  • CLASS – The classification category
  • RATIONALE – Reasoning or explanation of given classification

Excel File Sample

AnalyticsStack provides a business intelligence (BI) dashboard that displays a list of classifications and allows filtering based on the categories defined in the prompt. It offers the following key configuration options:

  • ATHENA_DATABASE_NAME – Defines the name of the Athena database that is used as the main data source for the QuickSight dashboard.
  • QUICKSIGHT_DATA_SCHEMA – Defines how labels should be displayed on the dashboard and specifies which columns are filterable.
  • QUICKSIGHT_PRINCIPAL_NAME – Designates the principal group that will have access to the QuickSight dashboard. The group should be created manually before deploying the stack.
  • QUICKSIGHT_QUERY_MODE – You can choose between SPICE or direct query for fetching data, depending on your use case, data volume, and data freshness requirements. The default setting is direct query.

Now that you’ve successfully deployed the system, you can prepare your data file—this can be either real customer data or the synthetic dataset we provided for testing. When your file is ready, go to the S3 bucket named {prefix}-{account_id}-customer-requests-bucket-{region} and upload your file to input_data folder. After the batch inference job is complete, you can view the classification results on the dashboard. You can find it under the name {prefix}-{account_id}-classifications-dashboard-{region}. The following screenshot shows a preview of what you can expect.

BI Dashboard

The dashboard will not display data until Amazon Bedrock finishes processing the batch inference jobs and the AWS Glue crawler creates the Athena table. Without these steps completed, the dashboard can’t connect to the table because it doesn’t exist yet. Additionally, you must update the QuickSight role permissions that were set up during pre-deployment. To update permissions, complete the following steps:

  1. On the QuickSight console, choose the user icon in the top navigation bar and choose Manage QuickSight.
  2. In the navigation pane, choose Security & Permissions.
  3. Verify that the role has been granted proper access to the S3 bucket with the following path format: {prefix}-{account_id}-internal-classifications-{region}.

Results

To assess the solution’s performance and reliability, we processed 1,190 synthetically generated travel agency conversations from a single Excel file across multiple runs. The results were remarkably consistent across 10 consecutive runs, with processing times ranging between 11–12 minutes per batch (200 classifications in a single batch). Our solution achieved the following:

  • Speed – Maintained consistent processing times around 11–12 minutes
  • Accuracy – Achieved 100% classification accuracy on our synthetic dataset
  • Cost-effectiveness – Optimized expenses through efficient batch processing

Challenges

For certain cases, the generated class didn’t exactly match the class name given in the prompt. For instance, in multiple cases, it output “Hotel/Flight Booking Inquiry” instead of “Booking Inquiry,” which was defined as the class in the prompt. This was addressed by prompt engineering and asking the model to check the final class output to match exactly with one of the provided classes.

Error handling

For troubleshooting purposes, the solution includes an Amazon DynamoDB table that tracks batch processing status, along with Amazon CloudWatch Logs. Error tracking is not automated and requires manual monitoring and validation.

Key takeaways

Although our testing focused on travel agency scenarios, the solution’s architecture is flexible and can be adapted to various classification needs across different industries and use cases.

Known limitations

The following are key limitations of the classification solution and should be considered when planning its use:

  • Minimum batch size – Amazon Bedrock batch inference requires at least 100 classifications per batch.
  • Processing time – The completion time of a batch inference job depends on various factors, such as job size. Although Amazon Bedrock strives to complete a typical job within 24 hours, this time frame is a best-effort estimate and not guaranteed.
  • Input file formats – The solution currently supports only CSV, JSON, and XLSX file formats for input data.

Clean up

To avoid additional charges, clean up your AWS resources when they’re no longer needed by running the command cdk destroy --all --profile {your_profile_name}, replacing {your_profile_name} with your AWS profile name.

To remove resources associated with this project, complete the following steps:

  1. Delete the S3 buckets:
    1. On the Amazon S3 console, choose Buckets in the navigation pane.
    2. Locate your buckets by searching for your {prefix}.
    3. Delete these buckets to facilitate proper cleanup.
  2. Clean up the DynamoDB resources:
    1. On the DynamoDB console, choose Tables in the navigation pane.
    2. Delete the table {prefix}-{account_id}-batch-processing-status-{region}.

This comprehensive cleanup helps make sure residual resources don’t remain in your AWS account from this project.

Conclusion

In this post, we explored how Amazon Bedrock batch inference can transform your large-scale text classification workflows. You can now automate time-consuming tasks your teams handle daily, such as analyzing lost sales opportunities, categorizing travel requests, and processing insurance claims. This solution frees your teams to focus on growing and improving your business.

Furthermore, this solution gives the opportunity to create a system that provides real-time classifications, seamlessly integrates with your communication channels, offers enhanced monitoring capabilities, and supports multiple languages for global operations.

This solution was developed for internal use in test and non-production environments only. It is the responsibility of the customer to perform their due diligence to verify the solution aligns with their compliance obligations.

We’re excited to see how you will adapt this solution to your unique challenges. Share your experience or questions in the comments—we’re here to help you get started on your automation journey.


About the authors

Nika Mishurina is a Senior Solutions Architect with Amazon Web Services. She is passionate about delighting customers through building end-to-end production-ready solutions for Amazon. Outside of work, she loves traveling, working out, and exploring new things.

Farshad Harirchi is a Principal Data Scientist at AWS Professional Services. He helps customers across industries, from retail to industrial and financial services, with the design and development of generative AI and machine learning solutions. Farshad brings extensive experience in the entire machine learning and MLOps stack. Outside of work, he enjoys traveling, playing outdoor sports, and exploring board games.


Effective cross-lingual LLM evaluation with Amazon Bedrock

Evaluating the quality of AI responses across multiple languages presents significant challenges for organizations deploying generative AI solutions globally. How can you maintain consistent performance when human evaluations require substantial resources, especially across diverse languages? Many companies find themselves struggling to scale their evaluation processes without compromising quality or breaking their budgets.

Amazon Bedrock Evaluations offers an efficient solution through its LLM-as-a-judge capability, so you can assess AI outputs consistently across linguistic barriers. This approach reduces the time and resources typically required for multilingual evaluations while maintaining high-quality standards.

In this post, we demonstrate how to use the evaluation features of Amazon Bedrock to deliver reliable results across language barriers without the need for localized prompts or custom infrastructure. Through comprehensive testing and analysis, we share practical strategies to help reduce the cost and complexity of multilingual evaluation while maintaining high standards across global large language model (LLM) deployments.

Solution overview

To scale and streamline the evaluation process, we used Amazon Bedrock Evaluations, which offers both automatic and human-based methods for assessing model and RAG system quality. To learn more, see Evaluate the performance of Amazon Bedrock resources.

Automatic evaluations

Amazon Bedrock supports two modes of automatic evaluation: programmatic evaluation using built-in metrics and LLM-as-a-judge evaluation.

For LLM-as-a-judge evaluations, you can choose from a set of built-in metrics or define your own custom metrics tailored to your specific use case. You can run these evaluations on models hosted in Amazon Bedrock or on external models by uploading your own prompt-response pairs.

Human evaluations

For use cases that require subject-matter expert judgment, Amazon Bedrock also supports human evaluation jobs. You can assign evaluations to human experts, and Amazon Bedrock manages task distribution, scoring, and result aggregation.

Human evaluations are especially valuable for establishing a baseline against which automated scores, like those from judge model evaluations, can be compared.

Evaluation dataset preparation

We used the Indonesian splits from the SEA-MTBench dataset. It is based on MT-Bench, a widely used benchmark for conversational AI assessment. The Indonesian version was manually translated by native speakers and consisted of 58 records covering a diverse range of categories such as math, reasoning, and writing.

We converted multi-turn conversations into single-turn interactions while preserving context. This allows each turn to be evaluated independently with consistent context. This conversion process resulted in 116 records for evaluation. Here’s how we approached this conversion:

Original row: {"prompts": [{"text": "prompt 1"}, {"text": "prompt 2"}]}
Converted into 2 rows in the evaluation dataset:
Human: {prompt 1}\n\nAssistant: {response 1}
Human: {prompt 1}\n\nAssistant: {response 1}\n\nHuman: {prompt 2}\n\nAssistant: {response 2}
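A minimal sketch of this conversion, assuming the row structure shown above and a matching list of model responses:

def convert_to_single_turn(row: dict, responses: list[str]) -> list[str]:
    """Expand one multi-turn row into cumulative single-turn records."""
    records = []
    history = ""
    for prompt, response in zip((p["text"] for p in row["prompts"]), responses):
        # Each record carries the full conversation so far, ending with the latest turn
        history += f"Human: {prompt}\n\nAssistant: {response}"
        records.append(history)
        history += "\n\n"
    return records


row = {"prompts": [{"text": "prompt 1"}, {"text": "prompt 2"}]}
responses = ["response 1", "response 2"]
print(convert_to_single_turn(row, responses))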

For each record, we generated responses using a stronger LLM (Model Strong-A) and a relatively weaker LLM (Model Weak-A). These outputs were later evaluated by both human annotators and LLM judges.

Establishing a human evaluation baseline

To assess evaluation quality, we first established a set of human evaluations as the baseline for comparing LLM-as-a-judge scores. A native-speaking evaluator rated each response from Model Strong-A and Model Weak-A on a 1–5 Likert helpfulness scale, using the same rubric applied in our LLM evaluator prompts.

We conducted manual evaluations on the full evaluation dataset using the human evaluation feature in Amazon Bedrock. Setting up human evaluations in Amazon Bedrock is straightforward: you upload a dataset and define the worker group, and Amazon Bedrock automatically generates the annotation UI and manages the scoring workflow and result aggregation.

The following screenshot shows a sample result from an Amazon Bedrock human evaluation job.

Custom evaluation dashboard showing distribution of 116 ratings across 5-point scale with prompt category filter options

LLM-as-a-judge evaluation setup

We evaluated responses from Model Strong-A and Model Weak-A using four judge models: Model Strong-A, Model Strong-B, Model Weak-A, and Model Weak-B. These evaluations were run using custom metrics in an LLM-as-a-judge evaluation in Amazon Bedrock, which allows flexible prompt definition and scoring without the need to manage your own infrastructure.

Each judge model was given a custom evaluation prompt aligned with the same helpfulness rubric used in the human evaluation. The prompt asked the evaluator to rate each response on a 1–5 Likert scale based on clarity, task completion, instruction adherence, and factual accuracy. We prepared both English and Indonesian versions to support multilingual testing. The English and Indonesian judge prompts follow.

English prompt:

You are given a user task and a candidate completion from an AI assistant.
Your job is to evaluate how helpful the completion is — with special attention to whether it follows the user’s instructions and produces the correct or appropriate output.


A helpful response should:


- Accurately solve the task (math, formatting, generation, extraction, etc.)
- Follow all explicit and implicit instructions
- Use appropriate tone, clarity, and structure
- Avoid hallucination, false claims, or harmful implications


Even if the response is well-written or polite, it should be rated low if it:
- Produces incorrect results or misleading explanations
- Fails to follow core instructions
- Makes basic reasoning mistakes


Scoring Guide (1–5 scale):
5 – Very Helpful
The response is correct, complete, follows instructions fully, and could be used directly by the end user with confidence.


4 – Somewhat Helpful
Minor errors, omissions, or ambiguities, but still mostly correct and usable with small modifications or human verification.


3 – Neutral / Mixed
Either (a) the response is generally correct but doesn’t really follow the user’s instruction, or (b) it follows instructions but contains significant flaws that reduce trust.


2 – Somewhat Unhelpful
The response is incorrect or irrelevant in key areas, or fails to follow instructions, but shows some effort or structure.


1 – Very Unhelpful
The response is factually wrong, ignores the task, or shows fundamental misunderstanding or no effort.


Instructions:
You will be shown:
- The user’s task
- The AI assistant’s completion


Evaluate the completion on the scale above, considering both accuracy and instruction-following as primary criteria.


Task:
{{prompt}}


Candidate Completion:
{{prediction}}

Indonesian prompt:

Anda diberikan instruksi dari pengguna beserta jawaban/penyelesaian instruksi tersebut oleh asisten AI.
Tugas Anda adalah mengevaluasi seberapa membantu jawaban tersebut — dengan fokus utama pada apakah jawaban tersebut mengikuti instruksi pengguna dengan benar dan menghasilkan output yang akurat serta sesuai.


Sebuah jawaban dianggap membantu jika:
- Menyelesaikan instruksi dengan akurat (perhitungan matematika, pemformatan, pembuatan konten, ekstraksi data, dll.)
- Mengikuti semua instruksi eksplisit maupun implisit dari pengguna
- Menggunakan nada, kejelasan, dan struktur yang sesuai
- Menghindari halusinasi, klaim yang salah, atau implikasi yang berbahaya


Meskipun jawaban terdengar baik atau sopan, tetap harus diberi nilai rendah jika:
- Memberikan hasil yang salah atau penjelasan yang menyesatkan
- Gagal mengikuti inti dari instruksi pengguna
- Membuat kesalahan penalaran yang mendasar


Panduan Penilaian (Skala 1–5):
5 – Sangat Membantu
Jawaban benar, lengkap, mengikuti instruksi pengguna sepenuhnya, dan dapat langsung digunakan oleh pengguna dengan percaya diri.


4 – Cukup Membantu
Ada sedikit kesalahan, kekurangan, atau ambiguitas, tetapi jawaban secara umum benar dan masih dapat digunakan dengan sedikit perbaikan atau verifikasi manual.


3 – Netral
Baik (a) jawabannya secara umum benar tetapi tidak sepenuhnya mengikuti instruksi pengguna, atau (b) jawabannya mengikuti instruksi tetapi mengandung kesalahan besar yang mengurangi tingkat kepercayaan.


2 – Kurang Membantu
Jawaban salah atau tidak relevan pada bagian-bagian penting, atau tidak mengikuti instruksi pengguna, tetapi masih menunjukkan upaya atau struktur penyelesaian.


1 – Sangat Tidak Membantu
Jawaban salah secara fakta, mengabaikan instruksi pengguna, menunjukkan kesalahpahaman mendasar, atau tidak menunjukkan adanya upaya untuk menyelesaikan instruksi.


Petunjuk penilaian:
Anda akan diberikan:
- Instruksi dari pengguna
- Jawaban dari asisten AI


Evaluasilah jawaban tersebut menggunakan skala di atas, dengan mempertimbangkan akurasi dan kepatuhan terhadap instruksi pengguna sebagai kriteria utama.


Instruksi pengguna:
{{prompt}}


Jawaban asisten AI:
{{prediction}}

To measure alignment, we used two standard metrics:

  • Pearson correlation – Measures the linear relationship between score values. Useful for detecting overall similarity in score trends.
  • Cohen’s kappa (linear weighted) – Captures agreement between evaluators, adjusted for chance. Especially useful for discrete scales like Likert scores.
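Both metrics are available in common Python libraries. The following sketch computes them for a pair of hypothetical score lists using scipy and scikit-learn.

from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical example scores on the 1-5 helpfulness scale
human_scores = [5, 4, 3, 5, 2, 4, 5, 3]
judge_scores = [5, 5, 3, 4, 3, 4, 5, 4]

pearson_corr, _ = pearsonr(human_scores, judge_scores)

# Linear weighting penalizes larger disagreements more than adjacent ones
kappa = cohen_kappa_score(human_scores, judge_scores, weights="linear")

print(f"Pearson correlation: {pearson_corr:.2f}")
print(f"Linear weighted Cohen's kappa: {kappa:.2f}")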

Alignment between LLM judges and human evaluations

We began by comparing the average helpfulness scores given by each evaluator using the English judge prompt. The following chart shows the evaluation results.

Comparative analysis of helpfulness scores between the human evaluator and the models (Strong-A/B, Weak-A/B), showing ratings between 4.11 and 4.93

When evaluating responses from the stronger model, LLM judges tended to agree with human ratings. But on responses from the weaker model, most LLMs gave noticeably higher scores than humans. This suggests that LLM judges tend to be more generous when response quality is lower.

We designed the evaluation prompt to guide models toward scoring behavior similar to human annotators, but score patterns still showed signs of potential bias. Model Strong-A rated its own outputs highly (4.93), and Model Weak-A likewise gave its own responses a higher score than humans did. In contrast, Model Strong-B, which didn’t evaluate its own outputs, gave scores that were closer to human ratings.

To better understand alignment between LLM judges and human preferences, we analyzed Pearson and Cohen’s kappa correlations between them. On responses from Model Weak-A, alignment was strong. Model Strong-A and Model Strong-B achieved Pearson correlations of 0.45 and 0.61, with kappa scores of 0.33 and 0.4.

Alignment between LLM judges and humans on responses from Model Strong-A was more moderate. All evaluators had Pearson correlations of 0.26–0.33 and weighted kappa scores of 0.2–0.22. This might be due to limited variation in either human or model scores, which reduces the ability to detect strong correlation patterns.

To complete our analysis, we also conducted a qualitative deep dive. Amazon Bedrock makes this straightforward by providing JSONL outputs from each LLM-as-a-judge run that include both the evaluation score and the model’s reasoning. This helped us review evaluator justifications and identify cases where scores were incorrectly extracted or parsed.
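The following is a minimal sketch of how such a review can start from the JSONL output. The field names used here (score, reasoning) are assumptions for illustration only; adjust them to match the actual schema produced by your evaluation job.

import json

# Load LLM-as-a-judge results for a qualitative review.
# NOTE: "score" and "reasoning" are assumed field names for illustration;
# replace them with the keys present in your evaluation job output.
records = []
with open("judge_output.jsonl") as f:   # hypothetical local copy of the S3 output
    for line in f:
        if line.strip():
            records.append(json.loads(line))

# Surface low or missing scores so their justifications can be reviewed by hand.
for rec in records:
    if rec.get("score") is None or rec["score"] <= 2:
        print(rec.get("score"), "-", rec.get("reasoning", "")[:200])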

From this review, we identified several factors behind the misalignment between LLM and human judgments:

  • Evaluator capability ceiling – In some cases, especially in reasoning tasks, the LLM evaluator couldn’t solve the original task itself. This made its evaluations flawed and unreliable at identifying whether a response was correct.
  • Evaluation hallucination – In other cases, the LLM evaluator assigned low scores to correct answers not because of reasoning failure, but because it imagined errors or flawed logic in responses that were actually valid.
  • Overriding instructions – Certain models occasionally overrode explicit instructions based on ethical judgment. For example, two evaluator models rated a response that created misleading political campaign content as very unhelpful (even though the response included its own warnings), whereas human evaluators rated it very helpful for following the task.

These problems highlight the importance of using human evaluations as a baseline and performing qualitative deep dives to fully understand LLM-as-a-judge results.

Cross-lingual evaluation capabilities

After analyzing evaluation results from the English judge prompt, we moved to the final step of our analysis: comparing evaluation results between English and Indonesian judge prompts.

We began by comparing overall helpfulness scores and alignment with human ratings. Helpfulness scores remained nearly identical for all models, with most shifts within ±0.05. Alignment with human ratings was also similar: Pearson correlations between human scores and LLM-as-a-judge using Indonesian judge prompts closely matched those using English judge prompts. In statistically meaningful cases, correlation score differences were typically within ±0.1.

To further assess cross-language consistency, we computed Pearson correlation and Cohen’s kappa directly between LLM-as-a-judge evaluation scores generated using English and Indonesian judge prompts on the same response set. The following tables show correlation between scores from Indonesian and English judge prompts for each evaluator LLM, on responses generated by Model Weak-A and Model Strong-A.

The first table summarizes the evaluation of Model Weak-A responses.

Metric                 Model Strong-A   Model Strong-B   Model Weak-A   Model Weak-B
Pearson correlation    0.73             0.79             0.64           0.64
Cohen’s kappa          0.59             0.69             0.42           0.49

The next table summarizes the evaluation of Model Strong-A responses.

Metric                 Model Strong-A   Model Strong-B   Model Weak-A   Model Weak-B
Pearson correlation    0.41             0.8              0.51           0.7
Cohen’s kappa          0.36             0.65             0.43           0.61

Correlation between evaluation results from both judge prompt languages was strong across all evaluator models. On average, Pearson correlation was 0.65 and Cohen’s kappa was 0.53 across all models.

We also conducted a qualitative review comparing evaluations from both evaluation prompt languages for Model Strong-A and Model Strong-B. Overall, both models showed consistent reasoning across languages in most cases. However, occasional hallucinated errors or flawed logic occurred at similar rates across both languages (we should note that humans make occasional mistakes as well).

One interesting pattern we observed with one of the stronger evaluator models was that it tended to follow the evaluation prompt more strictly in the Indonesian version. For example, it rated a response as unhelpful when it refused to generate misleading political content, even though the task explicitly asked for it; this behavior differed from the English prompt evaluation. In a few cases, it also assigned a noticeably stricter score than the English evaluator prompt even though the reasoning in both languages was similar, which more closely matched how humans typically evaluate.

These results confirm that although prompt translation remains a useful option, it is not required to achieve consistent evaluation. You can rely on English evaluator prompts even for non-English outputs, for example by using Amazon Bedrock LLM-as-a-judge predefined and custom metrics to make multilingual evaluation simpler and more scalable.

Takeaways

The following are key takeaways for building a robust LLM evaluation framework:

  • LLM-as-a-judge is a practical evaluation method – It offers faster, cheaper, and scalable assessments while maintaining reasonable judgment quality across languages. This makes it suitable for large-scale deployments.
  • Choose a judge model based on practical evaluation needs – Across our experiments, stronger models aligned better with human ratings, especially on weaker outputs. However, even top models can misjudge harder tasks or show self-bias. Use capable, neutral evaluators to facilitate fair comparisons.
  • Manual human evaluations remain essential – Human evaluations provide the reference baseline for benchmarking automated scoring and understanding model judgment behavior.
  • Prompt design meaningfully shapes evaluator behavior – Aligning your evaluation prompt with how humans actually score improves quality and trust in LLM-based evaluations.
  • Translated evaluation prompts are helpful but not required – English evaluator prompts reliably judge non-English responses, especially for evaluator models that support multilingual input.
  • Always be ready to deep dive with qualitative analysis – Reviewing evaluation disagreements by hand helps uncover hidden model behaviors and makes sure that statistical metrics tell the full story.
  • Simplify your evaluation workflow using Amazon Bedrock evaluation features – Amazon Bedrock built-in human evaluation and LLM-as-a-judge evaluation capabilities simplify iteration and streamline your evaluation workflow.

Conclusion

Through our experiments, we demonstrated that LLM-as-a-judge evaluations can deliver consistent and reliable results across languages, even without prompt translation. With properly designed evaluation prompts, LLMs can maintain high alignment with human ratings regardless of evaluator prompt language. Though we focused on Indonesian, the results indicate similar techniques are likely effective for other non-English languages, but you are encouraged to assess for yourself on any language you choose. This reduces the need to create localized evaluation prompts for every target audience.

To level up your evaluation practices, consider the following ways to extend your approach beyond foundation model scoring:

  • Evaluate your Retrieval Augmented Generation (RAG) pipeline, assessing not just LLM responses but also retrieval quality using Amazon Bedrock RAG evaluation capabilities
  • Evaluate and monitor continuously, and run evaluations before production launch, during live operation, and ahead of any major system upgrades

Begin your cross-lingual evaluation journey today with Amazon Bedrock Evaluations and scale your AI solutions confidently across global landscapes.


About the authors

Riza Saputra is a Senior Solutions Architect at AWS, working with startups of all stages to help them grow securely, scale efficiently, and innovate faster. His current focus is on generative AI, guiding organizations in building and scaling AI solutions securely and efficiently. With experience across roles, industries, and company sizes, he brings a versatile perspective to solving technical and business challenges. Riza also shares his knowledge through public speaking and content to support the broader tech community.

Read More

Cohere Embed 4 multimodal embeddings model is now available on Amazon SageMaker JumpStart

Cohere Embed 4 multimodal embeddings model is now available on Amazon SageMaker JumpStart

This post is co-written with Payal Singh from Cohere.

The Cohere Embed 4 multimodal embeddings model is now generally available on Amazon SageMaker JumpStart. The Embed 4 model is built for multimodal business documents, has leading multilingual capabilities, and offers notable improvement over Embed 3 across key benchmarks.

In this post, we discuss the benefits and capabilities of this new model. We also walk you through how to deploy and use the Embed 4 model using SageMaker JumpStart.

Cohere Embed 4 overview

Embed 4 is the most recent addition to the Cohere Embed family of enterprise-focused large language models (LLMs). It delivers state-of-the-art multimodality. This is useful because businesses continue to store the majority of important data in an unstructured format. Document formats include intricate PDF reports, presentation slides, as well as text-based documents or design files that might include images, tables, graphs, code, and diagrams. Without the ability to natively understand complex multimodal documents, these types of documents become repositories of unsearchable information. With Embed 4, enterprises and their employees can search across text, image, and multimodal documents. Embed 4 also offers leading multilingual capabilities, understanding over 100 languages, including Arabic, French, Japanese, and Korean. This capability is useful to global enterprises that handle documents in multiple languages. Employees can also find critical data even if the information isn’t stored using a language they speak. Overall, Embed 4 empowers global enterprises to break down language barriers and manage information in the languages most familiar to their customers.

In the following diagram (source), each language category represents a blend of public and proprietary benchmarks (see more details). Tasks ranged from monolingual to cross-lingual (English as the query language and the respective monolingual non-English language as the corpus). Dataset performance metrics are measured by NDCG@10.

Embeddings models are already being used to handle documents with both text and images. However, optimal performance usually requires additional complexity because a multimodal generative model must preprocess documents into a format that is suitable for the embeddings model. Embed 4 can transform different modalities such as images, texts, and interleaved images and texts into a single vector representation. Processing a single payload of images and text decreases the operational burden associated with handling documents.

Embed 4 can also generate embeddings for documents up to 128,000 tokens, approximately 200 pages in length. This extended capacity alleviates the need for custom logic to split lengthy documents, making it straightforward to process financial reports, product manuals, and detailed legal contracts. In contrast, models with shorter context lengths force developers to create complex workflows to split documents while preserving their logical structure. With Embed 4, as long as a document fits within the 128,000-token limit, it can be converted into a high-quality, unified vector representation.

Embed 4 also has enhancements for security-minded industries such as finance, healthcare, and manufacturing. Businesses in these regulated industries need models that have both strong general business knowledge as well as domain-specific understanding. Business data also tends to be imperfect. Documents often come with spelling mistakes and formatting issues such as improper portrait and landscape orientation. Embed 4 was trained to be robust against noisy real-world data that also includes scanned documents and handwriting. This further alleviates the need for complex and expensive data preprocessing pipelines.

Use cases

In this section, we discuss several possible use cases for Embed 4, which unlocks a range of capabilities for enterprises seeking to streamline information discovery, enhance generative AI workflows, and optimize storage efficiency. Below, we highlight several key use cases that demonstrate the versatility of Embed 4 in a range of regulated industries.

Simplifying multimodal search

You can take advantage of Embed 4 multimodal capabilities in use cases that require precise semantic search. For example, in the retail industry, it might be helpful to search with both text and image. An example search phrase can even include a modifier (for example, “Same style pants but with no stripes”). The same logic can be applied to an analyst’s workflow where users might need to find the right charts and diagrams to explain trends. This is traditionally a time-consuming process that requires manual effort to sift through documents and contextualize information. Because Embed 4 has enhancements for finance, healthcare, and manufacturing, users in these industries can take advantage of built-in domain-specific understanding as well as strong general knowledge to accelerate the time to value. With Embed 4, it’s straightforward to conduct research and turn data into actionable insights.

Powering Retrieval Augmented Generation workflows

Another use case is Retrieval Augmented Generation (RAG) applications that require access to internal information. With the 128,000 context length of Embed 4, businesses can use existing long-form documents that include images without the need to implement complex preprocessing pipelines. An example might be a generative AI application built to assist analysts with M&A due diligence. Having access to a broader repository of information increases the likelihood of making informed decisions.

Optimizing agentic AI workflows with compressed embeddings

Businesses can use intelligent AI agents to reduce unnecessary costs, automate repetitive tasks, and reduce human errors. AI agents need relevant and contextual information to perform tasks accurately, which is typically provided through RAG: the generative model that powers the conversational experience relies on a search engine connected to company data sources to retrieve relevant information before producing the final response. For example, an agent might need to extract relevant conversation logs to analyze customer sentiment about a specific product and deduce the most effective next step in a customer interaction.

Embed 4 serves as the search engine for enterprise AI assistants and agents, improving the accuracy of responses and helping mitigate hallucinations.

At scale, storing embeddings can lead to high storage costs. Embed 4 is designed to output compressed embeddings, where users can choose their own dimension size (for example, 256, 512, 1024, or 1536). This helps organizations save up to 83% on storage costs while maintaining search accuracy.
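As a rough illustration of how the dimension choice drives storage, the following sketch compares raw vector index sizes for a hypothetical corpus of 10 million documents. The figures are back-of-the-envelope arithmetic, not benchmark results.

NUM_VECTORS = 10_000_000

def index_size_gb(dimensions, bytes_per_value):
    # Raw size of the vector data, ignoring index overhead
    return NUM_VECTORS * dimensions * bytes_per_value / 1e9

baseline = index_size_gb(1536, 4)   # 1536-dim float32 vectors
reduced = index_size_gb(256, 4)     # 256-dim float32 vectors

print(f"1536 dims (float32): {baseline:.1f} GB")
print(f"256 dims  (float32): {reduced:.1f} GB")
print(f"reduction: {100 * (1 - reduced / baseline):.0f}%")   # roughly 83%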

The following diagram illustrates retrieval quality vs. storage cost across different models (source). Compression can occur on the format precision of the vectors (binary, int8, and fp32) and the dimension of the vectors. Dataset performance metrics are measured by NDCG@10.

Using domain-specific understanding for regulated industries

With Embed 4 enhancements, you can surface relevant insights from complex financial documents like investor presentations, annual reports, and M&A due diligence files. Embed 4 can also extract key information from healthcare documents such as medical records, procedural charts, and clinical trial reports. For manufacturing use cases, Embed 4 can handle product specifications, repair guides, and supply chain plans. These capabilities unlock a broader range of use cases because enterprises can use models out of the box without costly fine-tuning efforts.

Solution overview

SageMaker JumpStart onboards and maintains foundation models (FMs) for you to access and integrate into machine learning (ML) lifecycles. The FMs available in SageMaker JumpStart include publicly available FMs as well as proprietary FMs from third-party providers.

Amazon SageMaker AI is a fully managed ML service. It helps data scientists and developers quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. Amazon SageMaker Studio is a single web-based experience for running ML workflows. It provides access to your SageMaker AI resources in one interface.

In the following sections, we show how to get started with Cohere Embed 4.

Prerequisites

Make sure you meet the following prerequisites:

  • Make sure your SageMaker AWS Identity and Access Management (IAM) role has the AmazonSageMakerFullAccess permission policy attached.
  • To deploy Cohere Embed 4 successfully, confirm that your IAM role has the following permissions and that you have the authority to make AWS Marketplace subscriptions in the AWS account used (a policy sketch follows this list):
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, skip to the next section in this post.
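If the role is missing the AWS Marketplace permissions listed above, one way to grant them is an inline policy. The following boto3 sketch is illustrative only; the role and policy names are placeholders, and you might prefer to manage this through the IAM console or infrastructure as code instead.

import json
import boto3

iam = boto3.client("iam")

# Inline policy containing the three AWS Marketplace actions listed above
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

iam.put_role_policy(
    RoleName="MySageMakerExecutionRole",          # placeholder role name
    PolicyName="MarketplaceSubscriptionAccess",   # placeholder policy name
    PolicyDocument=json.dumps(policy),
)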

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. In the AWS Marketplace listing, choose Continue to Subscribe.

  2. On the Subscribe to this software page, review the offer and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  3. Choose Continue to configuration and then choose an AWS Region.

You will see a product Amazon Resource Name (ARN) displayed. This is the model package ARN that you must specify while creating a deployable model using Boto3.

Deploy Cohere Embed 4 for inference through SageMaker JumpStart

If you want to start using Embed 4 immediately, you can choose from three available launch methods: the AWS CloudFormation template, the SageMaker console, or the AWS Command Line Interface (AWS CLI). You will incur costs for software use based on hourly pricing as long as your endpoint is running. You will also incur costs for infrastructure use, independent of and in addition to the software costs.

Choose the appropriate model package ARN for your Region. For example, the ARN for Cohere Embed 4 is:

 arn:aws:sagemaker:[REGION]:[ACCOUNT_ID]:model-package/cohere-embed-v4-1-04072025-17ec0571acd93686b6cfb44babe01d66
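With the model package ARN in hand, one way to create and deploy the model programmatically is the SageMaker Python SDK’s ModelPackage class (which wraps Boto3). The following is a minimal sketch; the ARN, role, instance type, and endpoint name are placeholders you should replace with values from your account and the listing’s recommendations.

import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # works inside SageMaker; otherwise pass an explicit IAM role ARN

# Placeholder: use the model package ARN shown for your Region
model_package_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/cohere-embed-v4-..."

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# Deploy a real-time endpoint; choose an instance type recommended in the listing
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",     # placeholder instance type
    endpoint_name="cohere-embed-4",   # placeholder endpoint name
)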

Alternatively, you can deploy through SageMaker Studio: open JumpStart and search for the Cohere Embed 4 model. If you don’t yet have a domain, refer to Guide to getting set up with Amazon SageMaker AI to create one. Deployment starts when you choose Deploy.

When deployment is complete, an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

To use the Python SDK example code, choose Test inference and Open in JupyterLab. If you don’t have a JupyterLab space yet, refer to Create a space.
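Alternatively, you can call the endpoint directly from your own environment with the SageMaker runtime client, as in the following sketch. The request fields shown (texts, input_type) follow the general Cohere Embed request style but are assumptions here; verify the exact payload against the usage instructions that ship with the model listing.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Assumed request shape, modeled on the Cohere Embed API; confirm against the listing's sample notebook
payload = {
    "texts": ["What were the key risks listed in the 2024 annual report?"],
    "input_type": "search_query",
}

response = runtime.invoke_endpoint(
    EndpointName="cohere-embed-4",   # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(list(result.keys()))  # inspect the response structure before wiring it into your application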

Clean up

After you finish running the notebook and experimenting with the Embed 4 model, it’s important to clean up the resources you have provisioned; otherwise, unnecessary charges will continue to accrue on your account. To delete the endpoint using the SageMaker AI console, complete the following steps:

  1. On the SageMaker AI console, under Inference in the navigation pane, choose Endpoints.
  2. Choose the endpoint you created.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.

Conclusion

In this post, we explored how Cohere Embed 4, now available on SageMaker JumpStart, delivers state-of-the-art multimodal embedding capabilities. These capabilities make it particularly valuable for enterprises working with unstructured data across finance, healthcare, manufacturing, and other regulated industries.

Interested in diving deeper? Check out the Cohere on AWS GitHub repo.


About the authors

James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.

Mehran Najafi, PhD, serves as AWS Principal Solutions Architect and leads the Generative AI Solution Architects team for AWS Canada. His expertise lies in ensuring the scalability, optimization, and production deployment of multi-tenant generative AI solutions for enterprise customers.

John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.

Hugo Tse is a Solutions Architect at AWS, with a focus on Generative AI and Storage solutions. He is dedicated to empowering customers to overcome challenges and unlock new business opportunities using technology. He holds a Bachelor of Arts in Economics from the University of Chicago and a Master of Science in Information Technology from Arizona State University.

Payal Singh is a Solutions Architect at Cohere with over 15 years of cross-domain expertise in DevOps, Cloud, Security, SDN, Data Center Architecture, and Virtualization. She drives partnerships at Cohere and helps customers with complex GenAI solution integrations.

Read More

How INRIX accelerates transportation planning with Amazon Bedrock

How INRIX accelerates transportation planning with Amazon Bedrock

This post is co-written with Shashank Saraogi, Nat Gale, and Durran Kelly from INRIX.

The complexity of modern traffic management extends far beyond mere road monitoring, encompassing massive amounts of data collected worldwide from connected cars, mobile devices, roadway sensors, and major event monitoring systems. For transportation authorities managing urban, suburban, and rural traffic flow, the challenge lies in effectively processing and acting upon this vast network of information. The task requires balancing immediate operational needs, such as real-time traffic redirection during incidents, with strategic long-term planning for improved mobility and safety.

Traditionally, analyzing these complex data patterns and producing actionable insights has been a resource-intensive process requiring extensive collaboration. With recent advances in generative AI, there is an opportunity to transform how we process, understand, and act upon transportation data, enabling more efficient and responsive traffic management systems.

In this post, we partnered with Amazon Web Services (AWS) customer INRIX to demonstrate how Amazon Bedrock can be used to determine the best countermeasures for specific city locations using rich transportation data and how such countermeasures can be automatically visualized in street view images. This approach allows for significant planning acceleration compared to traditional approaches using conceptual drawings.

INRIX pioneered the use of GPS data from connected vehicles for transportation intelligence. For over 20 years, INRIX has been a leader for probe-based connected vehicle and device data and insights, powering automotive, enterprise, and public sector use cases. INRIX’s products range from tickerized datasets that inform investment decisions for the financial services sector to digital twins for the public rights-of-way in the cities of Philadelphia and San Francisco. INRIX was the first company to develop a crowd-sourced traffic network, and they continue to lead in real-time mobility operations.

In June 2024, the State of California’s Department of Transportation (Caltrans) selected INRIX for a proof of concept for a generative AI-powered solution to improve safety for vulnerable road users (VRUs). The problem statement sought to harness the combination of Caltrans’ asset, crash, and points-of-interest (POI) data and INRIX’s 50 petabyte (PB) data lake to anticipate high-risk locations and quickly generate empirically validated safety measures to mitigate the potential for crashes. Trained on real-time and historical data and industry research and manuals, the solution provides a new systemic, safety-based methodology for risk assessment, location prioritization, and project implementation.

Solution overview

INRIX announced INRIX Compass in November 2023. INRIX Compass is an application that harnesses generative AI and INRIX’s 50 PB data lake to solve transportation challenges. This solution uses INRIX Compass countermeasures as the input, an AWS serverless architecture, and Amazon Nova Canvas as the image visualizer.

The following diagram shows the architecture of INRIX Compass.

INRIX Compass for countermeasures

By using INRIX Compass, users can ask natural language queries such as, “Where are the top five locations with the highest risk for vulnerable road users?” and “Can you recommend a suite of proven safety countermeasures at each of these locations?” Furthermore, users can probe deeper into the roadway characteristics that contribute to risk factors and find similar locations in the roadway network that meet those conditions. Behind the scenes, Compass AI uses Retrieval Augmented Generation (RAG) and Amazon Bedrock-powered foundation models (FMs) to query the roadway network to identify and prioritize locations with systemic risk factors and anomalous safety patterns. The solution provides prioritized recommendations for operational and design solutions and countermeasures based on industry knowledge.

The following image shows the interface of INRIX Compass.

Image visualization for countermeasures

The generation of countermeasure suggestions represents the initial phase in transportation planning. The crucial next step is visualizing those countermeasures by preparing conceptual drawings. This process has traditionally been time-consuming due to the involvement of multiple specialized teams, including:

  • Transportation engineers who assess technical feasibility and safety standards
  • Urban planners who verify alignment with city development goals
  • Landscape architects who integrate environmental and aesthetic elements
  • CAD or visualization specialists who create detailed technical drawings
  • Safety analysts who evaluate the potential impact on road safety
  • Public works departments who oversee implementation feasibility
  • Traffic operations teams who assess impact on traffic flow and management

These teams work collaboratively, creating and iteratively refining various visualizations based on feedback from urban designers and other stakeholders. Each iteration cycle typically involves multiple rounds of reviews, adjustments, and approvals, often extending the timeline significantly. The complexity is further amplified by city-specific rules and design requirements, which often necessitate significant customization. Additionally, local regulations, environmental considerations, and community feedback must be incorporated into the design process. Consequently, this lengthy and costly process frequently leads to delays in implementing safety countermeasures.

To address this challenge, INRIX has pioneered an innovative approach to the visualization phase by using generative AI technology. This prototyped solution enables rapid iteration of conceptual drawings that can be efficiently reviewed by various teams, potentially reducing the design cycle from weeks to days. Moreover, the system incorporates a few-shot learning approach with reference images and carefully crafted prompts, allowing for seamless integration of city-specific requirements into the generated outputs. This approach not only accelerates the design process but also supports consistency across different projects while maintaining compliance with local standards.

The following image shows the congestion insights by INRIX Compass.

Amazon Nova Canvas for conceptual visualizations

INRIX developed and prototyped this solution using Amazon Nova models. Amazon Nova Canvas delivers advanced image processing through text-to-image generation and image-to-image transformation capabilities. The model provides sophisticated controls for adjusting color schemes and manipulating layouts to achieve desired visual outcomes. To promote responsible AI implementation, Amazon Nova Canvas incorporates built-in safety measures, including watermarking and content moderation systems.

The model supports a comprehensive range of image editing operations. These operations encompass basic image generation, object removal from existing images, object replacement within scenes, creation of image variations, and modification of image backgrounds. This versatility makes Amazon Nova Canvas suitable for a wide range of professional applications requiring sophisticated image editing.

The following sample images show an example of countermeasures visualization.

In-painting implementation in Compass AI

Amazon Nova Canvas integrates with INRIX Compass’s existing natural language analytics capabilities. The original Compass system generated text-based countermeasure recommendations based on:

  • Historical transportation data analysis
  • Current environmental conditions
  • User-specified requirements

The INRIX Compass visualization feature specifically uses the image generation and in-painting capabilities of Amazon Nova Canvas. In-painting enables object replacement through two distinct approaches:

  • A binary mask precisely defines the areas targeted for replacement.
  • Text prompts identify objects for replacement, allowing the model to interpret and modify the specified elements while maintaining visual coherence with the surrounding image context.

This functionality provides seamless integration of new elements while preserving the overall image composition and contextual relevance. The developed interface accommodates both image generation and in-painting approaches, providing comprehensive image editing capabilities.

The implementation follows a two-stage process for visualizing transportation countermeasures. Initially, the system employs image generation functionality to create street-view representations corresponding to specific longitude and latitude coordinates where interventions are proposed. Following the initial image creation, the in-painting capability enables precise placement of countermeasures within the generated street view scene. This sequential approach provides accurate visualization of proposed modifications within the actual geographical context.

An Amazon Bedrock API facilitates image editing and generation through the Amazon Nova Canvas model. The responses contain the generated or modified images in base64 format, which can be decoded and processed for further use in the application. The generative AI capabilities of Amazon Bedrock enable rapid iteration and simultaneous visualization of multiple countermeasures within a single image. RAG implementation can further extend the pipeline’s capabilities by incorporating county-specific regulations, standardized design patterns, and contextual requirements. The integration of these technologies significantly streamlines the countermeasure deployment workflow. Traditional manual visualization processes that previously required extensive time and resources can now be executed efficiently through automated generation and modification. This automation delivers substantial improvements in both time-to-deployment and cost-effectiveness.
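To make the flow concrete, the following is a hedged sketch of an in-painting call to Amazon Nova Canvas through the Amazon Bedrock Runtime API. The prompts, mask description, and image file are placeholders, and you should confirm the request fields against the current Amazon Nova Canvas documentation before relying on them.

import base64
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Base64-encode the generated street-view image that will be edited (placeholder file name)
with open("street_view.png", "rb") as f:
    source_image = base64.b64encode(f.read()).decode("utf-8")

# Assumed request shape for an in-painting task; verify field names in the Nova Canvas docs
request = {
    "taskType": "INPAINTING",
    "inPaintingParams": {
        "image": source_image,
        # Text prompt describing the countermeasure to paint into the masked region
        "text": "a raised crosswalk with high-visibility striping",
        # Mask prompt identifying the region to replace (instead of a binary mask image)
        "maskPrompt": "the crosswalk area at the intersection",
    },
    "imageGenerationConfig": {"numberOfImages": 1, "cfgScale": 8.0, "seed": 0},
}

response = bedrock_runtime.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(request),
)

body = json.loads(response["body"].read())
edited_image = base64.b64decode(body["images"][0])  # images are returned as base64 strings
with open("street_view_with_countermeasure.png", "wb") as f:
    f.write(edited_image)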

Conclusion

The partnership between INRIX and AWS showcases the transformative potential of AI in solving complex transportation challenges. By using Amazon Bedrock FMs, INRIX has turned their massive 50 PB data lake into actionable insights through effective visualization solutions. This post highlighted a single specific transportation use case, but Amazon Bedrock and Amazon Nova power a wide spectrum of applications, from text generation to video creation. The combination of extensive data and advanced AI capabilities continues to pave the way for smarter, more efficient transportation systems worldwide.

For more information, check out the documentation for Amazon Nova Foundation Models, Amazon Bedrock, and INRIX Compass.


About the authors

Arun is a Senior Solutions Architect at AWS, supporting enterprise customers in the Pacific Northwest. He’s passionate about solving business and technology challenges as an AWS customer advocate, with his recent interest being AI strategy. When not at work, Arun enjoys listening to podcasts, going for short trail runs, and spending quality time with his family.

Alicja Kwasniewska, PhD, is an AI leader driving generative AI innovations in enterprise solutions and decision intelligence for customer engagements in North America, advertisement and marketing verticals at AWS. She is recognized among the top 10 women in AI and 100 women in data science. Alicja has published more than 40 peer-reviewed publications. She also serves as a reviewer for top-tier conferences, including ICML, NeurIPS, and ICCV. She advises organizations on AI adoption, bridging research and industry to accelerate real-world AI applications.

Shashank is the VP of Engineering at INRIX, where he leads multiple verticals, including generative AI and traffic. He is passionate about using technology to make roads safer for drivers, bikers, and pedestrians every day. Prior to working at INRIX, he held engineering leadership roles at Amazon and Lyft. Shashank brings deep experience in building impactful products and high-performing teams at scale. Outside of work, he enjoys traveling, listening to music, and spending time with his family.

Nat Gale is the Head of Product at INRIX, where he manages the Safety and Traffic product verticals. Nat leads the development of data products and software that help transportation professionals make smart, more informed decisions. He previously ran the City of Los Angeles’ Vision Zero program and was the Director of Capital Projects and Operations for the City of Hartford, CT.

Durran is a Lead Software Engineer at INRIX, where he designs scalable backend systems and mentors engineers across multiple product lines. With over a decade of experience in software development, he specializes in distributed systems, generative AI, and cloud infrastructure. Durran is passionate about writing clean, maintainable code and sharing best practices with the developer community. Outside of work, he enjoys spending quality time with his family and deepening his Japanese language skills.

Read More

Qwen3 family of reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

Qwen3 family of reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

Today, we are excited to announce that Qwen3, the latest generation of large language models (LLMs) in the Qwen family, is available through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can deploy the Qwen3 models—available in 0.6B, 4B, 8B, and 32B parameter sizes—to build, experiment, and responsibly scale your generative AI applications on AWS.

In this post, we demonstrate how to get started with Qwen3 on Amazon Bedrock Marketplace and SageMaker JumpStart. You can follow similar steps to deploy the distilled versions of the models as well.

Solution overview

Qwen3 is the latest generation of LLMs in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

  • Unique support of seamless switching between thinking mode and non-thinking mode within a single model, providing optimal performance across various scenarios.
  • Significantly enhanced in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
  • Good human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
  • Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open source models in complex agent-based tasks.
  • Support for over 100 languages and dialects with strong capabilities for multilingual instruction following and translation.

Prerequisites

To deploy Qwen3 models, make sure you have access to the recommended instance types based on the model size. You can find these instance recommendations on Amazon Bedrock Marketplace or the SageMaker JumpStart console. To verify you have the necessary resources, complete the following steps:

  1. Open the Service Quotas console.
  2. Under AWS Services, select Amazon SageMaker.
  3. Check that you have sufficient quota for the required instance type for endpoint deployment.
  4. Make sure at least one of these instance types is available in your target AWS Region.

If needed, request a quota increase and contact your AWS account team for support.
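You can also script this check with the Service Quotas API. The following sketch lists SageMaker endpoint-usage quotas whose names mention a given instance type; the instance type string is a placeholder, and quota names may vary slightly by Region.

import boto3

quotas = boto3.client("service-quotas")

instance_type = "ml.g5.12xlarge"  # placeholder: the instance type you plan to deploy on

# SageMaker endpoint quotas are typically named like "ml.g5.12xlarge for endpoint usage"
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if instance_type in quota["QuotaName"] and "endpoint" in quota["QuotaName"].lower():
            print(quota["QuotaName"], "->", quota["Value"])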

Deploy Qwen3 in Amazon Bedrock Marketplace

Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access Qwen3 in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
  2. Filter for Hugging Face as a provider and choose a Qwen3 model. For this example, we use the Qwen3-32B model.

The Amazon Bedrock model catalog displays Qwen3 text generation models. The interface includes a left navigation panel with filters for model collection, providers, and modality, while the main content area shows model cards with deployment information.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Qwen3-32B in your applications.

  1. To begin using Qwen3-32B, choose Deploy.

The details page displays comprehensive information about the Qwen3 32B model, including its version, delivery method, release date, model ID, and deployment status. The interface includes deployment options and playground access.

You will be prompted to configure the deployment details for Qwen3-32B. The model ID will be pre-populated.

  1. For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
  2. For Number of instances, enter a number of instances (between 1–100).
  3. For Instance type, choose your instance type. For optimal performance with Qwen3-32B, a GPU-based instance type like ml.g5.12xlarge is recommended.
  4. To deploy the model, choose Deploy.

The deployment configuration page displays essential settings for hosting a Bedrock model endpoint in SageMaker. It includes fields for Model ID, Endpoint name, Number of instances, and Instance type selection.

When the deployment is complete, you can test Qwen3-32B’s capabilities directly in the Amazon Bedrock playground.

  1. Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with any Amazon Bedrock APIs, you must have the endpoint Amazon Resource Name (ARN).

Enable reasoning and non-reasoning responses with Converse API

The following code shows how to turn reasoning on and off with Qwen3 models using the Converse API, depending on your use case. By default, reasoning is left on for Qwen3 models, but you can streamline interactions by using the /no_think command within your prompt. When you add this to the end of your query, reasoning is turned off and the models will provide just the direct answer. This is particularly useful when you need quick information without explanations, are familiar with the topic, or want to maintain a faster conversational flow. At the time of writing, the Converse API doesn’t support tool use for Qwen3 models. Refer to the Invoke_Model API example later in this post to learn how to use reasoning and tools in the same completion.

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

# Configuration
model_id = ""  # Replace with Bedrock Marketplace endpoint arn

# Start a conversation with the user message.
user_message = "hello, what is 1+1 /no_think" #remove /no_think to leave default reasoning on
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model, using a basic inference configuration.
    response = client.converse(
        modelId=model_id,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text.
    #response_text = response["output"]["message"]["content"][0]["text"]
    #reasoning_content = response ["output"]["message"]["reasoning_content"][0]["text"]
    #print(response_text, reasoning_content)
    print(response)
    
except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)

The following is a response using the Converse API, without default thinking:

{'ResponseMetadata': {'RequestId': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:34:47 GMT', 'content-type': 'application/json', 'content-length': '282', 'connection': 'keep-alive', 'x-amzn-requestid': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The result of 1 + 1 is **2**. 😊'}, {'reasoningContent': {'reasoningText': {'text': '\n\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 20, 'outputTokens': 22, 'totalTokens': 42}, 'metrics': {'latencyMs': 1125}}

The following is an example with default thinking on; the <think> tokens are automatically parsed into the reasoningContent field for the Converse API:

{'ResponseMetadata': {'RequestId': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:32:28 GMT', 'content-type': 'application/json', 'content-length': '1019', 'connection': 'keep-alive', 'x-amzn-requestid': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The sum of 1 + 1 is **2**. Let me know if you have any other questions or need further clarification! 😊'}, {'reasoningContent': {'reasoningText': {'text': '\nOkay, the user asked "hello, what is 1+1". Let me start by acknowledging their greeting. They might just be testing the water or actually need help with a basic math problem. Since it's 1+1, it's a very simple question, but I should make sure to answer clearly. Maybe they're a child learning math for the first time, or someone who's not confident in their math skills. I should provide the answer in a friendly and encouraging way. Let me confirm that 1+1 equals 2, and maybe add a brief explanation to reinforce their understanding. I can also offer further assistance in case they have more questions. Keeping it conversational and approachable is key here.\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 16, 'outputTokens': 182, 'totalTokens': 198}, 'metrics': {'latencyMs': 7805}}

Perform reasoning and function calls in the same completion using the Invoke_Model API

With Qwen3, you can stream an explicit reasoning trace and the exact JSON tool call in the same completion. Until now, reasoning models have forced a choice between showing the chain of thought and calling tools deterministically. The following code shows an example:

import json  # the bedrock-runtime client and model_id are reused from the previous Converse API example

body = json.dumps( {
    "messages": [
        {
            "role": "user",
            "content": "Hi! How are you doing today?"
        }, 
        {
            "role": "assistant",
            "content": "I'm doing well! How can I help you?"
        }, 
        {
            "role": "user",
            "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
        }
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type":
                            "string",
                        "description":
                            "The city to find the weather for, e.g. 'San Francisco'"
                    },
                    "state": {
                        "type":
                            "string",
                        "description":
                            "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                    },
                    "unit": {
                        "type": "string",
                        "description":
                            "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city", "state", "unit"]
            }
        }
    }],
    "tool_choice": "auto"
})

response = client.invoke_model(
    modelId=model_id, 
    body=body
)
print(response)
model_output = json.loads(response['body'].read())
print(json.dumps(model_output, indent=2))

Response:

{'ResponseMetadata': {'RequestId': '5da8365d-f4bf-411d-a783-d85eb3966542', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:57:38 GMT', 'content-type': 'application/json', 'content-length': '1148', 'connection': 'keep-alive', 'x-amzn-requestid': '5da8365d-f4bf-411d-a783-d85eb3966542', 'x-amzn-bedrock-invocation-latency': '6396', 'x-amzn-bedrock-output-token-count': '148', 'x-amzn-bedrock-input-token-count': '198'}, 'RetryAttempts': 0}, 'contentType': 'application/json', 'body': <botocore.response.StreamingBody object at 0x7f7d4a598dc0>}
{
  "id": "chatcmpl-bc60b482436542978d233b13dc347634",
  "object": "chat.completion",
  "created": 1750186651,
  "model": "lmi",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "nOkay, the user is asking about the weather in San Francisco. Let me check the tools available. There's a get_weather function that requires location and unit. The user didn't specify the unit, so I should ask them if they want Celsius or Fahrenheit. Alternatively, maybe I can assume a default, but since the function requires it, I need to include it. I'll have to prompt the user for the unit they prefer.n",
        "content": "nnThe user hasn't specified whether they want the temperature in Celsius or Fahrenheit. I need to ask them to clarify which unit they prefer.nn",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-fb2f93f691ed4d8ba94cadc52b57414e",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{"location": "San Francisco, CA", "unit": "celsius"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 198,
    "total_tokens": 346,
    "completion_tokens": 148,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Deploy Qwen3-32B with SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub with FMs, built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can customize pre-trained models to your use case, with your data, and deploy them into production using either the UI or SDK.

Deploying the Qwen3-32B model through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.

Deploy Qwen3-32B through SageMaker JumpStart UI

Complete the following steps to deploy Qwen3-32B using SageMaker JumpStart:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. First-time users will be prompted to create a domain.
  3. On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

The SageMaker Studio Public Hub interface displays a grid of AI model providers, including Meta, DeepSeek, HuggingFace, and AWS, each showing their model counts and Bedrock integration status. The page includes a navigation sidebar and search functionality.

  1. Search for Qwen3 to view the Qwen3-32B model card.

Each model card shows key information, including:

  • Model name
  • Provider name
  • Task category (for example, Text Generation)
  • Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

The SageMaker interface shows search results for "qwen3" displaying four text generation models from Qwen, each marked as Bedrock ready. Models range from 0.6B to 32B in size with consistent formatting and capabilities.

  1. Choose the model card to view the model details page.

The model details page includes the following information:

  • The model name and provider information
  • A Deploy button to deploy the model
  • About and Notebooks tabs with detailed information

The About tab includes important details, such as:

  • Model description
  • License information
  • Technical specifications
  • Usage guidelines

Screenshot of the SageMaker Studio interface displaying details about the Qwen3 32B language model, including its main features, capabilities, and deployment options. The interface shows tabs for About and Notebooks, with action buttons for Train, Deploy, Optimize, and Evaluate.

Before you deploy the model, it’s recommended to review the model details and license terms to confirm compatibility with your use case.

  1. Choose Deploy to proceed with deployment.
  2. For Endpoint name, use the automatically generated name or create a custom one.
  3. For Instance type, choose an instance type (default: ml.g6.12xlarge).
  4. For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

  1. Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
  2. Choose Deploy to deploy the model.

A deployment configuration screen in SageMaker Studio showing endpoint settings, instance type selection, and real-time inference options. The interface includes fields for endpoint name, instance type (ml.g5.12xlarge), and initial instance count.

The deployment process can take several minutes to complete.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
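For example, the following minimal sketch invokes the endpoint with the SageMaker runtime client. The endpoint name is a placeholder, and the request shape mirrors the sample input used in the SDK example that follows.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Request shape matching the sample_input used later with the SageMaker Python SDK
payload = {
    "inputs": "Briefly explain what a mixture-of-experts model is.",
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6},
}

response = runtime.invoke_endpoint(
    EndpointName="jumpstart-dft-qwen3-32b",   # placeholder: use your endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result)  # typically contains the generated text, for example [{"generated_text": "..."}]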

Deploy Qwen3-32B using the SageMaker Python SDK

To get started with Qwen3-32B using the SageMaker Python SDK, you must install the SageMaker Python SDK and make sure you have the necessary AWS permissions and environment set up. The following is a step-by-step code example that demonstrates how to deploy and use Qwen3-32B for inference programmatically:

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

from sagemaker.serve.builder.model_builder import ModelBuilder 
from sagemaker.serve.builder.schema_builder import SchemaBuilder 
from sagemaker.jumpstart.model import ModelAccessConfig 
from sagemaker.session import Session 
import logging 

sagemaker_session = Session()
artifacts_bucket_name = sagemaker_session.default_bucket() 
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# Changed to Qwen32B model
js_model_id = "huggingface-reasoning-qwen3-32b"
gpu_instance_type = "ml.g5.12xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {
        "max_new_tokens": 128, 
        "top_p": 0.9, 
        "temperature": 0.6
    }
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder( 
    model=js_model_id, 
    schema_builder=schema_builder, 
    sagemaker_session=sagemaker_session, 
    role_arn=execution_role_arn, 
    log_level=logging.ERROR 
) 

model = model_builder.build() 

predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)}, 
    accept_eula=True
) 

predictor.predict(sample_input)

You can run additional requests against the predictor:

new_input = {
    "inputs": "What is Amazon doing in Generative AI?",
    "parameters": {"max_new_tokens": 64, "top_p": 0.8, "temperature": 0.7},
}

prediction = predictor.predict(new_input)
print(prediction)

The following example adds error handling and retry logic to the deployment code as a best practice:

# Enhanced deployment code with error handling
import backoff
import botocore
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@backoff.on_exception(backoff.expo, 
                     (botocore.exceptions.ClientError,),
                     max_tries=3)
def deploy_model_with_retries(model_builder, model_id):
    try:
        model = model_builder.build()
        predictor = model.deploy(
            model_access_configs={model_id:ModelAccessConfig(accept_eula=True)},
            accept_eula=True
        )
        return predictor
    except Exception as e:
        logger.error(f"Deployment failed: {str(e)}")
        raise

def safe_predict(predictor, input_data):
    try:
        return predictor.predict(input_data)
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        return None
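The following usage sketch shows how these helpers fit together, assuming the model_builder and js_model_id objects defined earlier; note that backoff is a third-party package you install separately (for example, pip install backoff).

# Example usage of the helpers above
predictor = deploy_model_with_retries(model_builder, js_model_id)

result = safe_predict(predictor, sample_input)
if result is not None:
    print(result)
else:
    logger.warning("Prediction returned no result; check the endpoint logs in Amazon CloudWatch.")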

Clean up

To avoid unwanted charges, complete the steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Select the endpoint, and on the Actions menu, choose Delete.
  4. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  5. Choose Delete to delete the endpoint.
  6. In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor

The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we explored how you can access and deploy the Qwen3 models using Amazon Bedrock Marketplace and SageMaker JumpStart. With support for both the full-parameter models and their distilled versions, you can choose the optimal model size for your specific use case. Visit SageMaker JumpStart in Amazon SageMaker Studio or Amazon Bedrock Marketplace to get started. For more information, refer to Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models, SageMaker JumpStart pretrained models, Amazon SageMaker JumpStart Foundation Models, Amazon Bedrock Marketplace, and Getting started with Amazon SageMaker JumpStart.

The Qwen3 family of LLMs offers exceptional versatility and performance, making it a valuable addition to the AWS foundation model offerings. Whether you’re building applications for content generation, analysis, or complex reasoning tasks, Qwen3’s advanced architecture and extensive context window make it a powerful choice for your generative AI needs.


About the authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.

Mohhid Kidwai is a Solutions Architect at AWS. His area of focus is generative AI and machine learning solutions for small-medium businesses. He holds a bachelor’s degree in Computer Science with a minor in Biological Science from North Carolina State University. Mohhid is currently working with the SMB Engaged East Team at AWS.

Yousuf Athar is a Solutions Architect at AWS specializing in generative AI and AI/ML. With a Bachelor’s degree in Information Technology and a concentration in Cloud Computing, he helps customers integrate advanced generative AI capabilities into their systems, driving innovation and competitive edge. Outside of work, Yousuf loves to travel, watch sports, and play football.

John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.

Rohit Talluri is a Generative AI GTM Specialist at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Read More

Build a just-in-time knowledge base with Amazon Bedrock

Build a just-in-time knowledge base with Amazon Bedrock

Software as a service (SaaS) companies managing multiple tenants face a critical challenge: efficiently extracting meaningful insights from vast document collections while controlling costs. Traditional approaches often lead to unnecessary spending on unused storage and processing resources, impacting both operational efficiency and profitability. Organizations need solutions that intelligently scale processing and storage resources based on actual tenant usage patterns while maintaining data isolation. Traditional Retrieval Augmented Generation (RAG) systems consume valuable resources by ingesting and maintaining embeddings for documents that might never be queried, resulting in unnecessary storage costs and reduced system efficiency. Systems designed to handle large numbers of small to mid-sized tenants can exceed cost structure and infrastructure limits or might need to use silo-style deployments to keep each tenant’s information and usage separate. Adding to this complexity, many projects are transitory in nature, with work being completed on an intermittent basis, leading to data occupying space in knowledge base systems that could be used by other active tenants.

To address these challenges, this post presents a just-in-time knowledge base solution that reduces unused consumption through intelligent document processing. The solution processes documents only when needed and automatically removes unused resources, so organizations can scale their document repositories without proportionally increasing infrastructure costs.

With a multi-tenant architecture and configurable limits per tenant, service providers can offer tiered pricing models while maintaining strict data isolation, making it ideal for SaaS applications serving multiple clients with varying needs. Automatic document expiration through Time-to-Live (TTL) makes sure the system remains lean and focused on relevant content, while refreshing the TTL for frequently accessed documents maintains optimal performance for information that matters. This architecture also makes it possible to limit the number of files each tenant can ingest at a specific time and the rate at which tenants can query a set of files.

This solution uses serverless technologies to alleviate operational overhead and provide automatic scaling, so teams can focus on business logic rather than infrastructure management. By organizing documents into groups with metadata-based filtering, the system enables contextual querying that delivers more relevant results while maintaining security boundaries between tenants.

The architecture’s flexibility supports customization of tenant configurations, query rates, and document retention policies, making it adaptable to evolving business requirements without significant rearchitecting.

Solution overview

This architecture combines several AWS services to create a cost-effective, multi-tenant knowledge base solution that processes documents on demand. The key components include:

  • Vector-based knowledge base – Uses Amazon Bedrock and Amazon OpenSearch Serverless for efficient document processing and querying
  • On-demand document ingestion – Implements just-in-time processing using the Amazon Bedrock CUSTOM data source type
  • TTL management – Provides automatic cleanup of unused documents using the TTL feature in Amazon DynamoDB
  • Multi-tenant isolation – Enforces secure data separation between users and organizations with configurable resource limits

The solution enables granular control through metadata-based filtering at the user, tenant, and file level. The DynamoDB TTL tracking system supports tiered pricing structures, where tenants can pay for different TTL durations, document ingestion limits, and query rates.

The following diagram illustrates the key components and workflow of the solution.

Multi-tier AWS serverless architecture diagram showcasing data flow and integration of various AWS services

The workflow consists of the following steps:

  1. The user logs in to the system, which attaches a tenant ID to the current user for calls to the Amazon Bedrock knowledge base. This authentication step is crucial because it establishes the security context and makes sure subsequent interactions are properly associated with the correct tenant. The tenant ID becomes the foundational piece of metadata that enables proper multi-tenant isolation and resource management throughout the entire workflow.
  2. After authentication, the user creates a project that will serve as a container for the files they want to query. This project creation step establishes the organizational structure needed to manage related documents together. The system generates appropriate metadata and creates the necessary database entries to track the project’s association with the specific tenant, enabling proper access control and resource management at the project level.
  3. With a project established, the user can begin uploading files. The system manages this process by generating pre-signed URLs for secure file upload. As files are uploaded, they are stored in Amazon Simple Storage Service (Amazon S3), and the system automatically creates entries in DynamoDB that associate each file with both the project and the tenant. This three-way relationship (file-project-tenant) is essential for maintaining proper data isolation and enabling efficient querying later.
  4. When a user requests to create a chat with a knowledge base for a specific project, the system begins ingesting the project files using the CUSTOM data source. This is where the just-in-time processing begins. During ingestion, the system applies a TTL value based on the tenant’s tier-specific TTL interval. The TTL makes sure project files remain available during the chat session while setting up the framework for automatic cleanup later. This step represents the core of the on-demand processing strategy, because files are only processed when they are needed.
  5. Each chat session actively updates the TTL for the project files being used. This dynamic TTL management makes sure frequently accessed files remain in the knowledge base while allowing rarely used files to expire naturally. The system continually refreshes the TTL values based on actual usage, creating an efficient balance between resource availability and cost optimization. This approach maintains optimal performance for actively used content while helping to prevent resource waste on unused documents.
  6. After the chat session ends and the TTL value expires, the system automatically removes files from the knowledge base. This cleanup process is triggered by Amazon DynamoDB Streams monitoring TTL expiration events, which activate an AWS Lambda function to remove the expired documents. This final step reduces the load on the underlying OpenSearch Serverless cluster and optimizes system resources, making sure the knowledge base remains lean and efficient.

Prerequisites

You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 AWS Region.

Deploy the solution

Complete the following steps to deploy the solution:

  1. Download the AWS CDK project from the GitHub repo.
  2. Install the project dependencies:
    npm run install:all

  3. Deploy the solution:
    npm run deploy

  4. Create a user and log in to the system after validating your email.

Validate the knowledge base and run a query

Before allowing users to chat with their documents, the system performs the following steps:

  • Performs a validation check to determine if documents need to be ingested. This process happens transparently to the user and includes checking document status in DynamoDB and the knowledge base.
  • Validates that the required documents are successfully ingested and properly indexed before allowing queries.
  • Returns both the AI-generated answers and relevant citations to source documents, maintaining traceability and empowering users to verify the accuracy of responses.

The following screenshot illustrates an example of chatting with the documents.

AWS Just In Time Knowledge Base interface displaying project files and AI-powered question-answering feature

Looking at the following example method for file ingestion, note how file information is stored in DynamoDB with a TTL value for automatic expiration. The ingest knowledge base documents call includes essential metadata (user ID, tenant ID, and project), enabling precise filtering of this tenant’s files in subsequent operations.

# Ingesting files with tenant-specific TTL values
def ingest_files(user_id, tenant_id, project_id, files):
    # Get tenant configuration and calculate TTL
    tenants = json.loads(os.environ.get('TENANTS'))['Tenants']
    tenant = find_tenant(tenant_id, tenants)
    ttl = int(time.time()) + (int(tenant['FilesTTLHours']) * 3600)
    
    # For each file, create a record with TTL and start ingestion
    for file in files:
        file_id = file['id']
        s3_key = file.get('s3Key')
        bucket = file.get('bucket')
        
        # Create a record in the knowledge base files table with TTL
        knowledge_base_files_table.put_item(
            Item={
                'id': file_id,
                'userId': user_id,
                'tenantId': tenant_id,
                'projectId': project_id,
                'documentStatus': 'ready',
                'createdAt': int(time.time()),
                'ttl': ttl  # TTL value for automatic expiration
            }
        )
        
        # Start the ingestion job with tenant, user, and project metadata for filtering
        bedrock_agent.ingest_knowledge_base_documents(
            knowledgeBaseId=KNOWLEDGE_BASE_ID,
            dataSourceId=DATA_SOURCE_ID,
            clientToken=str(uuid.uuid4()),
            documents=[
                {
                    'content': {
                        'dataSourceType': 'CUSTOM',
                        'custom': {
                            'customDocumentIdentifier': {
                                'id': file_id
                            },
                            's3Location': {
                                'uri': f"s3://{bucket}/{s3_key}"
                            },
                            'sourceType': 'S3_LOCATION'
                        }
                    },
                    'metadata': {
                        'type': 'IN_LINE_ATTRIBUTE',
                        'inlineAttributes': [
                            {'key': 'userId', 'value': {'stringValue': user_id, 'type': 'STRING'}},
                            {'key': 'tenantId', 'value': {'stringValue': tenant_id, 'type': 'STRING'}},
                            {'key': 'projectId', 'value': {'stringValue': project_id, 'type': 'STRING'}},
                            {'key': 'fileId', 'value': {'stringValue': file_id, 'type': 'STRING'}}
                        ]
                    }
                }
            ]
        )

During a query, you can use the associated metadata to construct parameters that make sure you only retrieve files belonging to this specific tenant. For example:

    filter_expression = {
        "andAll": [
            {
                "equals": {
                    "key": "tenantId",
                    "value": tenant_id
                }
            },
            {
                "equals": {
                    "key": "projectId",
                    "value": project_id
                }
            },
            {
                "in": {
                    "key": "fileId",
                    "value": file_ids
                }
            }
        ]
    }

    # Create base parameters for the API call
    retrieve_params = {
        'input': {
            'text': query
        },
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': knowledge_base_id,
                'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0',
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': limit,
                        'filter': filter_expression
                    }
                }
            }
        }
    }
    response = bedrock_agent_runtime.retrieve_and_generate(**retrieve_params)

Manage the document lifecycle with TTL

To further optimize resource usage and costs, you can implement an intelligent document lifecycle management system using the DynamoDB Time-to-Live (TTL) feature. This consists of the following steps:

  1. When a document is ingested into the knowledge base, a record is created with a configurable TTL value.
  2. This TTL is refreshed when the document is accessed. A minimal sketch of this refresh follows the cleanup code below.
  3. DynamoDB Streams with specific filters for TTL expiration events is used to trigger a cleanup Lambda function.
  4. The Lambda function removes expired documents from the knowledge base.

See the following code:

# Lambda function triggered by DynamoDB Streams when items expire through TTL
def lambda_handler(event, context):
    """
    This function is triggered by DynamoDB Streams when items expire through TTL.
    It removes expired documents from the knowledge base.
    """
    
    # Process each record in the event
    for record in event.get('Records', []):
        # Check if this is a TTL expiration event (REMOVE event from DynamoDB Stream)
        if record.get('eventName') == 'REMOVE':
            # Check if this is a TTL expiration
            user_identity = record.get('userIdentity', {})
            if user_identity.get('type') == 'Service' and user_identity.get('principalId') == 'dynamodb.amazonaws.com':
                # Extract the file ID from the record keys
                keys = record.get('dynamodb', {}).get('Keys', {})
                file_id = keys.get('id', {}).get('S')
                
                # Delete the document from the knowledge base
                bedrock_agent.delete_knowledge_base_documents(
                    clientToken=str(uuid.uuid4()),
                    knowledgeBaseId=knowledge_base_id,
                    dataSourceId=data_source_id,
                    documentIdentifiers=[
                        {
                            'custom': {
                                'id': file_id
                            },
                            'dataSourceType': 'CUSTOM'
                        }
                    ]
                )
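The TTL refresh described in step 2 isn’t shown in the preceding excerpt. The following is a minimal sketch of how it could be implemented with a DynamoDB update, assuming the same knowledge_base_files_table resource and FilesTTLHours tenant setting used in the ingestion code; the helper name is illustrative.

import time

def refresh_file_ttl(file_id, ttl_hours):
    # Extend the file's TTL whenever it is accessed during a chat session (illustrative sketch)
    new_ttl = int(time.time()) + int(ttl_hours) * 3600
    knowledge_base_files_table.update_item(
        Key={'id': file_id},
        UpdateExpression='SET #ttl = :ttl',
        ExpressionAttributeNames={'#ttl': 'ttl'},
        ExpressionAttributeValues={':ttl': new_ttl}
    )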

Multi-tenant isolation with tiered service levels

Our architecture enables sophisticated multi-tenant isolation with tiered service levels:

  • Tenant-specific document filtering – Each query includes user, tenant, and file-specific filters, allowing the system to reduce the number of documents being queried.
  • Configurable TTL values – Different tenant tiers can have different TTL configurations (see the example configuration after this list). For example:
    • Free tier: Five documents ingested with a 7-day TTL and five queries per minute.
    • Standard tier: 100 documents ingested with a 30-day TTL and 10 queries per minute.
    • Premium tier: 1,000 documents ingested with a 90-day TTL and 50 queries per minute.
    • You can configure additional limits, such as total queries per month or total ingested files per day or month.
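The exact tier configuration format depends on your deployment. Based on the TENANTS environment variable parsed in the ingestion code shown earlier, a tier definition could look something like the following; the Id, MaxFiles, and QueriesPerMinute keys are illustrative placeholders rather than fields confirmed by the repository.

{
  "Tenants": [
    { "Id": "free", "FilesTTLHours": 168, "MaxFiles": 5, "QueriesPerMinute": 5 },
    { "Id": "standard", "FilesTTLHours": 720, "MaxFiles": 100, "QueriesPerMinute": 10 },
    { "Id": "premium", "FilesTTLHours": 2160, "MaxFiles": 1000, "QueriesPerMinute": 50 }
  ]
}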

Clean up

To clean up the resources created in this post, run the following command from the same location where you performed the deploy step:

npm run destroy

Conclusion

The just-in-time knowledge base architecture presented in this post transforms document management across multiple tenants by processing documents only when queried, reducing the unused consumption of traditional RAG systems. This serverless implementation uses Amazon Bedrock, OpenSearch Serverless, and the DynamoDB TTL feature to create a lean system with intelligent document lifecycle management, configurable tenant limits, and strict data isolation, which is essential for SaaS providers offering tiered pricing models.

This solution directly addresses cost structure and infrastructure limitations of traditional systems, particularly for deployments handling numerous small to mid-sized tenants with transitory projects. This architecture combines on-demand document processing with automated lifecycle management, delivering a cost-effective, scalable resource that empowers organizations to focus on extracting insights rather than managing infrastructure, while maintaining security boundaries between tenants.

Ready to implement this architecture? The full sample code is available in the GitHub repository.


About the author

Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.

Read More

Agents as escalators: Real-time AI video monitoring with Amazon Bedrock Agents and video streams

Agents as escalators: Real-time AI video monitoring with Amazon Bedrock Agents and video streams

Organizations deploying video monitoring systems face a critical challenge: processing continuous video streams while maintaining accurate situational awareness. Traditional monitoring approaches that use rule-based detection or basic computer vision frequently miss important events or generate excessive false positives, leading to operational inefficiencies and alert fatigue.

In this post, we show how to build a fully deployable solution that processes video streams with OpenCV, uses Amazon Bedrock for contextual scene understanding, and automates responses through Amazon Bedrock Agents. This solution extends the capabilities demonstrated in Automate chatbot for document and data retrieval using Amazon Bedrock Agents and Knowledge Bases, which discussed using Amazon Bedrock Agents for document and data retrieval. Here, we apply Amazon Bedrock Agents to real-time video analysis and event monitoring.

Benefits of using Amazon Bedrock Agents for video monitoring

The following figure shows example video stream inputs from different monitoring scenarios. With contextual scene understanding, users can search for specific events.

A front door camera captures many events throughout the day, but some are more interesting than others. Knowing whether a package is being delivered or removed (as in the following package example) limits alerts to urgent events.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. Using Amazon Bedrock, you can build secure, responsible generative AI applications. Amazon Bedrock Agents extends these capabilities by enabling applications to execute multi-step tasks across systems and data sources, making it ideal for complex monitoring scenarios. The solution processes video streams through these key steps:

  1. Extract frames when motion is detected from live video streams or local files.
  2. Analyze context using multimodal FMs.
  3. Make decisions using agent-based logic with configurable responses.
  4. Maintain searchable semantic memory of events.

You can build this intelligent video monitoring system using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases in an automated solution. The complete code is available in the GitHub repo.

Limitations of current video monitoring systems

Organizations deploying video monitoring systems face a fundamental dilemma. Despite advances in camera technology and storage capabilities, the intelligence layer interpreting video feeds often remains rudimentary. This creates a challenging situation where security teams must make significant trade-offs in their monitoring approach. Current video monitoring solutions typically force organizations to choose between the following:

  • Simple rules that scale but generate excessive false positives
  • Complex rules that require ongoing maintenance and customization
  • Manual monitoring that relies on human attention and doesn’t scale
  • Point solutions that only handle specific scenarios but lack flexibility

These trade-offs create fundamental barriers to effective video monitoring that impact security, safety, and operational efficiency across industries. Based on our work with customers, we’ve identified three critical challenges that emerge from these limitations:

  • Alert fatigue – Traditional motion detection and object recognition systems generate alerts for any detected change or recognized object. Security teams quickly become overwhelmed by the volume of notifications for normal activities. This leads to reduced attention when genuinely critical events occur, diminishing security effectiveness and increasing operational costs from constant human verification of false alarms.
  • Limited contextual understanding – Rule-based systems fundamentally struggle with nuanced scene interpretation. Even sophisticated traditional systems operate with limited understanding of the environments they monitor due to a lack of contextual awareness, because they can’t easily do the following:
    • Distinguish normal from suspicious behavior
    • Understand temporal patterns like recurring weekly events
    • Consider environmental context such as time of day or location
    • Correlate multiple events that might indicate a pattern
  • Lack of semantic memory – Conventional systems lack the ability to build and use knowledge over time. They can’t do the following:
    • Establish baselines of routine versus unusual events
    • Offer natural language search capabilities across historical data
    • Support reasoning about emerging patterns

Without these capabilities, you can’t gain cumulative benefits from your monitoring infrastructure or perform sophisticated retrospective analysis. To address these challenges effectively, you need a fundamentally different approach. By combining the contextual understanding capabilities of FMs with a structured framework for event classification and response, you can build more intelligent monitoring systems. Amazon Bedrock Agents provides the ideal platform for this next-generation approach.

Solution overview

You can address these monitoring challenges by building a video monitoring solution with Amazon Bedrock Agents. The system intelligently screens events, filters routine activity, and escalates situations requiring human attention, helping reduce alert fatigue while improving detection accuracy. The solution uses Amazon Bedrock Agents to analyze detected motion from video, and alerts users when an event of interest happens according to the provided instructions. This allows the system to intelligently filter out trivial events that can trigger motion detection, such as wind or birds, and direct the user’s attention only to events of interest. The following diagram illustrates the solution architecture.

The solution uses three primary components to address the core challenges: agents as escalators, a video processing pipeline, and Amazon Bedrock Agents. We discuss these components in more detail in the following sections.

The solution uses the AWS Cloud Development Kit (AWS CDK) to deploy the solution components. The AWS CDK is an open source software development framework for defining cloud infrastructure as code and provisioning it through AWS CloudFormation.

Agents as escalators

The first component uses Amazon Bedrock Agents to examine detected motion events with the following capabilities:

  • Provides natural language understanding of scenes and activities for contextual interpretation
  • Maintains temporal awareness across frame sequences to understand event progression
  • References historical patterns to distinguish unusual from routine events
  • Applies contextual reasoning about behaviors, considering factors like time of day, location, and action sequences

We implement a graduated response framework that categorizes events by severity and required action:

  • Level 0: Log only – The system logs normal or expected activities. For example, when a delivery person arrives during business hours or a recognized vehicle enters the driveway, these events are documented for pattern analysis and future reference but require no immediate action. They remain searchable in the event history.
  • Level 1: Human notification – This level handles unusual but non-critical events that warrant human attention. An unrecognized vehicle parked nearby, an unexpected visitor, or unusual movement patterns trigger a notification to security personnel. These events require human verification and assessment.
  • Level 2: Immediate response – Reserved for critical security events. Unauthorized access attempts, detection of smoke or fire, or suspicious behavior trigger automatic response actions through API calls. The system notifies personnel through SMS or email alerts with event information and context.

The solution provides an interactive processing and monitoring interface through a Streamlit application. With the Streamlit UI, users can provide instructions and interact with the agent.

The application consists of the following key features:

  • Live stream or video file input – The application accepts M3U8 stream URLs from webcams or security feeds, or local video files in common formats (MP4, AVI). Both are processed using the same motion detection pipeline that saves triggered events to Amazon Simple Storage Service (Amazon S3) for agent analysis.
  • Custom instructions – Users can provide specific monitoring guidance, such as “Alert me about unknown individuals near the loading dock after hours” or “Focus on vehicle activity in the parking area.” These instructions adjust how the agent interprets detected motion events.
  • Notification configuration – Users can specify contact information for different alert levels. The system uses Amazon Simple Notification Service (Amazon SNS) to send emails or text messages based on event severity, so different personnel can be notified for potential issues vs. critical situations. A minimal sketch of such a publish call follows this list.
  • Natural language queries about past events – The interface includes a chat component for historical event retrieval. Users can ask “What vehicles have been in the driveway this week?” or “Show me any suspicious activity from last night,” receiving responses based on the system’s event memory.
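As a minimal sketch of how such a notification could be published (the topic ARN, subject, and message here are placeholders, not values from the repository):

```
import boto3

sns = boto3.client("sns")

# Placeholder topic ARN created for Level 1 (human notification) alerts
ALERT_TOPIC_ARN = "arn:aws:sns:us-west-2:123456789012:video-monitoring-alerts"

def send_alert(subject: str, message: str) -> None:
    """Publish an alert so subscribed email addresses or phone numbers are notified."""
    sns.publish(TopicArn=ALERT_TOPIC_ARN, Subject=subject, Message=message)
```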

Video processing pipeline

The solution uses several AWS services to capture and prepare video data for analysis through a modular processing pipeline. The solution supports two types of video sources: live streams (such as M3U8 URLs from webcams or security feeds) and local video files.

When using streams, OpenCV’s VideoCapture component handles the connection and frame extraction. For testing, we’ve included sample event videos demonstrating different scenarios. The core of the video processing is a modular pipeline implemented in Python. Key components include:

  • SimpleMotionDetection – Identifies movement in the video feed
  • FrameSampling – Captures sequences of frames over time when motion is detected
  • GridAggregator – Organizes multiple frames into a visual grid for context
  • S3Storage – Stores captured frame sequences in Amazon S3

This multi-process framework optimizes performance by running components concurrently and maintaining a queue of frames to process. The video processing pipeline organizes captured frame data in a structured way before passing it to the Amazon Bedrock agent for analysis:

  • Frame sequence storage – When motion is detected, the system captures a sequence of frames over 10 seconds. These frames are stored in Amazon S3 using a timestamp-based path structure (YYYYMMDD-HHMMSS) that allows for efficient retrieval by date and time. In cases where motion exceeds 10 seconds, multiple events are created.
  • Image grid format – Rather than processing individual frames separately, the system arranges multiple sequential frames into a grid format (typically 3×4 or 4×5). This presentation provides temporal context and is sent to the Amazon Bedrock agent for analysis. The grid format enables understanding of how motion progresses over time, which is critical for accurate scene interpretation. A rough sketch of this tiling follows this list.
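The repository’s GridAggregator handles this arrangement. As a rough illustration of the idea only (not the repository implementation), sequential frames of equal size can be tiled into a single image as follows:

```
import numpy as np

def tile_frames(frames, cols=4):
    """Tile equally sized OpenCV frames into one grid image, padding the last row with blanks."""
    rows = -(-len(frames) // cols)  # ceiling division
    blank = np.zeros_like(frames[0])
    padded = frames + [blank] * (rows * cols - len(frames))
    return np.vstack([np.hstack(padded[r * cols:(r + 1) * cols]) for r in range(rows)])
```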

The following figure is an example of an image grid sent to the agent. Package theft is difficult to identify with classic image models. The large language model’s (LLM’s) ability to reason over a sequence of images allows it to make observations about intent.

The video processing pipeline’s output—timestamped frame grids stored in Amazon S3—serves as the input for the Amazon Bedrock agent components, which we discuss in the next section.

Amazon Bedrock agent components

The solution integrates multiple Amazon Bedrock services to create an intelligent analysis system:

  • Core agent architecture – The agent orchestrates these key workflows:
    • Receives frame grids from Amazon S3 on motion detection
    • Coordinates multi-step analysis processes
    • Makes classification decisions
    • Triggers appropriate response actions
    • Maintains event context and state
  • Knowledge management – The solution uses Amazon Bedrock Knowledge Bases with Amazon OpenSearch Serverless to:
    • Store and index historical events
    • Build baseline activity patterns
    • Enable natural language querying
    • Track temporal patterns
    • Support contextual analysis
  • Action groups – The agent has access to several actions defined through API schemas:
    • Analyze grid – Process incoming frame grids from Amazon S3
    • Alert – Send notifications through Amazon SNS based on severity
    • Log – Record event details for future reference
    • Search events by date – Retrieve past events based on a date range
    • Look up vehicle (Text-to-SQL) – Query the vehicle database for information

For structured data queries, the system uses the FM’s ability to convert natural language to SQL. This enables the following:

  • Querying Amazon Athena tables containing event records
  • Retrieving information about registered vehicles
  • Generating reports from structured event data

These components work together to create a comprehensive system that can analyze video content, maintain event history, and support both real-time alerting and retrospective analysis through natural language interaction.

Video processing framework

The video processing framework implements a multi-process architecture for handling video streams through composable processing chains.

Modular pipeline architecture

The framework uses a composition-based approach built around the FrameProcessor abstract base class.

Processing components implement a consistent interface with a process(frame) method that takes a Frame and returns a potentially modified Frame:

```
class FrameProcessor(ABC):
    @abstractmethod
    def process(self, frame: Frame) -> Optional[Frame]: ...
```

The Frame class encapsulates the image data along with timestamps, indexes, and extensible metadata:

```
@dataclass
class Frame:
    buffer: ndarray  # OpenCV image array
    timestamp: float
    index: float
    fps: float
    metadata: dict = field(default_factory=dict)
```

Customizable processing chains

The architecture supports configuring multiple processing chains that can be connected in sequence. The solution uses two primary chains. The detection and analysis chain processes incoming video frames to identify events of interest:

```
chain = FrameProcessorChain([
    SimpleMotionDetection(motion_threshold=10_000, frame_skip_size=1),
    FrameSampling(timedelta(milliseconds=250), threshold_time=timedelta(seconds=2)),
    GridAggregator(shape=(13, 3))
])
```

The storage and notification chain handles the storage of identified events and invocation of the agent:

```
storage_chain = FrameProcessorChain([
    S3Storage(bucket_name=TARGET_S3_BUCKET, prefix=S3_PREFIX, s3_client_provider=s3_client_provider),
    LambdaProcessor(get_response=get_response, monitoring_instructions=config.monitoring_instructions)
])
```

You can modify these chains independently to add or replace components based on specific monitoring requirements.

Component implementation

The solution includes several processing components that demonstrate the framework’s capabilities. You can modify each processing step or add new ones. For example, for simple motion detection, we use a simple pixel difference, but you can refine the motion detection functionality as needed, or follow the format to implement other detection algorithms, such as object detection or scene segmentation.
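As a rough sketch of what a pixel-difference detector could look like under the FrameProcessor and Frame interfaces shown earlier (this is illustrative, not the repository’s SimpleMotionDetection implementation):

```
import cv2
import numpy as np
from typing import Optional

class PixelDiffMotionDetection(FrameProcessor):
    """Forward a frame only when it differs enough from the previous frame."""

    def __init__(self, motion_threshold: int = 10_000):
        self.motion_threshold = motion_threshold
        self.previous_gray = None

    def process(self, frame: Frame) -> Optional[Frame]:
        gray = cv2.cvtColor(frame.buffer, cv2.COLOR_BGR2GRAY)
        if self.previous_gray is None:
            self.previous_gray = gray
            return None  # nothing to compare against yet
        diff = cv2.absdiff(gray, self.previous_gray)
        self.previous_gray = gray
        changed_pixels = int(np.count_nonzero(diff > 25))
        if changed_pixels < self.motion_threshold:
            return None  # not enough change; drop the frame
        frame.metadata["changed_pixels"] = changed_pixels
        return frame
```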

Additional components include the FrameSampling processor to control capture timing, the GridAggregator to create visual frame grids, and storage processors that save event data and trigger agent analysis. These can be customized or replaced as needed. For example:

  • Modify existing components – Adjust thresholds or parameters to tune for specific environments
  • Create alternative storage backends – Direct output to different storage services or databases
  • Implement preprocessing and postprocessing steps – Add image enhancement, data filtering, or additional context generation

Finally, the LambdaProcessor serves as the bridge to the Amazon Bedrock agent by invoking an AWS Lambda function that sends the information in a request to the deployed agent. From there, the Amazon Bedrock agent takes over and analyzes the event and takes action accordingly.

Agent implementation

After you deploy the solution, an Amazon Bedrock agent alias becomes available. This agent functions as an intelligent analysis layer, processing captured video events and executing appropriate actions based on its analysis. You can test the agent and view its reasoning trace directly on the Amazon Bedrock console, as shown in the following screenshot.

This agent lacks some of the metadata supplied by the Streamlit application (such as the current time) and might not give the same answers as the full application.

Invocation flow

The agent is invoked through a Lambda function that handles the request-response cycle and manages session state. It finds the highest published version ID and uses it to invoke the agent and parses the response.
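As a simplified sketch of the core call (the agent ID and alias ID are placeholders for the values created by the deployment), invoking an Amazon Bedrock agent and reading its streamed completion looks roughly like the following:

```
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def invoke_agent(prompt: str, session_id: str) -> str:
    # AGENT_ID and AGENT_ALIAS_ID are placeholders for the deployed agent and alias
    response = bedrock_agent_runtime.invoke_agent(
        agentId="AGENT_ID",
        agentAliasId="AGENT_ALIAS_ID",
        sessionId=session_id,
        inputText=prompt,
    )
    # The completion is returned as an event stream of chunks
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )
```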

Action groups

The agent’s capabilities are defined through action groups implemented using the BedrockAgentResolver framework. This approach automatically generates the OpenAPI schema required by the agent.

When the agent is invoked, it receives an event object that includes the API path and other parameters that inform the agent framework how to route the request to the appropriate handler. You can add new actions by defining additional endpoint handlers following the same pattern and generating a new OpenAPI schema:

```
if __name__ == "__main__":
    print(app.get_openapi_json_schema())
```
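For reference, a new action handler might be declared roughly as follows with the Powertools for AWS Lambda BedrockAgentResolver; the endpoint path and logic here are illustrative additions, not part of the repository:

```
from datetime import datetime, timezone

from aws_lambda_powertools.event_handler import BedrockAgentResolver

app = BedrockAgentResolver()

@app.get("/current-time", description="Returns the current UTC time for the agent")
def current_time() -> str:
    return datetime.now(timezone.utc).isoformat()

def lambda_handler(event, context):
    # Route the agent's API request to the matching handler and format the response
    return app.resolve(event, context)
```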

Text-to-SQL integration

Through its action group, the agent is able to translate natural language queries into SQL for structured data analysis. The system reads data from assets/data_query_data_source, which can include various formats like CSV, JSON, ORC, or Parquet.

This capability enables users to query structured data using natural language. As demonstrated in the following example, the system translates natural language queries about vehicles into SQL, returning structured information from the database.

The database connection is configured through a SQLAlchemy engine. Users can connect to existing databases by updating the create_sql_engine() function to use their connection parameters.
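As an illustration only, a create_sql_engine() that points at an Athena database could look like the following; the connection string values are placeholders, and the use of the PyAthena SQLAlchemy dialect is an assumption rather than what the repository necessarily uses.

```
from sqlalchemy import create_engine

def create_sql_engine():
    # Placeholder Region, database, and staging bucket; PyAthena provides the awsathena+rest dialect
    return create_engine(
        "awsathena+rest://@athena.us-west-2.amazonaws.com:443/"
        "my_event_database?s3_staging_dir=s3://my-query-results-bucket/athena/"
    )
```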

Event memory and semantic search

The agent maintains a detailed memory of past events, storing event logs with rich descriptions in Amazon S3. These events become searchable through both vector-based semantic search and date-based filtering. As shown in the following example, temporal queries make it possible to retrieve information about events within specific time periods, such as vehicles observed in the past 72 hours.

The system’s semantic memory capabilities enable queries based on abstract concepts and natural language descriptions. As shown in the following example, the agent can understand abstract concepts like “funny” and retrieve relevant events, such as a person dropping a birthday cake.

Events can be linked together by the agent to identify patterns or related incidents. For example, the system can correlate separate sightings of individuals with similar characteristics. In the following screenshots, the agent connects related incidents by identifying common attributes like clothing items across different events.

This event memory store allows the system to build knowledge over time, providing increasingly valuable insights as it accumulates data. The combination of structured database querying and semantic search across event descriptions creates an agent with a searchable memory of all past events.

Prerequisites

Before you deploy the solution, complete the following prerequisites:

  1. Configure AWS credentials using aws configure. Use either the us-west-2 or us-east-1 AWS Region.
  2. Enable access to Anthropic’s Claude 3.x models, or another supported Amazon Bedrock Agents model you want to use.
  3. Make sure you have the following dependencies:

Deploy the solution

The AWS CDK deployment creates the following resources:

  • Storage – S3 buckets for assets and query results
  • Amazon Bedrock resources – Agent and knowledge base
  • Compute – Lambda functions for actions, invocation, and updates
  • Database – Athena database for structured queries, and an AWS Glue crawler for data discovery

Deploy the solution with the following commands:

```
#1. Clone the repository and navigate to folder
git clone https://github.com/aws-samples/sample-video-monitoring-agent.git && cd sample-video-monitoring-agent
#2. Set up environment and install dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
#3. Deploy AWS resources
cdk bootstrap && cdk deploy
#4. Run the streamlit app
cd code/streamlit_app && streamlit run app.py
```

On Windows, replace the second line with the following code:

```
python3 -m venv .venv && .venv\Scripts\activate.bat && pip install -r requirements.txt
```

Clean up

To destroy the resources you created and stop incurring charges, run the following command:

```
cdk destroy
```

Future enhancements

The current implementation demonstrates the potential of agent-based video monitoring in a home security setting, but there are many potential applications.

Sample use cases

The following examples showcase how the solution applies to various scenarios.

Small business

{ "alert_level": 0, "timestamp": "2024-11-20T15:24:15Z", "reason": "Vehicle arrival in driveway", "description": "Standard vehicle arrival and parking sequence. Vehicles present: Black Nissan Frontier pickup (parked), silver Honda CR-V (arriving), and partial view of blue vehicle in foreground. Area features: Gravel driveway surface, two waste bins (County Waste and recycling), evergreen trees in background. Sequence shows Honda CR-V executing normal parking maneuver: approaches from east, performs standard three-point turn, achieves final position next to pickup truck. Daytime conditions, clear visibility. Vehicle condition: Clean, well-maintained CR-V appears to be 2012-2016 model year, no visible damage or unusual modifications. Movement pattern indicates familiar driver performing routine parking. No suspicious behavior or safety concerns observed. Timestamp indicates standard afternoon arrival time. Waste bins properly positioned and undisturbed during parking maneuver." }

Industrial

{ "alert_level": 2, "timestamp": "2024-11-20T15:24:15Z", "reason": "Warehouse product spill/safety hazard", "description": "Significant product spill incident in warehouse storage aisle. Location: Main warehouse aisle between high-bay racking systems containing boxed inventory. Sequence shows what appears to be liquid or container spill, likely water/beverage products based on blue colored containers visible. Infrastructure: Professional warehouse setup with multi-level blue metal racking, concrete flooring, overhead lighting. Incident progression: Initial frames show clean aisle, followed by product falling/tumbling, resulting in widespread dispersal of items across aisle floor. Hazard assessment: Creates immediate slip/trip hazard, blocks emergency egress path, potential damage to inventory. Area impact: Approximately 15-20 feet of aisle space affected. Facility type appears to be distribution center or storage warehouse. Multiple cardboard boxes visible on surrounding shelves potentially at risk from liquid damage." }

Backyard

{ "alert_level": 1, "timestamp": "2024-11-20T15:24:15Z", "reason": "Wildlife detected on property", "description": "Adult raccoon observed investigating porch/deck area with white railings. Night vision/IR camera provides clear footage of animal. Subject animal characteristics: medium-sized adult raccoon, distinctive facial markings clearly visible, healthy coat condition, normal movement patterns. Sequence shows animal approaching camera (15:42PM), investigating area near railing (15:43-15:44PM), with close facial examination (15:45PM). Final frame shows partial view as animal moves away. Environment: Location appears to be elevated deck/porch with white painted wooden railings and balusters. Lighting conditions: Nighttime, camera operating in infrared/night vision mode providing clear black and white footage. Animal behavior appears to be normal nocturnal exploration, no signs of aggression or disease." }

Home safety

{ "alert_level": 2, "timestamp": "2024-11-20T15:24:15Z", "reason": "Smoke/possible fire detected", "description": "Rapid development of white/grey smoke visible in living room area. Smoke appears to be originating from left side of frame, possibly near electronics/TV area. Room features: red/salmon colored walls, grey couch, illuminated aquarium, table lamps, framed artwork. Sequence shows progressive smoke accumulation over 4-second span (15:42PM – 15:46PM). Notable smoke density increase in upper left corner of frame with potential light diffusion indicating particulate matter in air. Smoke pattern suggests active fire development rather than residual smoke. Blue light from aquarium remains visible throughout sequence providing contrast reference for smoke density." }

Further extensions

In addition, you can extend the FM capabilities using the following methods:

  • Fine-tuning for specific monitoring contexts – Adapting the models to recognize domain-specific objects, behaviors, and scenarios
  • Refined prompts for specific use cases – Creating specialized instructions that optimize the agent’s performance for particular environments like industrial facilities, retail spaces, or residential settings

You can expand the agent’s ability to take action, for example:

  • Direct control of smart home and smart building systems – Integrating with Internet of Things (IoT) device APIs to control lights, locks, or alarm systems
  • Integration with security and safety protocols – Connecting to existing security infrastructure to follow established procedures
  • Automated response workflows – Creating multi-step action sequences that can be triggered by specific events

You can also consider enhancing the event memory system:

  • Long-term pattern recognition – Identifying recurring patterns over extended time periods
  • Cross-camera correlation – Linking observations from multiple cameras to track movement through a space
  • Anomaly detection based on historical patterns – Automatically identifying deviations from established baselines

Lastly, consider extending the monitoring capabilities beyond fixed cameras:

  • Monitoring for robotic vision systems – Applying the same intelligence to mobile robots that patrol or inspect areas
  • Drone-based surveillance – Processing aerial footage for comprehensive site monitoring
  • Mobile security applications – Extending the platform to process feeds from security personnel body cameras or mobile devices

These enhancements can transform the system from a passive monitoring tool into an active participant in security operations, with increasingly sophisticated understanding of normal patterns and anomalous events.

Conclusion

The approach of using agents as escalators represents a significant advancement in video monitoring, combining the contextual understanding capabilities of FMs with the action-oriented framework of Amazon Bedrock Agents. By filtering the signal from the noise, this solution addresses the critical problem of alert fatigue while enhancing security and safety monitoring capabilities. With this solution, you can:

  • Reduce false positives while maintaining high detection sensitivity
  • Provide human-readable descriptions and classifications of events
  • Maintain searchable records of all activity
  • Scale monitoring capabilities without proportional human resources

The combination of intelligent screening, graduated responses, and semantic memory enables a more effective and efficient monitoring system that enhances human capabilities rather than replacing them. Try the solution today and experience how Amazon Bedrock Agents can transform your video monitoring capabilities from simple motion detection to intelligent scene understanding.


About the authors

Kiowa Jackson is a Senior Machine Learning Engineer at AWS ProServe, specializing in computer vision and agentic systems for industrial applications. His work bridges classical machine learning approaches with generative AI to enhance industrial automation capabilities. His past work includes collaborations with Amazon Robotics, NFL, and Koch Georgia Pacific.

Piotr Chotkowski is a Senior Cloud Application Architect at AWS Generative AI Innovation Center. He has experience in hands-on software engineering as well as software architecture design. In his role at AWS, he helps customers design and build production grade generative AI applications in the cloud.

Read More