Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

This post was co-written with Renato Nascimento, Felipe Viana, Andre Von Zuben from Articul8.

Generative AI is reshaping industries, offering new efficiencies, automation, and innovation. However, generative AI requires powerful, scalable, and resilient infrastructure that optimizes large-scale model training and enables rapid iteration and efficient compute utilization, with purpose-built resources and automated cluster management.

In this post, we share how Articul8 is accelerating their training and deployment of domain-specific models (DSMs) by using Amazon SageMaker HyperPod and achieving over 95% cluster utilization and a 35% improvement in productivity.

What is SageMaker HyperPod?

SageMaker HyperPod is an advanced distributed training solution designed to accelerate the development of scalable, reliable, and secure generative AI models. Articul8 uses SageMaker HyperPod to efficiently train large language models (LLMs) on diverse, representative data and uses its observability and resiliency features to keep the training environment stable over the long duration of training jobs. SageMaker HyperPod provides the following features:

  • Fault-tolerant compute clusters with automated faulty node replacement during model training
  • Efficient cluster utilization through observability and performance monitoring
  • Seamless model experimentation with streamlined infrastructure orchestration using Slurm and Amazon Elastic Kubernetes Service (Amazon EKS)

Who is Articul8?

Articul8 was established to address the gaps in enterprise generative AI adoption by developing autonomous, production-ready products. They found that general-purpose LLMs often fall short in delivering the accuracy, efficiency, and domain-specific knowledge needed for real-world business challenges. They are pioneering a set of DSMs that offer twofold better accuracy and completeness, compared to general-purpose models, at a fraction of the cost. (See their recent blog post for more details.)

The company’s proprietary ModelMesh™ technology serves as an autonomous layer that decides, selects, executes, and evaluates the right models at runtime. Think of it as a reasoning system that determines what to run, when to run it, and in what sequence, based on the task and context. It evaluates responses at every step to refine its decision-making, enabling more reliable and interpretable AI solutions while dramatically improving performance.

Articul8’s ModelMesh™ supports:

  • LLMs for general tasks
  • Domain-specific models optimized for industry-specific applications
  • Non-LLMs for specialized reasoning tasks or established domain-specific tasks (for example, scientific simulation)

Articul8’s domain-specific models are setting new industry standards across supply chain, energy, and semiconductor sectors. The A8-SupplyChain model, built for complex workflows, achieves 92% accuracy and threefold performance gains over general-purpose LLMs in sequential reasoning. In energy, A8-Energy models were developed with EPRI and NVIDIA as part of the Open Power AI Consortium, enabling advanced grid optimization, predictive maintenance, and equipment reliability. The A8-Semicon model has set a new benchmark, outperforming top open-source (DeepSeek-R1, Meta Llama 3.3/4, Qwen 2.5) and proprietary models (GPT-4o, Anthropic’s Claude) by twofold in Verilog code accuracy, all while running at 50–100 times smaller model sizes for real-time AI deployment.

Articul8 develops some of their domain-specific models using Meta’s Llama family as a flexible, open-weight foundation for expert-level reasoning. Through a rigorous fine-tuning pipeline with reasoning trajectories and curated benchmarks, general Llama models are transformed into domain specialists. To tailor models for areas like hardware description languages, Articul8 applies Reinforcement Learning with Verifiable Rewards (RLVR), using automated reward pipelines to specialize the model’s policy. In one case, a dataset of 50,000 documents was automatically processed into 1.2 million images, 360,000 tables, and 250,000 summaries, clustered into a knowledge graph of over 11 million entities. These structured insights fuel A8-DSMs across research, product design, development, and operations.

How SageMaker HyperPod accelerated the development of Articul8’s DSMs

The cost and time required to train DSMs are critical to Articul8’s success in a rapidly evolving ecosystem. Training high-performance DSMs requires extensive experimentation, rapid iteration, and scalable compute infrastructure. With SageMaker HyperPod, Articul8 was able to:

  • Rapidly iterate on DSM training – SageMaker HyperPod resiliency features enabled Articul8 to train and fine-tune its DSMs in a fraction of the time required by traditional infrastructure
  • Optimize model training performance – By using the automated failure recovery feature in SageMaker HyperPod, Articul8 maintained stable and resilient training processes
  • Reduce AI deployment time by four times and lower total cost of ownership by five times – The orchestration capabilities of SageMaker HyperPod alleviated the manual overhead of cluster management, allowing Articul8’s research teams to focus on model optimization rather than infrastructure upkeep

These advantages contributed to record-setting benchmark results by Articul8, proving that domain-specific models deliver superior real-world performance compared to general-purpose models.

Distributed training challenges and the role of SageMaker HyperPod

Distributed training across hundreds of nodes faces several critical challenges beyond basic resource constraints. Managing massive training clusters requires robust infrastructure orchestration and careful resource allocation for operational efficiency. SageMaker HyperPod offers a managed orchestration experience with both Slurm and Amazon EKS that streamlines cluster creation, infrastructure resilience, job submission, and observability. The following details focus on the Slurm implementation for reference:

  • Cluster setup – Although setting up a cluster is a one-time effort, the process is streamlined with a setup script that walks the administrator through each step of cluster creation. This post shows how this can be done in discrete steps.
  • Resiliency – Fault tolerance becomes paramount when operating at scale. SageMaker HyperPod handles node failures and network interruptions by replacing faulty nodes automatically. You can add the flag --auto-resume=1 with the Slurm srun command, and the distributed training job will recover from the last checkpoint.
  • Job submission – SageMaker HyperPod managed Slurm orchestration is a powerful way for data scientists to submit and manage distributed training jobs. Refer to the example in the AWS-samples distributed training repo. For instance, a distributed training job can be submitted with a Slurm sbatch command: sbatch 1.distributed-training-llama2.sbatch. You can use squeue and scancel to view and cancel jobs, respectively.
  • Observability – SageMaker HyperPod uses Amazon CloudWatch and open source managed Prometheus and Grafana services for monitoring and logging. Cluster administrators can view the health of the infrastructure (network, storage, compute) and utilization.

Solution overview

The SageMaker HyperPod platform enables Articul8 to efficiently manage high-performance compute clusters without requiring a dedicated infrastructure team. The service automatically monitors cluster health and replaces faulty nodes, making the deployment process frictionless for researchers.

To enhance their experimental capabilities, Articul8 integrated SageMaker HyperPod with Amazon Managed Grafana, providing real-time observability of GPU resources through a single-pane-of-glass dashboard. They also used SageMaker HyperPod lifecycle scripts to customize their cluster environment and install required libraries and packages. This comprehensive setup empowers Articul8 to conduct rapid experimentation while maintaining high performance and reliability—they reduced their customers’ AI deployment time by four times and lowered their total cost of ownership by five times.

The following diagram illustrates the observability architecture.

SageMaker HyperPod Architecture (Slurm)

The platform’s efficiency in managing computational resources with minimum downtime has been particularly valuable for Articul8’s research and development efforts, empowering them to quickly iterate on their generative AI solutions while maintaining enterprise-grade performance standards. The following sections describe the setup and results in detail.

For the setup in this post, we begin with the AWS-published workshop for SageMaker HyperPod and adjust it to suit our workload.

Prerequisites

The following two AWS CloudFormation templates address the prerequisites of the solution setup.

For SageMaker HyperPod

This CloudFormation stack addresses the prerequisites for SageMaker HyperPod:

  • VPC and two subnets – A public subnet and a private subnet are created in an Availability Zone (provided as a parameter). The virtual private cloud (VPC) contains two CIDR blocks with 10.0.0.0/16 (for the public subnet) and 10.1.0.0/16 (for the private subnet). An internet gateway and NAT gateway are deployed in the public subnet.
  • Amazon FSx for Lustre file system – An Amazon FSx for Lustre volume is created in the specified Availability Zone, with a default of 1.2 TB storage, which can be overridden by a parameter. For this case study, we increased the storage size to 7.2 TB.
  • Amazon S3 bucket – The stack deploys endpoints for Amazon Simple Storage Service (Amazon S3) to store lifecycle scripts.
  • IAM role – An AWS Identity and Access Management (IAM) role is also created to help execute SageMaker HyperPod cluster operations.
  • Security group – The script creates a security group to enable EFA communication for multi-node parallel batch jobs.

For cluster observability

To get visibility into cluster operations and make sure workloads are running as expected, an optional CloudFormation stack has been used for this case study. This stack includes:

  • Node exporter – Supports visualization of CPU load averages, memory and disk usage, network traffic, file system, and disk I/O metrics
  • NVIDIA DCGM – Supports visualization of GPU utilization, temperatures, power usage, and memory usage
  • EFA metrics – Supports visualization of EFA network and error metrics, EFA RDMA performance, and so on.
  • FSx for Lustre – Supports visualization of file system read/write operations, free capacity, and metadata operations

Observability can be configured through YAML scripts to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are also set up on the cluster nodes.

Using Amazon Managed Grafana with SageMaker HyperPod helps you create dashboards to monitor GPU clusters and make sure they operate efficiently with minimum downtime. In addition, dashboards have become a critical tool to give you a holistic view of how specialized workloads consume different resources of the cluster, helping developers optimize their implementation.

Cluster setup

The cluster is set up with the following components (results might vary based on customer use case and deployment setup):

  • Head node and compute nodes – For this case study, we use a head node and SageMaker HyperPod compute nodes. The head node is an ml.m5.12xlarge instance, and the compute queue consists of ml.p4de.24xlarge instances.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
  • Local storage – Each node has an 8 TB local NVMe volume attached for local storage.
  • Scheduler – Slurm is used as an orchestrator. Slurm is an open source and highly scalable cluster management tool and job scheduling system for high-performance computing (HPC) clusters.
  • Accounting – As part of cluster configuration, a local MariaDB is deployed that keeps track of job runtime information.

Results

During this project, Articul8 was able to confirm the expected performance of A100 GPUs, with the added benefit of creating a cluster using Slurm and providing observability metrics to monitor the health of various components (storage, GPU nodes, fiber). The primary validation was on the ease of use and rapid ramp-up of data science experiments. Furthermore, they were able to demonstrate near linear scaling with distributed training, achieving a 3.78 times reduction in time to train Meta Llama 2 13B when using four times as many nodes. Having the flexibility to run multiple experiments without losing development time to infrastructure overhead was an important accomplishment for the Articul8 data science team.

Clean up

If you run the cluster as part of the workshop, you can follow the cleanup steps to delete the CloudFormation resources after deleting the cluster.

Conclusion

This post demonstrated how Articul8 AI used SageMaker HyperPod to overcome the scalability and efficiency challenges of training multiple high-performing DSMs across key industries. By alleviating infrastructure complexity, SageMaker HyperPod empowered Articul8 to focus on building AI systems with measurable business outcomes. From semiconductor and energy to supply chain, Articul8’s DSMs are proving that the future of enterprise AI is not general—it’s purpose-built. Key takeaways include:

  • DSMs significantly outperform general-purpose LLMs in critical domains
  • SageMaker HyperPod accelerated the development of Articul8’s A8-Semicon, A8-SupplyChain, and A8-Energy DSMs
  • Articul8 reduced AI deployment time by four times and lowered total cost of ownership by five times using the scalable, automated training infrastructure of SageMaker HyperPod

Learn more about SageMaker HyperPod by following this workshop. Reach out to your account team to learn how you can use this service to accelerate your own training workloads.


About the Authors

Yashesh A. Shroff, PhD, is a Sr. GTM Specialist in the GenAI Frameworks organization, responsible for scaling customer foundational model training and inference on AWS using self-managed or specialized services to meet cost and performance requirements. He holds a PhD in Computer Science from UC Berkeley and an MBA from Columbia Graduate School of Business.

Amit Bhatnagar is a Sr Technical Account Manager with AWS, in the Enterprise Support organization, with a focus on generative AI startups. He is responsible for helping key AWS customers with their strategic initiatives and operational excellence in the cloud. When he is not chasing technology, Amit loves to cook vegan delicacies and hit the road with his family to chase the horizon.

Renato Nascimento is the Head of Technology at Articul8, where he leads the development and execution of the company’s technology strategy. With a focus on innovation and scalability, he ensures the seamless integration of cutting-edge solutions into Articul8’s products, enabling industry-leading performance and enterprise adoption.

Felipe Viana is the Head of Applied Research at Articul8, where he leads the design, development, and deployment of innovative generative AI technologies, including domain-specific models, new model architectures, and multi-agent autonomous systems.

Andre Von Zuben is the Head of Architecture at Articul8, where he is responsible for designing and implementing scalable generative AI platform elements, novel generative AI model architectures, and distributed model training and deployment pipelines.

How VideoAmp uses Amazon Bedrock to power their media analytics interface

This post was co-written with Suzanne Willard and Makoto Uchida from VideoAmp.

In this post, we illustrate how VideoAmp, a media measurement company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to develop a prototype of the VideoAmp Natural Language (NL) Analytics Chatbot to uncover meaningful insights at scale within media analytics data using Amazon Bedrock. The AI-powered analytics solution involved the following components:

  • A natural language to SQL pipeline, with a conversational interface, that works with complex queries and media analytics data from VideoAmp
  • An automated testing and evaluation tool for the pipeline

VideoAmp background

VideoAmp is a tech-first measurement company that empowers media agencies, brands, and publishers to precisely measure and optimize TV, streaming, and digital media. With a comprehensive suite of measurement, planning, and optimization solutions, VideoAmp offers clients a clear, actionable view of audiences and attribution across environments, enabling them to make smarter media decisions that help them drive better business outcomes. VideoAmp has seen incredible adoption for its measurement and currency solutions with 880% YoY growth, 98% coverage of the TV publisher landscape, 11 agency groups, and more than 1,000 advertisers. VideoAmp is headquartered in Los Angeles and New York with offices across the United States. To learn more, visit www.videoamp.com.

VideoAmp’s AI journey

VideoAmp has embraced AI to enhance its measurement and optimization capabilities. The company has integrated machine learning (ML) algorithms into its infrastructure to analyze vast amounts of viewership data across traditional TV, streaming, and digital services. This AI-driven approach allows VideoAmp to provide more accurate audience insights, improve cross-environment measurement, and optimize advertising campaigns in real time. By using AI, VideoAmp has been able to offer advertisers and media owners more precise targeting, better attribution models, and increased return on investment for their advertising spend. The company’s AI journey has positioned it as a leader in the evolving landscape of data-driven advertising and media measurement.

To take their innovations a step further, VideoAmp is building a brand-new analytics solution powered by generative AI, which will provide their customers with accessible business insights. Their goal for a beta product is to create a conversational AI assistant powered by large language models (LLMs) that allows VideoAmp’s data analysts and non-technical users such as content researchers and publishers to perform data analytics using natural language queries.

Use case overview

VideoAmp is undergoing a transformative journey by integrating generative AI into its analytics. The company aims to revolutionize how customers, including publishers, media agencies, and brands, interact with and derive insights from VideoAmp’s vast repository of data through a conversational AI assistant interface.

Presently, analysis by data scientists and analysts is done manually, requires technical SQL knowledge, and can be time-consuming for complex and high-dimensional datasets. Acknowledging the necessity for streamlined and accessible processes, VideoAmp worked with the GenAIIC to develop an AI assistant capable of comprehending natural language queries, generating and executing SQL queries on VideoAmp’s data warehouse, and delivering natural language summaries of retrieved information. The assistant allows non-technical users to surface data-driven insights, and it reduces research and analysis time for both technical and non-technical users.

Key success criteria for the project included:

  • The ability to convert natural language questions into SQL statements, connect to VideoAmp’s provided database, execute statements on VideoAmp performance metrics data, and create a natural language summary of results
  • A UI to ask natural language questions and view assistant output, which includes generated SQL queries, reasoning for the SQL statements, retrieved data, and natural language data summaries
  • Conversational support for the user to iteratively refine and filter asked questions
  • Low latency and cost-effectiveness
  • An automated evaluation pipeline to assess the quality and accuracy of the assistant

The team overcame a few challenges during the development process:

  • Adapting LLMs to understand the domain aspects of VideoAmp’s dataset – The dataset included highly industry-specific fields and metrics, and required complex queries to effectively filter and analyze. The queries often involved multiple specialized metric calculations, filters selecting from over 30 values, and extensive grouping and ordering.
  • Developing an automated evaluation pipeline – The pipeline is able to correctly identify if generated outputs are equivalent to ground truth data, even if they have different column aliasing, ordering, and metric calculations.

Solution overview

The GenAIIC team worked with VideoAmp to create an AI assistant that used Anthropic’s Claude 3 LLMs through Amazon Bedrock. Amazon Bedrock was chosen for this project because it provides access to high-quality foundation models (FMs), including Anthropic’s Claude 3 series, through a unified API. This allowed the team to quickly integrate the most suitable models for different components of the solution, such as SQL generation and data summarization.

Additional features in Amazon Bedrock, including Amazon Bedrock Prompt Management, native support for Retrieval Augmented Generation (RAG) and structured data retrieval through Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and fine-tuning, enable VideoAmp to quickly expand the analytics solution and take it to production. Amazon Bedrock also offers robust security and adheres to compliance certifications, allowing VideoAmp to confidently expand their AI analytics solution while maintaining data privacy and adhering to industry standards.

The solution is connected to a data warehouse. It supports a variety of database connections, such as Snowflake, SingleStore, PostgreSQL, Excel and CSV files, and more. The following diagram illustrates the high-level workflow of the solution.

A diagram illustrating the high-level workflow of VideoAmp's Natural Language Analytics solution

The workflow consists of the following steps:

  1. The user navigates to the frontend application and asks a question in natural language.
  2. A Question Rewriter LLM component uses previous conversational context to augment the question with additional details if applicable. This allows follow-up questions and refinements to previous questions.
  3. A Text-to-SQL LLM component creates a SQL query that corresponds to the user question.
  4. The SQL query is executed in the data warehouse.
  5. A Data-to-Text LLM component summarizes the retrieved data for the user.

The rewritten question, generated SQL, reasoning, and retrieved data are returned at each step.

AI assistant workflow details

In this section, we discuss the components of the AI assistant workflow in more detail.

Rewriter

After the user asks the question, the current question and the previous questions the user asked in the current session are sent to the Question Rewriter component, which uses Anthropic’s Claude 3 Sonnet model. If deemed necessary, the LLM uses context from the previous questions to augment the current user question to make it a standalone question with context included. This enables multi-turn conversational support for the user, allowing for natural interactions with the assistant.

For example, if a user first asked, “For the week of 09/04/2023 – 09/10/2023, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”, followed by, “Can I have the same data for one year later”, the rewriter would rewrite the latter question as “For the week of 09/03/2024 – 09/09/2024, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”
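The following is a minimal sketch of how such a rewriter step might be invoked on Amazon Bedrock with the Converse API. The prompt wording, model ID, and inference settings here are illustrative assumptions, not VideoAmp’s production configuration.

import boto3

# Hypothetical question rewriter sketch: combines session history with the current
# question and asks Claude 3 Sonnet on Amazon Bedrock to produce a standalone question.
bedrock = boto3.client("bedrock-runtime")

def rewrite_question(current_question, previous_questions):
    history = "\n".join(previous_questions)
    prompt = (
        "Previous questions in this session:\n"
        f"{history}\n\n"
        f"Current question: {current_question}\n\n"
        "Rewrite the current question as a standalone question, carrying over any "
        "dates, filters, or entities implied by the previous questions. "
        "Return only the rewritten question."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]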

Text-to-SQL

The rewritten user question is sent to the Text-to-SQL component, which also uses Anthropic’s Claude 3 Sonnet model. The Text-to-SQL component uses information about the database in its prompt to generate a SQL query corresponding to the user question. It also generates an explanation of the query.

The text-to-SQL prompt addressed several challenges, such as industry-specific language in user questions, complex metrics, and several rules and defaults for filtering. The prompt was developed through several iterations, based on feedback and guidance from the VideoAmp team, and manual and automated evaluation.

The prompt consisted of four overarching sections: context, SQL instructions, task, and examples. During the development phase, database schema and domain- or task-specific knowledge were found to be critical, so one major part of the prompt was designed to incorporate them in the context. To make this solution reusable and scalable, a modularized design of the prompt/input system is employed, making it generic so it can be applied to other use cases and domains. The solution can support Q&A with multiple databases by dynamically switching the corresponding context with an orchestrator if needed.

The context section contains the following details:

  • Database schema
  • Sample categories for relevant data fields such as television networks to aid the LLM in understanding what fields to use for identifiers in the question
  • Industry term definitions
  • How to calculate different types of metrics or aggregations
  • Default values or fields to select if not specified
  • Other domain- or task-specific knowledge

The SQL instructions contain the following details:

  • Dynamic insertion of today’s date as a reference for terms, such as “last 3 quarters”
  • Instructions on usage of sub-queries
  • Instructions on when to retrieve additional informational columns not specified in the user question
  • Known SQL syntax and database errors to avoid and potential fixes

In the task section, the LLM is given a detailed step-by-step process to formulate SQL queries based on the context. A step-by-step process is required for the LLM to correctly think through and assimilate the required context and rules. Without the step-by-step process, the team found that the LLM wouldn’t adhere to all instructions provided in the previous sections.

In the examples section, the LLM is given several examples of user questions, corresponding SQL statements, and explanations.

In addition to iterating on the prompt content, different content organization patterns were tested to handle the long context. The final prompt was organized with markdown and XML.
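As a rough sketch under these assumptions, the four sections might be assembled as follows. The section contents, helper names, and rules shown here are placeholders; VideoAmp’s actual prompt content is proprietary and more extensive.

from datetime import date

# Hypothetical assembly of the text-to-SQL prompt from its four sections
# (context, SQL instructions, task, examples), organized with markdown headers and XML tags.
def build_text_to_sql_prompt(schema, domain_knowledge, examples, question):
    context = (
        "## Context\n"
        f"<schema>\n{schema}\n</schema>\n"
        f"<domain_knowledge>\n{domain_knowledge}\n</domain_knowledge>"
    )
    sql_instructions = (
        "## SQL instructions\n"
        f"- Today's date is {date.today().isoformat()}; resolve relative ranges such as 'last 3 quarters' against it.\n"
        "- Generate read-only SELECT statements only."
    )
    task = (
        "## Task\n"
        "Think step by step: identify the metrics, filters, and groupings in the question, "
        "map them to schema fields, then write the SQL query and a short explanation."
    )
    examples_section = "## Examples\n" + examples
    question_section = "## Question\n" + question
    return "\n\n".join([context, sql_instructions, task, examples_section, question_section])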

SQL execution

After the Text-to-SQL component outputs a query, the query is executed against VideoAmp’s data warehouse using database connector code. For this use case, only read queries for analytics are executed to protect the database from unexpected operations like updates or deletes. The credentials for the database are securely stored and accessed using AWS Secrets Manager and AWS Key Management Service (AWS KMS).
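The following is a minimal sketch of this pattern. The secret name, its JSON fields, the read-only check, and the choice of the pg8000 driver are assumptions for illustration; the actual implementation depends on VideoAmp’s warehouse and connector code.

import json
import boto3
import pg8000.native  # hypothetical driver choice for a PostgreSQL-compatible warehouse

def run_read_only_query(sql):
    # Reject anything that is not a read query before touching the warehouse.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT queries are allowed")

    # Fetch database credentials from AWS Secrets Manager (secret name is a placeholder).
    secrets = boto3.client("secretsmanager")
    creds = json.loads(secrets.get_secret_value(SecretId="analytics/warehouse")["SecretString"])

    conn = pg8000.native.Connection(
        user=creds["username"],
        password=creds["password"],
        host=creds["host"],
        database=creds["database"],
    )
    try:
        return conn.run(sql)
    finally:
        conn.close()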

Data-to-Text

The data retrieved by the SQL query is sent to the Data-to-Text component, along with the rewritten user question. The Data-to-Text component, which uses Anthropic’s Claude 3 Haiku model, produces a concise summary of the retrieved data and answers the user question.

The final outputs are displayed on the frontend application as shown in the following screenshots (protected data is hidden).

A screenshot showing the outputs of the VideoAmp Natural Language Analytics solution

A screenshot showing the outputs of the VideoAmp Natural Language Analytics solution

Evaluation framework workflow details

The GenAIIC team developed a sophisticated automated evaluation pipeline for VideoAmp’s NL Analytics Chatbot, which directly informed prompt optimization and solution improvements and was a critical component in providing high-quality results.

The evaluation framework comprises two categories:

  • SQL query evaluation – Generated SQL queries are evaluated for overall closeness to the ground truth SQL query. A key feature of the SQL evaluation component was the ability to account for column aliasing and ordering differences when comparing statements and determine equivalency.
  • Retrieved data evaluation – The retrieved data is compared to ground truth data to determine an exact match, after a few processing steps to account for column, formatting, and system differences.
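The following sketch illustrates the kind of normalization the retrieved-data comparison performs, using pandas. The specific rules shown (rounding, ignoring column aliases and ordering) are simplified assumptions; the actual pipeline handles more formatting and system differences.

import pandas as pd

def normalize(df):
    # Tolerate small numeric formatting differences.
    df = df.copy().round(4)
    # Ignore column aliasing and ordering: identify columns by their content, not their names.
    cols = sorted(df.columns, key=lambda c: tuple(df[c].astype(str)))
    df = df[cols]
    df.columns = range(len(cols))
    # Ignore row ordering.
    return df.sort_values(by=list(df.columns)).reset_index(drop=True)

def dataframes_match(generated, ground_truth):
    # Exact match after normalization.
    return normalize(generated).equals(normalize(ground_truth))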

The evaluation pipeline also produces detailed reports of the results and discrepancies between generated data and ground truth data.

Dataset

The dataset used for the prototype solution was hosted in a data warehouse and consisted of performance metrics data such as viewership, ratings, and rankings for television networks and programs. The field names were industry-specific, so a data dictionary was included in the text-to-SQL prompt as part of the schema. The credentials for the database are securely stored and accessed using Secrets Manager and AWS KMS.

Results

A set of test questions was evaluated by the GenAIIC and VideoAmp teams, focusing on three metrics:

  • Accuracy – Different accuracy metrics were analyzed, but exact matches between retrieved data and ground truth data were prioritized
  • Latency – LLM generation latency, excluding the time taken to query the database
  • Cost – Average cost per user question

Both the evaluation pipeline and human review reported high accuracies on the dataset, while costs and latencies remained low. Overall, the results were well aligned with VideoAmp’s expectations. VideoAmp anticipates this solution will make it simple for users to handle complex data queries with confidence through intuitive natural language interactions, reducing the time to business insights.

Conclusion

In this post, we shared how the GenAIIC team worked with VideoAmp to build a prototype of the VideoAmp NL Analytics Chatbot, an end-to-end generative AI data analytics interface using Amazon Bedrock and Anthropic’s Claude 3 LLMs. The solution is equipped with a variety of state-of-the-art LLM-based techniques, such as question rewriting, text-to-SQL query generation, and summarization of data in natural language. It also includes an automated evaluation module for evaluating the correctness of generated SQL statements and retrieved data. The solution achieved high accuracy on VideoAmp’s evaluation samples. Users can interact with the solution through an intuitive AI assistant interface with conversational capabilities.

VideoAmp will soon be launching their new generative AI-powered analytics interface, which enables customers to analyze data and gain business insights through natural language conversation. Their successful work with the GenAIIC team will allow VideoAmp to use generative AI technology to swiftly deliver valuable insights for both technical and non-technical customers.

This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases. The GenAIIC is a group of science and strategy experts with comprehensive expertise spanning the generative AI journey, helping you prioritize use cases, build a roadmap, and move solutions into production. If you’re interested in working with the GenAIIC, reach out to them today.


About the authors

Suzanne Willard is the VP of Engineering at VideoAmp where she founded and leads the GenAI program, establishing the strategic vision and execution roadmap. With over 20 years of experience, she is driving innovation in AI technologies, creating transformative solutions that align with business objectives and set the company apart in the market.

Makoto Uchida is a senior architect at VideoAmp in the AI domain, acting as area technical lead of AI portfolio, responsible for defining and driving AI product and technical strategy in the content and ads measurement platform PaaS product. Previously, he was a software engineering lead in generative and predictive AI Platform at a major hyperscaler public Cloud service. He has also engaged with multiple startups, laying the foundation of Data/ML/AI infrastructures.

Shreya Mohanty is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.

Long Chen is a Sr. Applied Scientist at AWS Generative AI Innovation Center. He holds a Ph.D. in Applied Physics from University of Michigan – Ann Arbor. With more than a decade of experience for research and development, he works on innovative solutions in various domains using generative AI and other machine learning techniques, ensuring the success of AWS customers. His interest includes generative models, multi-modal systems and graph learning.

Amaran Asokkumar is a Deep Learning Architect at AWS, specializing in infrastructure, automation, and AI. He leads the design of GenAI-enabled solutions across industry segments. Amaran is passionate about all things AI and helping customers accelerate their GenAI exploration and transformation efforts.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Scaling up image segmentation across data and tasks


Novel architecture that fuses learnable queries and conditional queries improves a segmentation model’s ability to transfer across tasks.

Computer vision

June 12, 12:25 PM

The first draft of this blog post was generated by Amazon Nova Pro, based on detailed instructions from Amazon Science editors and multiple examples of prior posts.

In a paper we’re presenting at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR), we introduce a new approach to image segmentation that scales across diverse datasets and tasks. Traditional segmentation models, while effective on isolated tasks, often struggle as the number of new tasks or unfamiliar scenarios grows. Our proposed method, which uses a model we call a mixed-query transformer (MQ-Former), aims to enable joint training and evaluation across multiple tasks and datasets.

Scalable segmentation

Image segmentation is a computer vision task that involves partitioning an image into distinct regions or segments. Each segment corresponds to a different object or part of the scene. There are several types of segmentation tasks, including foreground/background segmentation (distinguishing objects at different distances), semantic segmentation (labeling each pixel as belonging to a particular object class), and instance segmentation (identifying each pixel as belonging to a particular instance of an object class).

An example of instance segmentation, given the text prompt “cardinal”.

Scalability means that a segmentation model can effectively improve with an increase in the size of its training dataset, in the diversity of the tasks it performs, or both. Most prior research has focused on one or the other: data diversity or task diversity. We address both at once.

A tale of two queries

In our paper, we show that one issue preventing effective scalability in segmentation models is the design of object queries. An object query is a way of representing a hypothesis about objects in a scene, one that can be tested against images.

There are two main types of object queries. The first, which we refer to as learnable queries, consists of learned vectors that interact with image features and encode information about location and object class. Learnable queries tend to perform well on semantic segmentation, as they do not contain object-specific priors.

The second type of object query, which we refer to as a conditional query, is akin to two-stage object detection: region proposals are generated by a Transformer encoder, and then high-confidence proposals are fed into the Transformer decoder as queries to generate the final prediction. Conditional queries are closely aligned with the object classes and excel at object detection and instance segmentation on semantically well-defined objects.

Our approach is to combine both types of queries, which improves the model’s ability to transfer across tasks. Our MQ-Former model represents inputs using both learnable queries and conditional queries, and every layer of the decoder has a cross-attention mechanism, so that the processing of the learnable queries can factor in information from the conditional-query processing, and vice versa.
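To make the idea concrete, the following is a conceptual PyTorch sketch of a decoder layer that processes learnable and conditional queries jointly. It is not the MQ-Former implementation; the dimensions, layer composition, and the omission of residual connections and normalization are simplifications for illustration.

import torch
import torch.nn as nn

class MixedQueryDecoderLayer(nn.Module):
    def __init__(self, dim=256, num_learnable=100, num_heads=8):
        super().__init__()
        # Learnable queries are free parameters shared across images.
        self.learnable_queries = nn.Parameter(torch.randn(num_learnable, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, conditional_queries, image_features):
        # conditional_queries: (batch, Nc, dim) high-confidence proposals from the encoder
        # image_features:      (batch, HW, dim) flattened encoder features
        batch = conditional_queries.shape[0]
        learnable = self.learnable_queries.unsqueeze(0).expand(batch, -1, -1)
        queries = torch.cat([learnable, conditional_queries], dim=1)
        # Self-attention over the mixed set lets the two query types exchange information.
        queries, _ = self.self_attn(queries, queries, queries)
        # Cross-attention grounds all queries in the image features.
        queries, _ = self.cross_attn(queries, image_features, image_features)
        return queries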

Architectural schematics for learnable queries, conditional queries, and mixed queries. Solid triangles represent instance segmentation ground truth, and striped triangles represent semantic-segmentation ground truth.

Leveraging synthetic data

Mixed queries aid scalability across segmentation tasks, but the other aspect of scalability in segmentation models is dataset size. One of the key challenges in scaling up segmentation models is the scarcity of high-quality, annotated data. To overcome this limitation, we propose leveraging synthetic data.

Examples of synthetic data. At left are two examples of synthetic masks, at right two examples of synthetic captions.

While segmentation data is scarce, object recognition data is plentiful. Object recognition datasets typically include bounding boxes, or rectangles that identify the image regions in which labeled objects can be found.

Asking a trained segmentation model to segment only the object within a bounding box significantly improves performance; we are thus able to use weaker segmentation models to convert object recognition datasets into segmentation datasets that can be used to train stronger segmentation models.

Bounding boxes can also focus automatic captioning models on regions of interest in an image, to provide the type of object classifications necessary to train semantic-segmentation and instance segmentation models.

Experimental results

We evaluated our approach using 15 datasets covering a range of segmentation tasks and found that, with MQ-Former, scaling up both the volume of training data and the diversity of tasks consistently enhances the model’s segmentation capabilities.

For example, on the SeginW benchmark, which includes 25 datasets used for open-vocabulary in-the-wild segmentation evaluation, scaling the data and tasks from 100,000 samples to 600,000 boosted performance by 16%, as measured by average precision of object masking. Incorporating synthetic data improved performance by another 14%, establishing a new state of the art.

Research areas: Computer vision

Tags: Image segmentation, Data representation

Adobe enhances developer productivity using Amazon Bedrock Knowledge Bases

Adobe Inc. excels in providing a comprehensive suite of creative tools that empower artists, designers, and developers across various digital disciplines. Their product landscape is the backbone of countless creative projects worldwide, ranging from web design and photo editing to vector graphics and video production.

Adobe’s internal developers use a vast array of wiki pages, software guidelines, and troubleshooting guides. Recognizing the challenge developers faced in efficiently finding the right information for troubleshooting, software upgrades, and more, Adobe’s Developer Platform team sought to build a centralized system. This led to the initiative Unified Support, designed to help thousands of the company’s internal developers get immediate answers to questions from a centralized place and reduce time and cost spent on developer support. For instance, a developer setting up a continuous integration and delivery (CI/CD) pipeline in a new AWS Region or running a pipeline on a dev branch can quickly access Adobe-specific guidelines and best practices through this centralized system.

The initial prototype for Adobe’s Unified Support provided valuable insights and confirmed the potential of the approach. This early phase highlighted key areas requiring further development to operate effectively at Adobe’s scale, including addressing scalability needs, simplifying resource onboarding, improving content synchronization mechanisms, and optimizing infrastructure efficiency. Building on these learnings, improving retrieval precision emerged as the next critical step.

To address these challenges, Adobe partnered with the AWS Generative AI Innovation Center, using Amazon Bedrock Knowledge Bases and the Vector Engine for Amazon OpenSearch Serverless. This solution dramatically improved their developer support system, resulting in a 20% increase in retrieval accuracy. Metadata filtering empowers developers to fine-tune their search, helping them surface more relevant answers across complex, multi-domain knowledge sources. This improvement not only enhanced the developer experience but also contributed to reduced support costs.

In this post, we discuss the details of this solution and how Adobe enhances their developer productivity.

Solution overview

Our project aimed to address two key objectives:

  • Document retrieval engine enhancement – We developed a robust system to improve search result accuracy for Adobe developers. This involved creating a pipeline for data ingestion, preprocessing, metadata extraction, and indexing in a vector database. We evaluated retrieval performance against Adobe’s ground truth data to produce high-quality, domain-specific results.
  • Scalable, automated deployment – To support Unified Support across Adobe, we designed a reusable blueprint for deployment. This solution accommodates large-scale data ingestion of various types and offers flexible configurations, including embedding model selection and chunk size adjustment.

Using Amazon Bedrock Knowledge Bases, we created a customized, fully managed solution that improved the retrieval effectiveness. Key achievements include a 20% increase in accuracy metrics for document retrieval, seamless document ingestion and change synchronization, and enhanced scalability to support thousands of Adobe developers. This solution provides a foundation for improved developer support and scalable deployment across Adobe’s teams. The following diagram illustrates the solution architecture.

Solution architecture diagram

Let’s take a closer look at our solution:

  • Amazon Bedrock Knowledge Bases index – The backbone of our system is Amazon Bedrock Knowledge Bases. Data is indexed through the following stages:
    • Data ingestion – We start by pulling data from Amazon Simple Storage Service (Amazon S3) buckets. This could be anything from resolutions to past issues or wiki pages.
    • Chunking – Amazon Bedrock Knowledge Bases breaks data down into smaller pieces, or chunks, defining the specific units of information that can be retrieved. This chunking process is configurable, allowing for optimization based on the specific needs of the business.
    • Vectorization – Each chunk is passed through an embedding model (in this case, Amazon Titan V2 on Amazon Bedrock), creating a 1,024-dimension numerical vector. This vector represents the semantic meaning of the chunk, allowing for similarity searches.
    • Storage – These vectors are stored in the Amazon OpenSearch Serverless vector database, creating a searchable repository of information.
  • Runtime – When a user poses a question, our system completes the following steps:
    • Query vectorization – With the Amazon Bedrock Knowledge Bases Retrieve API, the user’s question is automatically embedded using the same embedding model used for the chunks during data ingestion.
    • Similarity search and retrieval – The system retrieves the most relevant chunks in the vector database based on similarity scores to the query.
    • Ranking and presentation – The corresponding documents are ranked based on the semantic similarity of their most relevant chunks to the query, and the top-ranked information is presented to the user.

Multi-tenancy through metadata filtering

As developers, we often find ourselves seeking help across various domains. Whether it’s tackling CI/CD issues, setting up project environments, or adopting new libraries, the landscape of developer challenges is vast and varied. Sometimes, our questions even span multiple domains, making it crucial to have a system for retrieving relevant information. Metadata filtering empowers developers to retrieve not just semantically relevant information, but a well-defined subset of that information based on specific criteria. This powerful tool enables you to apply filters to your retrievals, helping developers narrow the search results to a limited set of documents based on the filter, thereby improving the relevancy of the search.

To use this feature, metadata files are provided alongside the source data files in an S3 bucket: each source data file is accompanied by a corresponding metadata file that uses the same base name with a .metadata.json suffix. Each metadata file includes relevant attributes, such as domain, year, or type, to support multi-tenancy and fine-grained filtering in OpenSearch Service. The following code shows what an example metadata file looks like:

{
  "metadataAttributes": {
    "domain": "project A",
    "year": 2016,
    "type": "wiki"
  }
}
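As a minimal sketch of this convention, the following shows how a source document and its metadata file might be uploaded together; the bucket name, object key, and attribute values are placeholders.

import json
import boto3

s3 = boto3.client("s3")
bucket = "my-knowledge-base-bucket"  # placeholder bucket name

def upload_with_metadata(local_path, key, domain, year, doc_type):
    # Upload the source document.
    s3.upload_file(local_path, bucket, key)
    # Upload the companion metadata file: same key, with a .metadata.json suffix.
    metadata = {"metadataAttributes": {"domain": domain, "year": year, "type": doc_type}}
    s3.put_object(Bucket=bucket, Key=f"{key}.metadata.json",
                  Body=json.dumps(metadata).encode("utf-8"))

upload_with_metadata("troubleshooting.md", "wiki/troubleshooting.md",
                     domain="project A", year=2016, doc_type="wiki")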

Retrieve API

The Retrieve API allows querying a knowledge base to retrieve relevant information. You can use it as follows:

  1. Send a POST request to /knowledgebases/knowledgeBaseId/retrieve.
  2. Include a JSON body with the following:
    1. retrievalQuery – Contains the text query.
    2. retrievalConfiguration – Specifies search parameters, such as number of results and filters.
    3. nextToken – For pagination (optional).

The following is an example request syntax:

POST /knowledgebases/knowledgeBaseId/retrieve HTTP/1.1
Content-type: application/json
{
   "nextToken": "string",
   "retrievalConfiguration": { 
      "vectorSearchConfiguration": { 
         "filter": { ... },
         "numberOfResults": number,
         "overrideSearchType": "string"
      }
   },
   "retrievalQuery": { 
      "text": "string"
   }
}
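The following is a minimal sketch of the same call using boto3, combined with a metadata filter that restricts retrieval to the domain attribute shown earlier; the knowledge base ID and query text are placeholders.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve(
    knowledgeBaseId="YOUR-KB-ID",  # placeholder
    retrievalQuery={"text": "How do I set up a CI/CD pipeline in a new AWS Region?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            # Restrict results to documents whose metadata has domain == "project A".
            "filter": {"equals": {"key": "domain", "value": "project A"}},
        }
    },
)
for result in response["retrievalResults"]:
    print(result["content"]["text"])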

Additionally, you can set up the retriever with ease using the langchain-aws package:

from langchain_aws import AmazonKnowledgeBasesRetriever

# Create a retriever backed by the Amazon Bedrock knowledge base.
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR-ID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)
# Return the chunks most relevant to the query.
retriever.get_relevant_documents(query="What is the meaning of life?")

This approach enables semantic querying of the knowledge base to retrieve relevant documents based on the provided query, simplifying the implementation of search.

Experimentation

To deliver the most accurate and efficient knowledge retrieval system, the Adobe and AWS teams put the solution to the test. The team conducted a series of rigorous experiments to fine-tune the system and find the optimal settings.

Before we dive into our findings, let’s discuss the metrics and evaluation process we used to measure success. We used the open source model evaluation framework Ragas to evaluate the retrieval system across two metrics: document relevance and mean reciprocal rank (MRR). Although Ragas comes with many metrics for evaluating model performance out of the box, we needed to implement these metrics by extending the Ragas framework with custom code.

  • Document relevance – Document relevance offers a qualitative approach to assessing retrieval accuracy. This metric uses a large language model (LLM) as an impartial judge to compare retrieved chunks against user queries. It evaluates how effectively the retrieved information addresses the developer’s question, providing a score between 1–10.
  • Mean reciprocal rank – On the quantitative side, we have the MRR metric. MRR evaluates how well a system ranks the first relevant item for a query. For each query, find the rank k of the highest-ranked relevant document. The score for that query is 1/k. MRR is the average of these 1/k scores over the entire set of queries. A higher score (closer to 1) signifies that the first relevant result is typically ranked high.
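The following is a minimal sketch of the MRR calculation described above; the relevance judgment function is a placeholder for however ground truth relevance is determined in the evaluation pipeline.

def mean_reciprocal_rank(ranked_results_per_query, is_relevant):
    # ranked_results_per_query: iterable of (query, ranked_documents) pairs
    # is_relevant(query, document) -> bool, supplied by the evaluation pipeline
    scores = []
    for query, ranked_docs in ranked_results_per_query:
        score = 0.0
        for k, doc in enumerate(ranked_docs, start=1):
            if is_relevant(query, doc):
                score = 1.0 / k  # reciprocal rank of the first relevant document
                break
        scores.append(score)
    return sum(scores) / len(scores)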

These metrics provide complementary insights: document relevance offers a content-based assessment, and MRR provides a ranking-based evaluation. Together, they offer a comprehensive view of the retrieval system’s effectiveness in finding and prioritizing relevant information.

In our recent experiments, we explored various data chunking strategies to optimize the performance of retrieval. We tested several approaches, including fixed-size chunking as well as more advanced semantic chunking and hierarchical chunking. Semantic chunking focuses on preserving the contextual relationships within the data by segmenting it based on semantic meaning. This approach aims to improve the relevance and coherence of retrieved results. Hierarchical chunking organizes data into a hierarchical parent-child structure, allowing for more granular and efficient retrieval based on the inherent relationships within your data.

For more information on how to set up different chunking strategies, refer to Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.

We tested the following chunking methods with Amazon Bedrock Knowledge Bases:

  • Fixed-size short chunking – 400-token chunks with a 20% overlap (shown as the blue variant in the following figure)
  • Fixed-size long chunking – 1,000-token chunks with a 20% overlap
  • Hierarchical chunking – Parent chunks of 1,500 tokens and child chunks of 300 tokens, with a 60-token overlap
  • Semantic chunking – 400-token chunks with a 95% similarity percentile threshold
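The following is a sketch of how the first strategy above (fixed-size 400-token chunks with 20% overlap) might be configured when creating the knowledge base data source with the Bedrock agent API; the IDs, names, and bucket ARN are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR-KB-ID",   # placeholder
    name="unified-support-docs",    # placeholder
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-knowledge-base-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 400, "overlapPercentage": 20},
        }
    },
)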

For reference, a paragraph of approximately 1,000 characters typically translates to around 200 tokens. To assess performance, we measured document relevance and MRR across different context sizes, ranging from 1–5. This comparison aims to provide insights into the most effective chunking strategy for organizing and retrieving information for this use case. The following figures illustrate the MRR and document relevance metrics, respectively.

Experiment results

Experiment results

As a result of these experiments, we found that MRR is a more sensitive metric for evaluating the impact of chunking strategies, particularly when varying the number of retrieved chunks (top-k from 1 to 5). Among the approaches tested, the fixed-size 400-token strategy—shown in blue—proved to be the simplest and most effective, consistently yielding the highest accuracy across different retrieval sizes.

Conclusion

In the journey to design Adobe’s developer Unified Support search and retrieval system, we’ve successfully harnessed the power of Amazon Bedrock Knowledge Bases to create a robust, scalable, and efficient solution. By configuring fixed-size chunking and using the Amazon Titan V2 embedding model, we achieved a remarkable 20% increase in accuracy metrics for document retrieval compared to Adobe’s existing solution, by running evaluations on the customer’s testing system and provided dataset.

The integration of metadata filtering emerged as a game-changing feature, allowing for seamless navigation across diverse domains and enabling customized retrieval. This capability proved invaluable for Adobe, given the complexity and breadth of their information landscape. Our comprehensive comparison of retrieval accuracy for different configurations of the Amazon Bedrock Knowledge Bases index has yielded valuable insights. The metrics we developed provide an objective framework for assessing the quality of retrieved context, which is crucial for applications demanding high-precision information retrieval.

As we look to the future, this customized, fully managed solution lays a solid foundation for continuous improvement in developer support at Adobe, offering enhanced scalability and seamless support infrastructure in tandem with evolving developer needs.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center. To learn more about Amazon Bedrock Knowledge Bases, see Retrieve data and generate AI responses with knowledge bases.


About the Authors

Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. With over a decade of experience in software development, he specializes in building AI-driven solutions, including AI agents. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Nay Doummar is an Engineering Manager on the Unified Support team at Adobe, where she’s been since 2012. Over the years, she has contributed to projects in infrastructure, CI/CD, identity management, containers, and AI. She started on the CloudOps team, which was responsible for migrating Adobe’s infrastructure to the AWS Cloud, marking the beginning of her long-term collaboration with AWS. In 2020, she helped build a support chatbot to simplify infrastructure-related assistance, sparking her passion for user support. In 2024, she joined a project to Unify Support for the Developer Platform, aiming to streamline support and boost productivity.

Varsha Chandan Bellara is a Software Development Engineer at Adobe, specializing in AI-driven solutions to boost developer productivity. She leads the development of an AI assistant for the Unified Support initiative, using Amazon Bedrock, implementing RAG to provide accurate, context-aware responses for technical support and issue resolution. With expertise in cloud-based technologies, Varsha combines her passion for containers and serverless architectures with advanced AI to create scalable, efficient solutions that streamline developer workflows.

Jan Michael Ong is a Senior Software Engineer at Adobe, where he supports the developer community and engineering teams through tooling and automation. Currently, he is part of the Developer Experience team at Adobe, working on AI projects and automation contributing to Adobe’s internal Developer Platform.

Justin Johns is a Deep Learning Architect at Amazon Web Services who is passionate about innovating with generative AI and delivering cutting-edge solutions for customers. With over 5 years of software development experience, he specializes in building cloud-based solutions powered by generative AI.

Gaurav Dhamija is a Principal Solutions Architect at Amazon Web Services, where he helps customers design and build scalable, reliable, and secure applications on AWS. He is passionate about developer experience, containers, and serverless technologies, and works closely with engineering teams to modernize application architectures. Gaurav also specializes in generative AI, using AWS generative AI services to drive innovation and enhance productivity across a wide range of use cases.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services by helping customers ideate, identify, and implement secure generative AI solutions.

Amazon Nova Lite enables Bito to offer a free tier option for its AI-powered code reviews

Amazon Nova Lite enables Bito to offer a free tier option for its AI-powered code reviews

This post is co-written by Amar Goel, co-founder and CEO of Bito.

Meticulous code review is a critical step in the software development process, one that helps deliver high-quality code that’s ready for enterprise use. However, it can be a time-consuming process at scale, when experts must review thousands of lines of code, looking for bugs or other issues. Traditional reviews can be subjective and inconsistent, because humans are human. As generative AI becomes more and more integrated into the software development process, a significant number of new AI-powered code review tools have entered the marketplace, helping software development groups raise quality and ship clean code faster—while boosting productivity and job satisfaction among developers.

Bito is an innovative startup that creates AI agents for a broad range of software developers. It emerged as a pioneer in its field in 2023, when it launched its first AI-powered developer agents, which brought large language models (LLMs) to code-writing environments. Bito’s executive team sees developers and AI as key elements of the future and works to bring them together in powerful new ways with its expanding portfolio of AI-powered agents. Its flagship product, AI Code Review Agent, speeds pull request (PR) time-to-merge by up to 89%—accelerating development times and allowing developers to focus on writing code and innovating. Intended to assist, not replace, review by senior software engineers, Bito’s AI Code Review Agent focuses on summarizing, organizing, and then suggesting line-level fixes with code base context, freeing engineers to focus on the more strategic aspects of code review, such as evaluating business logic.

With more than 100,000 active developers using its products globally, Bito has proven successful at using generative AI to streamline code review—delivering impressive results and business value to its customers, which include Gainsight, Privado, PubMatic, On-Board Data Systems (OBDS), and leading software enterprises. Bito’s customers have found that AI Code Review Agent provides the following benefits:

  • Accelerates PR merges by 89%
  • Reduces regressions by 34%
  • Delivers 87% of a PR’s feedback necessary for review
  • Has a 2.33 signal-to-noise ratio
  • Works in over 50 programming languages

Although these results are compelling, Bito needed to overcome inherent wariness among developers evaluating a flood of new AI-powered tools. To accelerate and expand adoption of AI Code Review Agent, Bito leaders wanted to launch a free tier option, which would let these prospective users experience the capabilities and value that AI Code Review Agent offered—and encourage them to upgrade to the full, for-pay Teams Plan. Its Free Plan would offer AI-generated PR summaries to provide an overview of changes, and its Teams Plan would provide more advanced features, such as downstream impact analysis and one-click acceptance of line-level code suggestions.

In this post, we share how Bito is able to offer a free tier option for its AI-powered code reviews using Amazon Nova.

Choosing a cost-effective model for a free tier offering

To offer a free tier option for AI Code Review Agent, Bito needed a foundation model (FM) that would provide the right level of performance and results at a reasonable cost. Offering code review for free to its potential customers would not be free for Bito, of course, because it would be paying inference costs. To identify a model for its Free Plan, Bito carried out a 2-week evaluation process across a broad range of models, including the high-performing FMs on Amazon Bedrock, as well as OpenAI GPT-4o mini. The Amazon Nova models—fast, cost-effective models that were recently introduced on Amazon Bedrock—were particularly interesting to the team.

At the end of its evaluation, Bito determined that Amazon Nova Lite delivered the right mix of performance and cost-effectiveness for its use cases. Its speed provided fast creation of code review summaries. However, cost—a key consideration for Bito’s Free Plan—proved to be the deciding factor. Ultimately, Amazon Nova Lite met Bito’s criteria for speed, cost, and quality. The combination of Amazon Nova Lite and Amazon Bedrock also makes it possible for Bito to offer the reliability and security that its customers need when entrusting Bito with their code. After all, careful control of code is one of Bito’s core promises to its customers. It doesn’t store code or use it for model training. And its products are SOC 2 Type 2-certified to provide data security, processing integrity, privacy, and confidentiality.

Implementing the right models for different tiers of offerings

Bito has now adopted Amazon Bedrock as its standardized platform to explore, add, and run models. Bito uses Amazon Nova Lite as the primary model for its Free Plan, and Anthropic’s Claude 3.7 Sonnet powers its for-pay Teams Plan, all accessed and integrated through the unified Amazon Bedrock API and controls. Amazon Bedrock provides seamless shifting from Amazon Nova Lite to Anthropic’s Claude 3.7 Sonnet when customers upgrade, with minimal code changes. Bito leaders are quick to point out that Amazon Nova Lite doesn’t just power its Free Plan—it inspired it. Without the very low cost of Amazon Nova Lite, they wouldn’t have been able to offer a free tier of AI Code Review Agent, which they viewed as a strategic move that would enable the company to expand its enterprise customer base. This strategy quickly generated results, attracting three times more prospective customers to its Free Plan than anticipated. At the end of the 14-day trial period, a significant number of users convert to the full AI Code Review Agent to access its full array of capabilities.
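Because both tiers go through the same Amazon Bedrock runtime API, moving a request between models is largely a matter of changing the model ID. The following minimal sketch illustrates the pattern with the Converse API; the tier-to-model mapping, helper function, and exact model IDs are illustrative assumptions, not Bito’s actual implementation.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Hypothetical mapping of plan tier to Bedrock model ID; confirm exact IDs in the Bedrock console
TIER_MODEL_IDS = {
    "free": "amazon.nova-lite-v1:0",
    "teams": "anthropic.claude-3-7-sonnet-20250219-v1:0",
}

def review_summary(diff_text: str, tier: str = "free") -> str:
    """Generate a PR summary with the model assigned to the caller's plan tier."""
    response = bedrock_runtime.converse(
        modelId=TIER_MODEL_IDS[tier],
        messages=[{"role": "user", "content": [{"text": f"Summarize this pull request diff:\n{diff_text}"}]}],
        inferenceConfig={"maxTokens": 1000, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]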

Encouraged by the success with AI Code Review Agent, Bito is now using Amazon Nova Lite to power the chat capabilities of Bito Wingman, its latest agentic AI offering—a full-featured developer assistant in the integrated development environment (IDE) that combines code generation, error handling, architectural advice, and more. Again, the combination of quality and low cost of Amazon Nova Lite made it the right choice for Bito.

Conclusion

In this post, we shared how Bito—an innovative startup that offers a growing portfolio of AI-powered developer agents—chose Amazon Nova Lite to power its free tier offering of AI Code Review Agent, its flagship product. Its AI-powered agents are designed specifically to make developers’ lives easier and their work more impactful:

  • Amazon Nova Lite enabled Bito to meet one of its core business challenges—attracting enterprise customers. By introducing a free tier, Bito attracted three times more prospective new customers to its generative AI-driven flagship product—AI Code Review Agent.
  • Amazon Nova Lite outperformed other models during rigorous internal testing, providing the right level of performance at the very low cost Bito needed to launch a free tier of AI Code Review Agent.
  • Amazon Bedrock empowers Bito to seamlessly switch between models as needed for each tier of AI Code Review Agent—Amazon Nova Lite for its Free Plan and Anthropic’s Claude 3.7 Sonnet for its for-pay Teams Plan. Amazon Bedrock also provided security and privacy, critical considerations for Bito customers.
  • Bito shows how innovative organizations can use the combination of quality, cost-effectiveness, and speed in Amazon Nova Lite to deliver value to their customers—and to their business.

“Our challenge is to push the capabilities of AI to deliver new value to developers, but at a reasonable cost,” shares Amar Goel, co-founder and CEO of Bito. “Amazon Nova Lite gives us the very fast, low-cost model we needed to power the free offering of our AI Code Review Agent—and attract new customers.”

Get started with Amazon Nova on the Amazon Bedrock console. Learn more about Amazon Nova Lite at the Amazon Nova product page.


About the authors

Eshan Bhatnagar is the Director of Product Management for Amazon AGI at Amazon Web Services.

Amar Goel is Co-Founder and CEO of Bito. A serial entrepreneur, Amar previously founded PubMatic (went public in 2020), and formerly worked at Microsoft, McKinsey, and was a software engineer at Netscape, the original browser company. Amar attended Harvard University. He is excited about using GenAI to power the next generation of how software gets built!

Read More

How Gardenia Technologies helps customers create ESG disclosure reports 75% faster using agentic generative AI on Amazon Bedrock

How Gardenia Technologies helps customers create ESG disclosure reports 75% faster using agentic generative AI on Amazon Bedrock

This post was co-written with Federico Thibaud, Neil Holloway, Fraser Price, Christian Dunn, and Frederica Schrager from Gardenia Technologies

“What gets measured gets managed” has become a guiding principle for organizations worldwide as they begin their sustainability and environmental, social, and governance (ESG) journeys. Companies are establishing baselines to track their progress, supported by an expanding framework of reporting standards, some mandatory and some voluntary. However, ESG reporting has evolved into a significant operational burden. A recent survey shows that 55% of sustainability leaders cite excessive administrative work in report preparation, while 70% indicate that reporting demands inhibit their ability to execute strategic initiatives. This environment presents a clear opportunity for generative AI to automate routine reporting tasks, allowing organizations to redirect resources toward more impactful ESG programs.

Gardenia Technologies, a data analytics company, partnered with the AWS Prototyping and Cloud Engineering (PACE) team to develop Report GenAI, a fully automated ESG reporting solution powered by the latest generative AI models on Amazon Bedrock. This post dives deep into the technology behind an agentic search solution using tooling with Retrieval Augmented Generation (RAG) and text-to-SQL capabilities to help customers reduce ESG reporting time by up to 75%.

In this post, we demonstrate how AWS serverless technology, combined with agents in Amazon Bedrock, is used to build scalable and highly flexible agent-based document assistant applications.

Scoping the challenge: Growing ESG reporting requirements and complexity

Sustainability disclosures are now a standard part of corporate reporting, with 96% of the 250 largest companies reporting on their sustainability progress based on government and regulatory frameworks. To meet reporting mandates, organizations must overcome many data collection and process-based barriers. Data for a single report includes thousands of data points from a multitude of sources including official documentation, databases, unstructured document stores, utility bills, and emails. The EU Corporate Sustainability Reporting Directive (CSRD) framework, for example, comprises 1,200 individual data points that need to be collected across an enterprise. Even voluntary disclosures like the CDP questionnaire, which contains approximately 150 questions, cover a wide range of topics related to climate risk and impact, water stewardship, land use, and energy consumption. Collecting this information across an organization is time consuming.

A secondary challenge is that many organizations with established ESG programs need to report to multiple disclosure frameworks, such as SASB, GRI, and TCFD, each using different reporting and disclosure standards. To complicate matters, reporting requirements are continually evolving, leaving organizations struggling just to keep up with the latest changes. Today, much of this work is highly manual and leaves sustainability teams spending more time on managing data collection and answering questionnaires rather than developing impactful sustainability strategies.

Solution overview: Automating undifferentiated heavy lifting with AI agents

Gardenia’s approach to strengthen ESG data collection for enterprises is Report GenAI, an agentic framework using generative AI models on Amazon Bedrock to automate large chunks of the ESG reporting process. Report GenAI pre-fills reports by drawing on existing databases, document stores and web searches. The agent then works collaboratively with ESG professionals to review and fine-tune responses. This workflow has five steps to help automate ESG data collection and assist in curating responses. These steps include setup, batch-fill, review, edit, and repeat. Let’s explore each step in more detail.

  • Setup: The Report GenAI agent is configured and authorized to access an ESG and emissions database, client document stores (emails, previous reports, data sheets), and document searches over the public internet. Client data is stored within specified AWS Regions using encrypted Amazon Simple Storage Service (Amazon S3) buckets with VPC endpoints for secure access, while relational data is hosted in Amazon Relational Database Service (Amazon RDS) instances deployed within Gardenia’s virtual private cloud (VPC). This architecture helps make sure data residency requirements can be fulfilled, while maintaining strict access controls through private network connectivity. The agent also has access to the relevant ESG disclosure questionnaire including questions and expected response format (we refer to this as a report specification). The following figure is an example of the Report GenAI user interface at the agent configuration step. As shown in the figure, the user can choose which databases, documents, or other tools the agent will use to answer a given question.

The user can choose which databases, documents or other tools the agent will use to answer a given question

  • Batch-fill: The agent iterates through each question and data point to be disclosed and retrieves relevant data from the client document stores and document searches. This information is processed to produce a response in the expected format depending on the disclosure report requirements.
  • Review: Each response includes cited sources and—if the response is quantitative—calculation methodology. This enables users to maintain a clear audit trail and verify the accuracy of batch-filled responses quickly.
  • Edit: While the agentic workflow is automated, our approach allows for a human-in-the-loop to review, validate, and iterate on batch-filled facts and figures. In the following figure, we show how users can chat with the AI assistant to request updates or manually refine responses. When the user is satisfied, the final answer is recorded. The agent will show references from which responses were sourced and allow the user to modify answers either directly or by providing an additional prompt.

The agent will show references from which responses were sourced and allow the user to modify answers either directly or by providing additional prompt

  • Repeat: Users can batch-fill multiple reporting frameworks to simplify and expand their ESG disclosure scope while avoiding extra effort to manually complete multiple questionnaires. After a report has been completed, it can then be added to the client document store so future reports can draw on it for knowledge. Report GenAI also supports bring your own report, which allows users to develop their own reporting specification (question and response model), which can then be imported into the application, as shown in the following figure.

The user can submit their own list of questions and configure the agent to pre-fill all responses in a single batch

Now that you have a description of the Report GenAI workflow, let’s explore how the architecture is built.

Architecture deep-dive: A serverless generative AI agent

The Report GenAI architecture consists of six components, as illustrated in the following figure: a user interface (UI), the generative AI executor, the web search endpoint, a text-to-SQL tool, the RAG tool, and an embedding generation pipeline. The UI, generative AI executor, and generation pipeline components help orchestrate the workflow. The remaining three components work together to generate responses by performing the following actions:

  • Web search tool: Uses an internet search engine to retrieve content from public web pages.
  • Text-to-SQL tool: Generates and executes SQL queries to the company’s emissions database hosted by Gardenia Technologies. The tool uses natural language requests, such as “What were our Scope 2 emissions in 2024,” as input and returns the results from the emissions database.
  • Retrieval Augmented Generation (RAG) tool: Accesses information from the corporate document store (such as procedures, emails, and internal reports) and uses it as a knowledge base. This component acts as a retriever to return relevant text from the document store as a plain text query.

Architecture Diagram

Let’s take a look at each of the components.

1: Lightweight UI hosted on auto-scaled Amazon ECS Fargate

Users access Report GenAI by using the containerized Streamlit frontend. Streamlit offers an appealing UI for data and chat apps, allowing data scientists and ML engineers to build convincing user experiences with relatively limited effort. While not typically used for large-scale deployments, Streamlit proved to be a suitable choice for the initial iteration of Report GenAI.

The frontend is hosted on a load-balanced and auto-scaled Amazon Elastic Container Service (Amazon ECS) with Fargate launch type. This implementation of the frontend not only reduces the management overhead but also suits the expected intermittent usage pattern of Report GenAI, which is anticipated to be spiky, with high-usage periods around the times when new reports must be generated (typically quarterly or yearly) and lower usage outside these windows. User authentication and authorization are handled by Amazon Cognito.
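As a rough illustration of this pattern (not Gardenia’s actual frontend code), a containerized Streamlit chat page needs only a few lines to collect a question and forward it to the agent backend; the backend URL and request shape below are hypothetical.

import requests
import streamlit as st

AGENT_API_URL = "https://example.internal/report-genai/agent"  # hypothetical agent executor endpoint

st.title("Report GenAI")
question = st.chat_input("Ask about your ESG disclosures")

if question:
    st.chat_message("user").write(question)
    # Forward the question to the agent executor (authentication omitted for brevity)
    answer = requests.post(AGENT_API_URL, json={"question": question}, timeout=120).json()["answer"]
    st.chat_message("assistant").write(answer)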

2: Central agent executor

The executor is an agent that uses reasoning capabilities of leading text-based foundation models (FMs) (for example, Anthropic’s Claude Sonnet 3.5 and Haiku 3.5) to break down user requests, gather information from document stores, and efficiently orchestrate tasks. The agent uses Reason and Act (ReAct), a prompt-based technique that enables large language models (LLMs) to generate both reasoning traces and task-specific actions in an interleaved manner. Reasoning traces help the model develop, track, and update action plans, while actions allow it to interface with a set of tools and information sources (also known as knowledge bases) that it can use to fulfil the task. The agent is prompted to think about an optimal sequence of actions to complete a given task with the tools at its disposal, observe the outcome, and iterate and improve until satisfied with the answer.

In combination, these tools provide the agent with capabilities to iteratively complete complex ESG reporting templates. The expected questions and response format for each questionnaire is captured by a report specification (ReportSpec) using Pydantic to enforce the desired output format for each reporting standard (for example, CDP, or TCFD). This ReportSpec definition is inserted into the task prompt. The first iteration of Report GenAI used Claude Sonnet 3.5 on Amazon Bedrock. As more capable and more cost effective LLMs become available on Amazon Bedrock (such as the recent release of Amazon Nova models), foundation models in Report GenAI can be swapped to remain up to date with the latest models.
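The post does not publish the ReportSpec schema itself, but a minimal Pydantic sketch of the idea could look like the following; the class and field names are illustrative assumptions, not Gardenia’s actual definitions.

from typing import List, Literal, Optional
from pydantic import BaseModel, Field

class ReportQuestion(BaseModel):
    question_id: str
    question_text: str
    response_format: Literal["narrative", "numeric", "table"] = "narrative"
    unit: Optional[str] = None  # for example, "tCO2e" for quantitative answers

class ReportSpec(BaseModel):
    framework: Literal["CDP", "TCFD", "CSRD"]
    questions: List[ReportQuestion] = Field(default_factory=list)

# The JSON schema of the spec can be inserted into the agent's task prompt to constrain its output
spec = ReportSpec(
    framework="CDP",
    questions=[ReportQuestion(
        question_id="C6.1",
        question_text="What were your gross global Scope 1 emissions?",
        response_format="numeric",
        unit="tCO2e",
    )],
)
print(spec.model_dump_json(indent=2))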

The agent executor is hosted on AWS Lambda and uses the open source LangChain framework to implement the ReAct orchestration logic and the needed integrations with memory, LLMs, tools, and knowledge bases. LangChain offers deep integration with AWS using the first-party langchain-aws module. The langchain-aws module provides useful one-line wrappers to call tools using AWS Lambda, draw from a chat memory backed by Amazon DynamoDB, and call LLMs on Amazon Bedrock. LangChain also provides fine-grained visibility into each step of the ReAct decision-making process to provide decision transparency.
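The post describes a Lambda-hosted ReAct executor built with LangChain and langchain-aws; as a simplified stand-in, the sketch below wires an Amazon Bedrock chat model to a stubbed tool using LangGraph’s prebuilt ReAct agent. The tool body, its return value, and the model ID are assumptions for illustration only.

from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def emissions_sql_query(question: str) -> str:
    """Answer analytical questions by querying the emissions database (stubbed here)."""
    return "Scope 2 emissions in 2024: 1,234 tCO2e"  # placeholder result

llm = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")  # example model ID
agent = create_react_agent(llm, tools=[emissions_sql_query])

result = agent.invoke({"messages": [("user", "What were our Scope 2 emissions in 2024?")]})
print(result["messages"][-1].content)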

LangChain provides fine-grained visibility into each step of the ReAct decision making process to provide decision transparency

3: Web-search tool

The web search tool is hosted on Lambda and calls an internet search engine through an API. The agent executor retrieves the information returned from the search engine to formulate a response. Web searches can be used in combination with the RAG tool to retrieve public context needed to formulate responses for certain generic questions, such as providing a short description of the reporting company or entity.
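A Lambda handler for a tool like this can be very small; the search provider, its response shape, and the environment variable below are hypothetical stand-ins for whichever search engine and API are actually used.

import json
import os
import urllib.parse
import urllib.request

SEARCH_API_URL = os.environ.get("SEARCH_API_URL", "https://search.example.invalid/search")  # hypothetical

def lambda_handler(event, context):
    """Run a web search on behalf of the agent executor and return the top result snippets."""
    query = event["query"]
    url = f"{SEARCH_API_URL}?q={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        results = json.loads(resp.read())
    # Assume the provider returns {"results": [{"snippet": "..."}, ...]}
    snippets = [r.get("snippet", "") for r in results.get("results", [])[:5]]
    return {"statusCode": 200, "body": json.dumps({"snippets": snippets})}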

4: Text-to-SQL tool

A large portion of ESG reporting requirements is analytical in nature and requires processing of large amounts of numerical or tabular data. For example, a reporting standard might ask for total emissions in a certain year or quarter. LLMs are ill-equipped for questions of this nature. The Lambda-hosted text-to-SQL tool provides the agent with the required analytical capabilities. The tool uses a separate LLM to generate a valid SQL query given a natural language question along with the schema of an emissions database hosted on Gardenia. The generated query is then executed against this database and the results are passed back to the agent executor. SQL linters and error-correction loops are used for added robustness.
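A minimal sketch of this idea follows: an LLM on Amazon Bedrock is given the emissions schema and a natural language question and asked to return SQL, which would then be linted and executed against the database with an error-correction loop. The schema, prompt, and model ID are illustrative assumptions.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
EMISSIONS_SCHEMA = "CREATE TABLE emissions (year INT, scope INT, tco2e DOUBLE PRECISION);"  # simplified example schema

def text_to_sql(question: str) -> str:
    """Ask an LLM to translate a natural language question into SQL for the emissions schema."""
    prompt = (
        "Given this PostgreSQL schema:\n"
        f"{EMISSIONS_SCHEMA}\n"
        f"Write a single SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()

# The generated SQL would then be validated and run against the emissions database
print(text_to_sql("What were our Scope 2 emissions in 2024?"))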

5: Retrieval Augmented Generation (RAG) tool

Much of the information required to complete ESG reporting resides in internal, unstructured document stores and can consist of PDF or Word documents, Excel spreadsheets, and even emails. Given the size of these document stores, a common approach is to use knowledge bases with vector embeddings for semantic search. The RAG tool enables the agent executor to retrieve only the relevant parts of the document store to answer questions. The RAG tool is hosted on Lambda and uses an in-memory Faiss index as a vector store. The index is persisted on Amazon S3 and loaded on demand whenever required. This workflow is advantageous for the given workload because of the intermittent usage of Report GenAI. The RAG tool accepts a plain text query from the agent executor as input and uses an embedding model on Amazon Bedrock to perform a vector search against the vector database. The retrieved text is returned to the agent executor.
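A minimal sketch of this retrieval flow is shown below, assuming the index was built with Amazon Titan Text Embeddings and persisted to Amazon S3; the bucket, key, and chunk lookup are hypothetical.

import json
import boto3
import faiss
import numpy as np

s3 = boto3.client("s3")
bedrock_runtime = boto3.client("bedrock-runtime")

# Hypothetical bucket and key for the persisted index
s3.download_file("report-genai-example-bucket", "indexes/docs.faiss", "/tmp/docs.faiss")
index = faiss.read_index("/tmp/docs.faiss")

def embed(text: str) -> np.ndarray:
    """Embed a query with Amazon Titan Text Embeddings on Amazon Bedrock."""
    resp = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"], dtype="float32")

query_vector = embed("What emission reduction targets are described in our policies?").reshape(1, -1)
distances, ids = index.search(query_vector, k=5)  # retrieve the 5 closest chunks
# The matching chunk texts (looked up by id in a side store) are returned to the agent executor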

6: Embedding generation asynchronous pipeline

To make text searchable, it must be indexed in a vector database. AWS Step Functions provides a straightforward orchestration framework to manage this process: extracting plain text from the various document types, chunking it into manageable pieces, embedding the text, and then loading embeddings into a vector DB. Amazon Textract can be used as the first step for extracting text from visual-heavy documents like presentations or PDFs. An embedding model such as Amazon Titan Text Embeddings can then be used to embed the text and store it into a vector DB such as LanceDB. Note that Amazon Bedrock Knowledge Bases provides an end-to-end retrieval service automating most of the steps that were just described. However, for this application, Gardenia Technologies opted for a fully flexible implementation to retain full control over each design choice of the RAG pipeline (text extraction approach, embedding model choice, and vector database choice) at the expense of higher management and development overhead.
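The chunking step in such a pipeline can be as simple as the helper below, which splits extracted plain text into overlapping windows; the window and overlap sizes are arbitrary example values. Each chunk would then be embedded (as in the RAG sketch above) and written to the vector database.

def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list:
    """Split extracted plain text into overlapping chunks suitable for embedding."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across chunk boundaries
    return chunks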

Evaluating agent performance

Making sure of accuracy and reliability in ESG reporting is paramount, given the regulatory and business implications of these disclosures. Report GenAI implements a sophisticated dual-layer evaluation framework that combines human expertise with advanced AI validation capabilities.

Validation is done both at a high level (such as evaluating full question responses) and sub-component level (such as breaking down to RAG, SQL search, and agent trajectory modules). Each of these has separate evaluation sets in addition to specific metrics of interest.

Human expert validation

The solution’s human-in-the-loop approach allows ESG experts to review and validate the AI-generated responses. This expert oversight serves as the primary quality control mechanism, making sure that generated reports align with both regulatory requirements and organization-specific context. The interactive chat interface enables experts to:

  • Verify factual accuracy of automated responses
  • Validate calculation methodologies
  • Verify proper context interpretation
  • Confirm regulatory compliance
  • Flag potential discrepancies or areas requiring additional review

A key feature in this process is the AI reasoning module, which displays the agent’s decision-making process, providing transparency into not only what answers were generated but how the agent arrived at those conclusions.

The user can review the steps in the AI agent’s reasoning to validate answers

These expert reviews provide valuable training data that can be used to enhance system performance through refinements to RAG implementations, agent prompts, or underlying language models.

AI-powered quality assessment

Complementing human oversight, Report GenAI uses state-of-the-art LLMs on Amazon Bedrock as LLM judges. These models are prompted to evaluate:

  • Response accuracy relative to source documentation
  • Completeness of answers against question requirements
  • Consistency with provided context
  • Alignment with reporting framework guidelines
  • Mathematical accuracy of calculations

The LLM judge operates by:

  • Analyzing the original question and context
  • Reviewing the generated response and its supporting evidence
  • Comparing the response against retrieved data from structured and unstructured sources
  • Providing a confidence score and detailed assessment of the response quality
  • Flagging potential issues or areas requiring human review

This dual-validation approach creates a robust quality assurance framework that combines the pattern recognition capabilities of AI with human domain expertise. The system continuously improves through feedback loops, where human corrections and validations help refine the AI’s understanding and response generation capabilities.
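The post does not include the judge prompt, but a minimal sketch of an LLM-as-judge call on Amazon Bedrock might look like the following; the prompt wording, scoring rubric, and model ID are illustrative assumptions, and a production version would parse the model output more defensively.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are reviewing an automated ESG disclosure answer.
Question: {question}
Retrieved evidence: {evidence}
Generated answer: {answer}
Score the answer from 1-5 for accuracy, completeness, and consistency with the evidence.
Respond in JSON with keys "accuracy", "completeness", "consistency", "flag_for_review", "comments"."""

def judge_response(question: str, evidence: str, answer: str) -> dict:
    """Use an LLM judge on Amazon Bedrock to score a generated disclosure answer."""
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example judge model
        messages=[{"role": "user", "content": [{"text": JUDGE_PROMPT.format(
            question=question, evidence=evidence, answer=answer)}]}],
        inferenceConfig={"temperature": 0},
    )
    # Assumes the model returns valid JSON; add validation and retries in practice
    return json.loads(response["output"]["message"]["content"][0]["text"])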

How Omni Helicopters International cuts its reporting time by 75%

Omni Helicopters International cut their CDP reporting time by 75% using Gardenia’s Report GenAI solution. In previous years, OHI’s CDP reporting required one month of dedicated effort from their sustainability team. Using Report GenAI, OHI tracked their GHG inventory and relevant KPIs in real time and then prepared their 2024 CDP submission in just one week. Read the full story in Preparing Annual CDP Reports 75% Faster.

“In previous years we needed one month to complete the report, this year it took just one week,” said Renato Souza, Executive Manager QSEU at OTA. “The ‘Ask the Agent’ feature made it easy to draft our own answers. The tool was a great support and made things much easier compared to previous years.”

Conclusion

In this post, we stepped through how AWS and Gardenia collaborated to build Report GenAI, an automated ESG reporting solution that relieves ESG experts of the undifferentiated heavy lifting of data gathering and analysis associated with a growing ESG reporting burden. This frees up time for more impactful, strategic sustainability initiatives. Report GenAI is available on the AWS Marketplace today. To dive deeper and start developing your own generative AI app to fit your use case, explore this workshop on building an Agentic LLM assistant on AWS.


About the Authors

Federico Thibaud is the CTO and Co-Founder of Gardenia Technologies, where he leads the data and engineering teams, working on everything from data acquisition and transformation to algorithm design and product development. Before co-founding Gardenia, Federico worked at the intersection of finance and tech — building a trade finance platform as lead developer and developing quantitative strategies at a hedge fund.

Neil Holloway is Head of Data Science at Gardenia Technologies where he is focused on leveraging AI and machine learning to build and enhance software products. Neil holds a masters degree in Theoretical Physics, where he designed and built programs to simulate high energy collisions in particle physics.

Fraser Price is a GenAI-focused Software Engineer at Gardenia Technologies in London, where he focuses on researching, prototyping and developing novel approaches to automation in the carbon accounting space using GenAI and machine learning. He received his MEng in Computing: AI from Imperial College London.

Christian Dunn is a Software Engineer based in London building ETL pipelines, web-apps, and other business solutions at Gardenia Technologies.

Frederica Schrager is a Marketing Analyst at Gardenia Technologies.

Karsten Schroer is a Senior ML Prototyping Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.

Mohamed Ali Jamaoui is a Senior ML Prototyping Architect with over 10 years of experience in production machine learning. He enjoys solving business problems with machine learning and software engineering, and helping customers extract business value with ML. As part of AWS EMEA Prototyping and Cloud Engineering, he helps customers build business solutions that leverage innovations in MLOPs, NLP, CV and LLMs.

Marco Masciola is a Senior Sustainability Scientist at AWS. In his role, Marco leads the development of IT tools and technical products to support AWS’s sustainability mission. He’s held various roles in the renewable energy industry, and leans on this experience to build tooling to support sustainable data center operations.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies.

Read More

NVIDIA Nemotron Super 49B and Nano 8B reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

NVIDIA Nemotron Super 49B and Nano 8B reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

This post is co-written with Eliuth Triana Isaza, Abhishek Sawarkar, and Abdullahi Olaoye from NVIDIA.

Today, we are excited to announce that the Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 are available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s newest reasoning models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on Amazon Bedrock Marketplace and SageMaker JumpStart.

About NVIDIA NIMs on AWS

NVIDIA NIM inference microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker AI, to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise, available in the AWS Marketplace, NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models from open source community models to NVIDIA AI Foundation and custom models. NIM microservices are deployed with a single command for easy integration into generative AI applications using industry-standard APIs and just a few lines of code, or with a few actions in the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM ensures generative AI applications can be deployed anywhere.

Overview of NVIDIA Nemotron models

In this section, we provide an overview of the NVIDIA Nemotron Super and Nano NIM microservices discussed in this post.

Llama 3.3 Nemotron Super 49B V1

Llama-3.3-Nemotron-Super-49B-v1 is an LLM that is a derivative of Meta Llama-3.3-70B-Instruct (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and task execution, such as Retrieval-Augmented Generation (RAG) and tool calling. The model supports a context length of 128K tokens. Using a novel Neural Architecture Search (NAS) approach, NVIDIA greatly reduced the model’s memory footprint and increased efficiency to support larger workloads and to let the model fit onto a single NVIDIA Hopper GPU, such as the H200 in the Amazon EC2 P5 instance family, at high workloads.

Llama 3.1 Nemotron Nano 8B V1

Llama-3.1-Nemotron-Nano-8B-v1 is an LLM that is a derivative of Meta Llama-3.1-8B-Instruct (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and task execution, such as RAG and tool calling. The model supports a context length of 128K tokens. It is created from Llama 3.1 8B Instruct and offers improvements in model accuracy. The model fits on a single H100 or A100 GPU (P5 or P4 instances) and can be used locally.

About Amazon Bedrock Marketplace

Amazon Bedrock Marketplace plays a pivotal role in democratizing access to advanced AI capabilities through several key advantages:

  • Comprehensive model selection – Amazon Bedrock Marketplace offers an exceptional range of models, from proprietary to publicly available options, allowing organizations to find the perfect fit for their specific use cases.
  • Unified and secure experience – By providing a single access point for all models through the Amazon Bedrock APIs, Bedrock Marketplace significantly simplifies the integration process. Organizations can use these models securely, and for models that are compatible with the Amazon Bedrock Converse API, you can use the robust toolkit of Amazon Bedrock, including Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Flows.
  • Scalable infrastructure – Amazon Bedrock Marketplace offers configurable scalability through managed endpoints, allowing organizations to select their desired number of instances, choose appropriate instance types, define custom auto scaling policies that dynamically adjust to workload demands, and optimize costs while maintaining performance.

Deploy NVIDIA Llama Nemotron models in Amazon Bedrock Marketplace

Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access the Nemotron reasoning models in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
    You can also use the InvokeModel API to invoke the model. The InvokeModel API doesn’t support Converse APIs and other Amazon Bedrock tooling.
  2. On the Model catalog page, filter for NVIDIA as a provider and choose the Llama 3.3 Nemotron Super 49B V1 model.

The Model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

  1. To begin using the Llama 3.3 Nemotron Super 49B V1 model, choose Subscribe to subscribe to the marketplace offer.
  2. On the model detail page, choose Deploy.

You will be prompted to configure the deployment details for the model. The model ID will be pre-populated.

  1. For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
  2. For Number of instances, enter a number of instances (between 1–100).
  3. For Instance type, choose your instance type. For optimal performance with Nemotron Super, a GPU-based instance type like ml.g6e.12xlarge is recommended.
    Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you should review these settings to align with your organization’s security and compliance requirements.
  4. Choose Deploy to begin using the model.

When the deployment is complete, you can test its capabilities directly in the Amazon Bedrock playground. This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. A similar process can be followed for deploying the Llama 3.1 Nemotron Nano 8B V1 model as well.

Run inference with the deployed Nemotron endpoint

The following code example demonstrates how to perform inference using a deployed model through Amazon Bedrock using the InvokeModel API. The script initializes the bedrock_runtime client, configures inference parameters, and sends a request to generate text based on a user prompt. With the Nemotron Super and Nano models, we can use a soft switch to toggle reasoning on and off. In the content field, set detailed thinking on or detailed thinking off.

Request

import boto3
import json

# Initialize Bedrock client
bedrock_runtime = boto3.client("bedrock-runtime")

# Configuration
MODEL_ID = ""  # Replace with Bedrock Marketplace endpoint arn

def invoke_model(prompt, max_tokens=1000, temperature=0.6, top_p=0.9):
    """
    Simple Bedrock model invocation
    """
    # Prepare model input
    body = {
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [
            {   
                "role": "system",
                "content": "detailed thinking on" # to turn thinking off, simply toggle this
            },
            {
                "role": "user",
                "content": prompt  # Use the prompt parameter
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p
    }
    
    # Invoke model
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(body)
    )

    # Parse, print, and return the response
    model_output = json.loads(response['body'].read())
    print(json.dumps(model_output, indent=2))
    return model_output
    
# Example usage
if __name__ == "__main__":
    prompt = "Explain how a transformer neural network works."
    result = invoke_model(prompt)
    print(result)

Response body

{
  "id": "chat-43c35c87d5a646c9b3dcaba461eefeaa",
  "object": "chat.completion",
  "created": 1749566119,
  "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>nOkay, I need to explain how a transformer neural network works. Let me start by recalling what I know about transformers. They were introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017, right? Before transformers, sequence models like RNNs and LSTMs were common, but they had issues with parallelization because of their sequential processing. Transformers changed that by using self-attention mechanisms, which allow parallel processing of the entire input sequence at once.nnSo, the main components of a transformer are the encoder and the decoder. Each encoder layer has two main sub-layers: multi-head self-attention and a fully connected feed-forward network. The decoder has these two plus an additional multi-head attention layer to interact with the encoder's output.nnLet me break down the encoder first. The self-attention mechanism is key here. Self-attention allows the model to attend to all positions in the input sequence simultaneously and weigh their importance. The multi-head part means that instead of just one attention mechanism, you have multiple (like 8 or 12) running in parallel, each with a different linear projection. This helps capture different types of relationships between tokens.nnEach head computes attention by using Query (Q), Key (K), and Value (V) matrices. These are derived from the input by linear transformations. The attention weights are computed as the dot product of Q and K transposed, divided by the square root of the key's dimensionality (for stability). Then, these weights are used to compute a weighted sum of the Value vectors. This gives the output for one attention head. Since there are multiple heads, their outputs are concatenated and then linearly transformed.nnAfter the multi-head attention, there's a feed-forward network (FFN) applied individually to each position. It's a simple neural network with a ReLU activation in the middle. Both the attention and FFN layers use residual connections followed by layer normalization.nnThe decoder is similar but has an extra attention layer that attends to the encoder's output. Also, in the decoder's self-attention, they use masking to prevent looking ahead in the sequence, which is important for tasks like translation where you generate one token at a time.nnPositional encodings are added to the input embeddings because transformers don't have recurrence or convolution, so they need a way to preserve sequence order. The encodings are fixed and based on sine and cosine functions, which allows the model to handle sequences of varying lengths.nnI should also mention the overall architecture: the input goes through an embedding layer plus positional encoding, then through a stack of encoder layers. The output from the encoder goes into the decoder's stack of layers, and the final output goes through a linear layer and softmax for tasks like machine translation.nnWait, maybe I should explain each part step by step. Start with the overall structure, then dive into encoder and decoder layers, then the self-attention mechanism, positional encodings, and how it's different from RNNs.nnPotential pitfalls: Confusing encoder and decoder layers, not explaining why multi-head attention is useful, or omitting the role of layer normalization and residual connections. Also, ensuring that the explanation of attention weights calculation is clear.nnLet me check if I missed anything. Oh, the decoder's structure: after self-attention, there's encoder-decoder attention, then FFN. 
And the output linear layer. Also, the training process involves teacher forcing, but maybe that's beyond the basic explanation.nnI need to make sure the explanation is clear for someone who knows basics of neural networks but not transformers. Use analogies if possible, like how attention is like selectively focusing on different parts of the input when processing each element.nnAnother point: the efficiency. Since attention is O(n^2) with sequence length n, but it's parallelizable, which is better than O(n) for RNNs which can't be parallelized. But for very long sequences, transformers can be computationally expensive.nnAlso, the difference between BERT-style transformers (encoder-only) and sequence-to-sequence models (both encoder and decoder). But the question is about transformer neural networks in general, so covering the basic encoder-decoder structure is sufficient.nnOkay, putting it all together in a structured way: introduction, architecture overview, encoder details, decoder details, key components (self-attention, positional encodings), advantages, and maybe an example application.n</think>nnA transformer neural network is a revolutionary architecture introduced in 2017 for sequence-to-sequence tasks, replacing recurrent neural networks (RNNs) with self-attention mechanisms for parallel processing. Here's a structured explanation:nn### **1. Architecture Overview**n- **Encoder**: Takes in a sequence (e.g., text) and outputs a continuous representation.n- **Decoder**: Generates an output sequence (e.g., translated text) one token at a time, conditioned on the"
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "total_tokens": 1029,
    "completion_tokens": 1000
  },
  "prompt_logprobs": null
}

Amazon SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks. You can now discover and deploy Llama 3.3 Nemotron Super 49B V1 and Llama-3.1-Nemotron-Nano-8B-v1 in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, so you can derive model performance and MLOps controls with Amazon SageMaker AI features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your VPC, helping to support data security for enterprise security needs.

Prerequisites

Before getting started with deployment, make sure your AWS Identity and Access Management (IAM) service role for Amazon SageMaker has the AmazonSageMakerFullAccess permission policy attached. To deploy the NVIDIA Llama Nemotron models successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • If your account is already subscribed to the model, you can skip to the Deploy section below. Otherwise, please start by subscribing to the model package and then move to the Deploy section.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Open the model package listing page and choose Llama 3.3 Nemotron Super 49B V1 or Llama 3.1 Nemotron Nano 8B V1.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region where you have the service quota for the desired instance type.

A product ARN will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

(Option-1) Deploy NVIDIA Llama Nemotron Super and Nano models on SageMaker JumpStart

For those new to SageMaker JumpStart, we will go to SageMaker Studio to access models on SageMaker JumpStart. The Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 models are available on SageMaker JumpStart. Deployment starts when you choose the Deploy option; you may be prompted to subscribe to this model on AWS Marketplace. If you are already subscribed, you can move forward by choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

(Option-2) Deploy NVIDIA Llama Nemotron using the SageMaker SDK

In this section we will walk through deploying the Llama 3.3 Nemotron Super 49B V1 model through the SageMaker SDK. A similar process can be followed for deploying the Llama 3.1 Nemotron Nano 8B V1 model as well.

Define the SageMaker model using the Model Package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

sm_model_name  "nim-llama-3-3-nemotron-super-49b-v1"
create_model_response  smcreate_model(
    ModelNamesm_model_name,
    PrimaryContainer{
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArnrole,
    EnableNetworkIsolation
)
print("Model Arn: "  create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create the endpoint configuration by specifying the instance type, in this case ml.g6e.12xlarge. Make sure you have the account-level service quota to use one or more ml.g6e.12xlarge instances for endpoint usage. NVIDIA also provides a list of supported instance types that support deployment. Refer to the AWS Marketplace listing for both of these models to see supported instance types. To request a service quota increase, see AWS service quotas.

endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g6e.12xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the previous endpoint configuration, we create a new SageMaker endpoint and add a wait loop, as shown below, until the deployment finishes. This typically takes around 5-10 minutes. The status will change to InService once the deployment is successful.

endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the endpoint

Let’s now track the status of the endpoint as it deploys.

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)
    
print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Run Inference with Llama 3.3 Nemotron Super 49B V1

Once we have the model deployed, we can use a sample text to make an inference request. NIM on SageMaker supports the OpenAI API inference protocol for inference requests. For an explanation of the supported parameters, see Creates a model in the NVIDIA documentation.

Real-Time inference example

The following code examples illustrate how to perform real-time inference using the Llama 3.3 Nemotron Super 49B V1 model in non-reasoning and reasoning mode.

Non-reasoning mode

Perform real-time inference in non-reasoning mode:

payload_model  "nvidia/llama-3.3-nemotron-super-49b-v1"
messages  [
    {
    "role": "system",
    "content": "detailed thinking off"
    },
    {
    "role":"user",
    "content":"Explain how a transformer neural network works."
    }
    ]
    
payload  {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 3000
}

response  clientinvoke_endpoint(
    EndpointNameendpoint_name, ContentType"application/json", Bodyjsondumps(payload)
)

output  jsonloads(response["Body"]read()decode("utf8"))
print(jsondumps(output, indent2))

Reasoning mode

Perform real-time inference in reasoning mode:

payload_model  "nvidia/llama-3.3-nemotron-super-49b-v1"
messages  [
    {
    "role": "system",
    "content": "detailed thinking on"
    },
    {
    "role":"user",
    "content":"Explain how a transformer neural network works."
    }
    ]
payload  {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 3000
}

response  clientinvoke_endpoint(
    EndpointNameendpoint_name, ContentType"application/json", Bodyjsondumps(payload)
)

output  jsonloads(response["Body"]read()decode("utf8"))
print(jsondumps(output, indent2))

Streaming inference

NIM on SageMaker also supports streaming inference and you can enable that by setting stream as True in the payload and by using the invoke_endpoint_with_response_stream method.

Streaming inference:

payload_model = "nvidia/llama-3.3-nemotron-super-49b-v1"
messages = [
    {   
      "role": "system",
      "content": "detailed thinking on"# this can be toggled off to disable reasoning
    },
    {
      "role":"user",
      "content":"Explain how a transformer neural network works."
    }
  ]

payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 3000,
  "stream": True
}

response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

We can use some post-processing code for the streaming output that reads the byte-chunks coming from the endpoint, pieces them into full JSON messages, extracts any new text the model produced, and immediately prints that text to output.

event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        payload = event.get('PayloadPart', {}).get('Bytes', b'')
        if payload:
            data_str = payload.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue
    except Exception as e:
        print(f"nError processing event: {e}", flush=True)
        continue

Clean up

To avoid unwanted charges, complete the steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

  1. On the Amazon Bedrock console, in the navigation pane in the Foundation models section, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Select the endpoint, and on the Actions menu, choose Delete.
  4. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  5. Choose Delete to delete the endpoint.
  6. In the Delete endpoint confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart Endpoint

The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

NVIDIA’s Nemotron Llama3 models deliver optimized AI reasoning capabilities and are now available on AWS through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. The Llama 3.3 Nemotron Super 49B V1, derived from Meta’s 70B model, uses Neural Architecture Search (NAS) to achieve a reduced 49B parameter count while maintaining high accuracy, enabling deployment on a single H200 GPU despite its sophisticated capabilities. Meanwhile, the compact Llama 3.1 Nemotron Nano 8B V1 fits on a single H100 or A100 GPU (P5 or P4 instances) while improving on Meta’s reference model accuracy, making it ideal for efficiency-conscious applications. Both models support extensive 128K token context windows and are post-trained for enhanced reasoning, RAG capabilities, and tool calling, offering organizations flexible options to balance performance and computational requirements for enterprise AI applications.

With this launch, organizations can now leverage the advanced reasoning capabilities of these models while benefiting from the scalable infrastructure of AWS. Through either the intuitive UI or just a few lines of code, you can quickly deploy these powerful language models to transform your AI applications with minimal effort. These complementary platforms provide straightforward access to NVIDIA’s robust technologies, allowing teams to immediately begin exploring and implementing sophisticated reasoning capabilities in their enterprise solutions.


About the authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Brian Kreitzer is a Partner Solutions Architect at Amazon Web Services (AWS). He works with partners to define business requirements, provide architectural guidance, and design solutions for the Amazon Marketplace.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Read More

Automate customer support with Amazon Bedrock, LangGraph, and Mistral models

Automate customer support with Amazon Bedrock, LangGraph, and Mistral models

AI agents are transforming the landscape of customer support by bridging the gap between large language models (LLMs) and real-world applications. These intelligent, autonomous systems are poised to revolutionize customer service across industries, ushering in a new era of human-AI collaboration and problem-solving. By harnessing the power of LLMs and integrating them with specialized tools and APIs, agents can tackle complex, multistep customer support tasks that were previously beyond the reach of traditional AI systems. As we look to the future, AI agents will play a crucial role in the following areas:

  • Enhancing decision-making – Providing deeper, context-aware insights to improve customer support outcomes
  • Automating workflows – Streamlining customer service processes, from initial contact to resolution, across various channels
  • Human-AI interactions – Enabling more natural and intuitive interactions between customers and AI systems
  • Innovation and knowledge integration – Generating new solutions by combining diverse data sources and specialized knowledge to address customer queries more effectively
  • Ethical AI practices – Helping provide more transparent and explainable AI systems to address customer concerns and build trust

Building and deploying AI agent systems for customer support is a step toward unlocking the full potential of generative AI in this domain. As these systems evolve, they will transform customer service, expand possibilities, and open new doors for AI in enhancing customer experiences.

In this post, we demonstrate how to use Amazon Bedrock and LangGraph to build a personalized customer support experience for an ecommerce retailer. By integrating the Mistral Large 2 and Pixtral Large models, we guide you through automating key customer support workflows such as ticket categorization, order details extraction, damage assessment, and generating contextual responses. These principles are applicable across various industries, but we use the ecommerce domain as our primary example to showcase the end-to-end implementation and best practices. This post provides a comprehensive technical walkthrough to help you enhance your customer service capabilities and explore the latest advancements in LLMs and multimodal AI.

LangGraph is a powerful framework built on top of LangChain that enables the creation of cyclical, stateful graphs for complex AI agent workflows. It uses a directed graph structure where nodes represent individual processing steps (like calling an LLM or using a tool), edges define transitions between steps, and state is maintained and passed between nodes during execution. This architecture is particularly valuable for customer support automation workflows. LangGraph’s advantages include built-in visualization, logging (traces), human-in-the-loop capabilities, and the ability to organize complex workflows in a more maintainable way than traditional Python code. This post provides details on how to do the following:

  • Use Amazon Bedrock and LangGraph to build intelligent, context-aware customer support workflows
  • Integrate data in a helpdesk tool, like JIRA, in the LangChain workflow
  • Use LLMs and vision language models (VLMs) in the workflow to perform context-specific tasks
  • Extract information from images to aid in decision-making
  • Compare images to assess product damage claims
  • Generate responses for the customer support tickets

Solution overview

In this solution, customers initiate support requests through email, which are automatically converted into new support tickets in Atlassian Jira Service Management. The customer support automation solution then takes over, identifying the intent behind each query, categorizing the tickets, and assigning them to a bot user for further processing. The solution uses LangGraph to orchestrate a workflow in which AI agents extract key identifiers such as transaction IDs and order numbers from the support ticket. It analyzes the query and uses these identifiers to call relevant tools, extracting additional information from the database to generate a comprehensive and context-aware response. After the response is prepared, it’s updated in Jira for human support agents to review before it’s sent back to the customer. This process is illustrated in the following figure. The solution can extract information not only from the ticket body and title but also from attached images such as screenshots, as well as from external databases.

Solution Architecture

The solution uses two foundation models (FMs) from Amazon Bedrock, each selected based on its specific capabilities and the complexity of the tasks involved. For instance, the Pixtral model is used for vision-related tasks like image comparison and ID extraction, whereas the Mistral Large 2 model handles a variety of tasks like ticket categorization, response generation, and tool calling. Additionally, the solution includes fraud detection and prevention capabilities. It can identify fraudulent product returns by comparing the stock product image with the returned product image to verify if they match and assess whether the returned product is genuinely damaged. This integration of advanced AI models with automation tools enhances the efficiency and reliability of the customer support process, facilitating timely resolutions and protecting against fraudulent activity. LangGraph provides a framework for orchestrating the information flow between agents, featuring built-in state management and checkpointing to facilitate seamless process continuity. This functionality allows the inclusion of initial ticket summaries and descriptions in the State object, with additional information appended in subsequent steps of the workflows. By maintaining this evolving context, LangGraph enables LLMs to generate context-aware responses. See the following code:

# class to hold state information

class JiraAppState(MessagesState):
    key: str
    summary: str
    description: str
    attachments: list
    category: str
    response: str
    transaction_id: str
    order_no: str
    usage: list

The framework integrates effortlessly with Amazon Bedrock and LLMs, supporting task-specific diversification by using cost-effective models for simpler tasks while reducing the risks of exceeding model quotas. Furthermore, LangGraph offers conditional routing for dynamic workflow adjustments based on intermediate results, and its modular design facilitates the addition or removal of agents to extend system capabilities.

Responsible AI

It’s crucial for customer support automation applications to validate inputs and make sure LLM outputs are secure and responsible. Amazon Bedrock Guardrails can significantly enhance customer support automation applications by providing configurable safeguards that monitor and filter both user inputs and AI-generated responses, making sure interactions remain safe, relevant, and aligned with organizational policies. By using features such as content filters, which detect and block harmful categories like hate speech, insults, sexual content, and violence, as well as denied topics to help prevent discussions on sensitive or restricted subjects (for example, legal or medical advice), customer support applications can avoid generating or amplifying inappropriate or harmful information. Additionally, guardrails can help redact personally identifiable information (PII) from conversation transcripts, protecting user privacy and fostering trust. These measures not only reduce the risk of reputational harm and regulatory violations but also create a more positive and secure experience for customers, allowing support teams to focus on resolving issues efficiently while maintaining high standards of safety and responsibility.

The following diagram illustrates this architecture.

Guardrails

Observability

Along with Responsible AI, observability is vital for customer support applications to provide deep, real-time visibility into model performance, usage patterns, and operational health, enabling teams to proactively detect and resolve issues. With comprehensive observability, you can monitor key metrics such as latency and token consumption, and track and analyze input prompts and outputs for quality and compliance. This level of insight helps identify and mitigate risks like hallucinations, prompt injections, toxic language, and PII leakage, helping make sure that customer interactions remain safe, reliable, and aligned with regulatory requirements.

Prerequisites

In this post, we use Atlassian Jira Service Management as an example. You can use the same general approach to integrate with other service management tools that provide APIs for programmatic access. The configuration required in Jira includes:

  • A Jira service management project with API token to enable programmatic access
  • The following custom fields:
    • Name: Category, Type: Select List (multiple choices)
    • Name: Response, Type: Text Field (multi-line)
  • A bot user to assign tickets

The following code shows a sample Jira configuration:

JIRA_API_TOKEN = "<JIRA_API_TOKEN>"
JIRA_USERNAME = "<JIRA_USERNAME>"
JIRA_INSTANCE_URL = "https://<YOUR_JIRA_INSTANCE_NAME>.atlassian.net/"
JIRA_PROJECT_NAME = "<JIRA_PROJECT_NAME>"
JIRA_PROJECT_KEY = "<JIRA_PROJECT_KEY>"
JIRA_BOT_USER_ID = '<JIRA_BOT_USER_ID>'

In addition to Jira, the following services and Python packages are required:

  • A valid AWS account.
  • An AWS Identity and Access Management (IAM) role in the account that has sufficient permissions to create the necessary resources.
  • Access to the following models hosted on Amazon Bedrock:
    • Mistral Large 2 (model ID: mistral.mistral-large-2407-v1:0).
    • Pixtral Large (model ID: us.mistral.pixtral-large-2502-v1:0). The Pixtral Large model is available in Amazon Bedrock under cross-Region inference profiles.
  • A LangGraph application up and running locally. For instructions, see Quickstart: Launch Local LangGraph Server.

For this post, we use the us-west-2 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

The source code of this solution is available in the GitHub repository. This is example code; you should conduct your own due diligence and adhere to the principle of least privilege.

Implementation with LangGraph

At the core of customer support automation is a suite of specialized tools and functions designed to collect, analyze, and integrate data from service management systems and a SQLite database. These tools serve as the foundation of our system, empowering it to deliver context-aware responses. In this section, we delve into the essential components that power our system.

BedrockClient class

The BedrockClient class is implemented in the cs_bedrock.py file. It provides a wrapper for interacting with Amazon Bedrock services, specifically for managing language models and content safety guardrails in customer support applications. It simplifies the process of initializing language models with appropriate configurations and managing content safety guardrails. This class is used by LangChain and LangGraph to invoke LLMs on Amazon Bedrock.

This class also provides methods to create guardrails for responsible AI implementation. The following Amazon Bedrock Guardrails policy filters sexual content, violence, hate speech, insults, misconduct, and prompt attacks, and helps prevent the model from generating stock and investment advice, profanity, and hateful, violent, or sexual content. It also helps protect against prompt attacks that attempt to expose vulnerabilities in the model.

# guardrails policy

contentPolicyConfig={
    'filtersConfig': [
        {
            'type': 'SEXUAL',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'VIOLENCE',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'HATE',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'INSULTS',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'MISCONDUCT',
            'inputStrength': 'MEDIUM',
            'outputStrength': 'MEDIUM'
        },
        {
            'type': 'PROMPT_ATTACK',
            'inputStrength': 'LOW',
            'outputStrength': 'NONE'
        }
    ]
},
wordPolicyConfig={
    'wordsConfig': [
        {'text': 'stock and investment advice'}
    ],
    'managedWordListsConfig': [
        {'type': 'PROFANITY'}
    ]
},
contextualGroundingPolicyConfig={
    'filtersConfig': [
        {
            'type': 'GROUNDING',
            'threshold': 0.65
        },
        {
            'type': 'RELEVANCE',
            'threshold': 0.75
        }
    ]
}
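
For reference, these policy blocks are keyword arguments that can be passed to the Amazon Bedrock create_guardrail API. The following is a minimal sketch of such a call, not the exact code from cs_bedrock.py; the guardrail name, description, and blocked-response messages are illustrative assumptions, and the content filters are abbreviated.

import boto3

# Minimal sketch (not the exact cs_bedrock.py implementation): create a guardrail
# from the policy blocks shown above. The name and messages below are assumptions.
bedrock = boto3.client("bedrock")

response = bedrock.create_guardrail(
    name="customer-support-guardrail",
    description="Filters harmful content, investment advice, and prompt attacks",
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "SEXUAL", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
            {"type": "VIOLENCE", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
            # ... remaining filters (HATE, INSULTS, MISCONDUCT, PROMPT_ATTACK) as listed above
        ]
    },
    wordPolicyConfig={
        "wordsConfig": [{"text": "stock and investment advice"}],
        "managedWordListsConfig": [{"type": "PROFANITY"}],
    },
    contextualGroundingPolicyConfig={
        "filtersConfig": [
            {"type": "GROUNDING", "threshold": 0.65},
            {"type": "RELEVANCE", "threshold": 0.75},
        ]
    },
    blockedInputMessaging="Sorry, this request can't be processed.",
    blockedOutputsMessaging="Sorry, a response can't be provided for this request.",
)
guardrail_id = response["guardrailId"]
guardrail_version = response["version"]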

Database class

The Database class is defined in the cs_db.py file. This class is designed to facilitate interactions with a SQLite database. It’s responsible for creating a local SQLite database and importing synthetic data related to customers, orders, refunds, and transactions. By doing so, it makes sure that the necessary data is readily available for various operations. Furthermore, the class includes convenient wrapper functions that simplify the process of querying the database.
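
The following minimal sketch illustrates the general pattern of creating a local SQLite database and exposing a thin query wrapper. The table schema and method names here are hypothetical and not the actual cs_db.py implementation.

import sqlite3

# Minimal sketch of the pattern (hypothetical table and method names): create a
# local SQLite database, load data, and expose a simple query wrapper.
class Database:
    def __init__(self, db_path: str = "customer_support.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row

    def create_tables(self):
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS orders (
                   order_no TEXT PRIMARY KEY,
                   customer_id TEXT,
                   status TEXT,
                   delivery_date TEXT
               )"""
        )
        self.conn.commit()

    def find_order(self, order_no: str):
        # Returns the order row as a dict, or None if it doesn't exist
        row = self.conn.execute(
            "SELECT * FROM orders WHERE order_no = ?", (order_no,)
        ).fetchone()
        return dict(row) if row else None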

JiraSM class

The JiraSM class is implemented in the cs_jira_sm.py file. It serves as an interface for interacting with Jira Service Management. It establishes a connection to Jira by using the API token, user name, and instance URL, all of which are configured in the .env file. This setup provides secure and flexible access to the Jira instance. The class is designed to handle various ticket operations, including reading tickets and assigning them to a preconfigured bot user. Additionally, it supports downloading attachments from tickets and updating custom fields as needed.
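
As a rough illustration only, a connection like the one described might be established with the python-dotenv and jira packages as follows; the actual cs_jira_sm.py implementation may use different libraries and helper methods.

import os

from dotenv import load_dotenv
from jira import JIRA

# Minimal sketch (assumes the python-dotenv and jira packages): load credentials
# from the .env file and connect to the Jira instance. Not the actual cs_jira_sm.py code.
load_dotenv()  # reads the JIRA_* values from the .env file

jira_client = JIRA(
    server=os.environ["JIRA_INSTANCE_URL"],
    basic_auth=(os.environ["JIRA_USERNAME"], os.environ["JIRA_API_TOKEN"]),
)

issue = jira_client.issue("AS-6")                                 # read a ticket
jira_client.assign_issue(issue, os.environ["JIRA_BOT_USER_ID"])   # assign it to the bot user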

CustomerSupport class

The CustomerSupport class is implemented in the cs_cust_support_flow.py file. This class encapsulates the customer support processing logic by using LangGraph and Amazon Bedrock. Using LangGraph nodes and tools, this class orchestrates the customer support workflow. The workflow initially determines the category of the ticket by analyzing its content and classifying it as related to transactions, deliveries, refunds, or other issues. It updates the support ticket with the category detected. Following this, the workflow extracts pertinent information such as transaction IDs or order numbers, which might involve analyzing both text and images, and queries the database for relevant details. The next step is response generation, which is context-aware and adheres to content safety guidelines while maintaining a professional tone. Finally, the workflow integrates with Jira, assigning categories, updating responses, and managing attachments as needed.

The LangGraph orchestration is implemented in the build_graph function, as illustrated in the following code. This function also generates a visual representation of the workflow using a Mermaid graph for better clarity and understanding. This setup supports an efficient and structured approach to handling customer support tasks.

def build_graph(self):
    """
    This function prepares LangGraph nodes, edges, conditional edges, compiles the graph and displays it 
    """

    # create StateGraph object
    graph_builder = StateGraph(JiraAppState)

    # add nodes to the graph
    graph_builder.add_node("Determine Ticket Category", self.determine_ticket_category_tool)
    graph_builder.add_node("Assign Ticket Category in JIRA", self.assign_ticket_category_in_jira_tool)
    graph_builder.add_node("Extract Transaction ID", self.extract_transaction_id_tool)
    graph_builder.add_node("Extract Order Number", self.extract_order_number_tool)
    graph_builder.add_node("Find Transaction Details", self.find_transaction_details_tool)
    
    graph_builder.add_node("Find Order Details", self.find_order_details_tool)
    graph_builder.add_node("Generate Response", self.generate_response_tool)
    graph_builder.add_node("Update Response in JIRA", self.update_response_in_jira_tool)

    graph_builder.add_node("tools", ToolNode([StructuredTool.from_function(self.assess_damaged_delivery), StructuredTool.from_function(self.find_refund_status)]))
    
    # add edges to connect nodes
    graph_builder.add_edge(START, "Determine Ticket Category")
    graph_builder.add_edge("Determine Ticket Category", "Assign Ticket Category in JIRA")
    graph_builder.add_conditional_edges("Assign Ticket Category in JIRA", self.decide_ticket_flow_condition)
    graph_builder.add_edge("Extract Order Number", "Find Order Details")
    
    graph_builder.add_edge("Extract Transaction ID", "Find Transaction Details")
    graph_builder.add_conditional_edges("Find Order Details", self.order_query_decision, ["Generate Response", "tools"])
    graph_builder.add_edge("tools", "Generate Response")
    graph_builder.add_edge("Find Transaction Details", "Generate Response")
    
    graph_builder.add_edge("Generate Response", "Update Response in JIRA")
    graph_builder.add_edge("Update Response in JIRA", END)

    # compile the graph
    checkpoint = MemorySaver()
    app = graph_builder.compile(checkpointer=checkpoint)
    self.graph_app = app
    self.util.log_data(data="Workflow compiled successfully", ticket_id='NA')

    # Visualize the graph
    display(Image(app.get_graph().draw_mermaid_png(draw_method=MermaidDrawMethod.API)))

    return app

LangGraph generates the following Mermaid diagram to visually represent the workflow.

Mermaid diagram

Utility class

The Utility class, implemented in the cs_util.py file, provides essential helper functions for the customer support automation solution. It encompasses utilities for logging, file handling, usage metric tracking, and image processing operations. The class is designed as a central hub for various helper methods, streamlining common tasks across the application. By consolidating these operations, it promotes code reusability and maintainability within the system. Its functionality makes sure that the automation framework remains efficient and organized.

A key feature of this class is its comprehensive logging capabilities. It provides methods to log informational messages, errors, and significant events directly into the cs_logs.log file. Additionally, it tracks Amazon Bedrock LLM token usage and latency metrics, facilitating detailed performance monitoring. The class also logs the execution flow of application-generated prompts and LLM generated responses, aiding in troubleshooting and debugging. These log files can be seamlessly integrated with standard log pusher agents, allowing for automated transfer to preferred log monitoring systems. This integration makes sure that system activity is thoroughly monitored and quickly accessible for analysis.
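
The following simplified sketch shows the logging pattern only; the method names mirror the calls seen later in this post (log_data, log_usage), but the bodies are illustrative and not the actual cs_util.py code.

import logging

# Simplified sketch of the logging pattern (illustrative method bodies only).
class Utility:
    def __init__(self, log_file: str = "cs_logs.log"):
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format="%(asctime)s %(levelname)s %(message)s",
        )
        self.logger = logging.getLogger("customer-support")

    def log_data(self, data: str, ticket_id: str):
        self.logger.info("[%s] %s", ticket_id, data)

    def log_usage(self, usage: list, ticket_id: str):
        # Each usage entry is expected to hold the model ID, token counts, and latency
        for entry in usage:
            self.logger.info("[%s] usage: %s", ticket_id, entry)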

Run the agentic workflow

Now that the customer support workflow is defined, it can be executed for various ticket types. The following functions use the provided ticket key to fetch the corresponding Jira ticket and download available attachments. Additionally, they initialize the State object with details such as the ticket key, summary, description, attachment file path, and a system prompt for the LLM. This State object is used throughout the workflow execution.

def generate_response_for_ticket(ticket_id: str):
    
    llm, vision_llm, llm_with_guardrails = bedrock_client.init_llms(ticket_id=ticket_id)
    cust_support = CustomerSupport(llm=llm, vision_llm=vision_llm, llm_with_guardrails=llm_with_guardrails)
    app   = cust_support.build_graph()
    
    state = cust_support.get_jira_ticket(key=ticket_id)
    state = app.invoke(state, thread)
    
    util.log_usage(state['usage'], ticket_id=ticket_id)
    util.log_execution_flow(state["messages"], ticket_id=ticket_id)
    

The following code snippet invokes the workflow for the Jira ticket with key AS-6:

# initialize classes and create bedrock guardrails
bedrock_client = BedrockClient()
util = Utility()
guardrail_id = bedrock_client.create_guardrail()

# process a JIRA ticket
generate_response_for_ticket(ticket_id='AS-6')

The following screenshot shows the Jira ticket before processing. Notice that the Response and Category fields are empty, and the ticket is unassigned.

Support Ticket - Initial

The following screenshot shows the Jira ticket after processing. The Category field is updated as Refunds and the Response field is updated by the AI-generated content.

Support Ticket - updated

This logs LLM usage information as follows:

Model                              Input Tokens  Output Tokens Latency 
mistral.mistral-large-2407-v1:0      385               2         653  
mistral.mistral-large-2407-v1:0      452              27         884      
mistral.mistral-large-2407-v1:0     1039              36        1197   
us.mistral.pixtral-large-2502-v1:0  4632             425        5952   
mistral.mistral-large-2407-v1:0     1770             144        4556  

Clean up

Delete any IAM roles and policies created specifically for this post. Delete the local copy of this post’s code.

If you no longer need access to an Amazon Bedrock FM, you can remove access from it. For instructions, see Add or remove access to Amazon Bedrock foundation models.

Delete the temporary files and guardrails used in this post with the following code:

shutil.rmtree(util.get_temp_path())
bedrock_client.delete_guardrail()

Conclusion

In this post, we developed an AI-driven customer support solution using Amazon Bedrock, LangGraph, and Mistral models. This advanced agent-based workflow efficiently handles diverse customer queries by integrating multiple data sources and extracting relevant information from tickets or screenshots. It also evaluates damage claims to mitigate fraudulent returns. The solution is designed with flexibility, allowing the addition of new conditions and data sources as businesses need to evolve. With this multi-agent approach, you can build robust, scalable, and intelligent systems that redefine the capabilities of generative AI in customer support.

Want to explore further? Check out the following GitHub repo. There, you can observe the code in action and experiment with the solution yourself. The repository includes step-by-step instructions for setting up and running the multi-agent system, along with code for interacting with data sources and agents, routing data, and visualizing workflows.


About the authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, specializing in helping financial services and fintech clients optimize and scale their applications on the AWS Cloud. With a strong focus on trending AI technologies, including generative AI, AI agents, and the Model Context Protocol (MCP), Deepesh uses his expertise in machine learning to design innovative, scalable, and secure solutions. Passionate about the transformative potential of AI, he actively explores cutting-edge advancements to drive efficiency and innovation for AWS customers. Outside of work, Deepesh enjoys spending quality time with his family and experimenting with diverse culinary creations.

Read More

Build responsible AI applications with Amazon Bedrock Guardrails

Build responsible AI applications with Amazon Bedrock Guardrails

As organizations embrace generative AI, they face critical challenges in making sure their applications align with their designed safeguards. Although foundation models (FMs) offer powerful capabilities, they can also introduce unique risks, such as generating harmful content, exposing sensitive information, being vulnerable to prompt injection attacks, and returning model hallucinations.

Amazon Bedrock Guardrails has helped address these challenges for multiple organizations, such as MAPRE, KONE, Fiserv, PagerDuty, Aha, and more. Just as traditional applications require multi-layered security, Amazon Bedrock Guardrails implements essential safeguards across model, prompt, and application levels—blocking up to 88% more undesirable and harmful multimodal content. Amazon Bedrock Guardrails helps filter over 75% of hallucinated responses in Retrieval Augmented Generation (RAG) and summarization use cases, and stands as the first and only safeguard using Automated Reasoning to prevent factual errors from hallucinations.

In this post, we show how to implement safeguards using Amazon Bedrock Guardrails in a healthcare insurance use case.

Solution overview

We consider an innovative AI assistant designed to streamline interactions of policyholders with the healthcare insurance firm. With this AI-powered solution, policyholders can check coverage details, submit claims, find in-network providers, and understand their benefits through natural, conversational interactions. The assistant provides all-day support, handling routine inquiries while allowing human agents to focus on complex cases. To help enable secure and compliant operations of our assistant, we use Amazon Bedrock Guardrails to serve as a critical safety framework. Amazon Bedrock Guardrails can help maintain high standards of blocking undesirable and harmful multimodal content. This not only protects the users, but also builds trust in the AI system, encouraging wider adoption and improving overall customer experience in healthcare insurance interactions.

This post walks you through the capabilities of Amazon Bedrock Guardrails from the AWS Management Console. Refer to the following GitHub repo for information about creating, updating, and testing Amazon Bedrock Guardrails using the SDK.

Amazon Bedrock Guardrails provides configurable safeguards to help safely build generative AI applications at scale. It evaluates user inputs and model responses based on specific policies, working with all large language models (LLMs) on Amazon Bedrock, fine-tuned models, and external FMs using the ApplyGuardrail API. The solution integrates seamlessly with Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases, so organizations can apply multiple guardrails across applications with tailored controls.

Guardrails can be implemented in two ways: direct integration with Invoke APIs (InvokeModel and InvokeModelWithResponseStream) and Converse APIs (Converse and ConverseStream) for models hosted on Amazon Bedrock, applying safeguards during inference, or through the flexible ApplyGuardrail API, which enables independent content evaluation without model invocation. This second method is ideal for assessing inputs or outputs at various application stages and works with custom or third-party models that are not hosted on Amazon Bedrock. Both approaches empower developers to implement use case-specific safeguards aligned with responsible AI policies, helping to block undesirable and harmful multimodal content from generative AI applications.
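
As a minimal sketch of the first approach, a guardrail can be attached to a Converse call by passing its identifier and version in guardrailConfig. The guardrail ID, model ID, and prompt below are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Minimal sketch: apply a guardrail during inference with the Converse API.
# The guardrail ID, version, model ID, and prompt are placeholders.
response = bedrock_runtime.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": [{"text": "What does my policy cover for X-rays?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "<GUARDRAIL_ID>",
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])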

The following diagram depicts the six safeguarding policies offered by Amazon Bedrock Guardrails.

Diagram showing Amazon Bedrock Guardrails system flow from user input to final response with content filtering steps

Prerequisites

Before we begin, make sure you have access to the console with appropriate permissions for Amazon Bedrock. If you haven’t set up Amazon Bedrock yet, refer to Getting started in the Amazon Bedrock console.

Create a guardrail

To create a guardrail for our healthcare insurance assistant, complete the following steps:

  1. On the Amazon Bedrock console, choose Guardrails in the navigation pane.
  2. Choose Create guardrail.
  3. In the Provide guardrail details section, enter a name (for this post, we use MyHealthCareGuardrail), an optional description, and a message to display if your guardrail blocks the user prompt, then choose Next.

Amazon Bedrock Guardrails configuration interface for MyHealthCareGuardrail with multi-step setup process and customizable options

Configure multimodal content filters

Security is paramount when building AI applications. With image content filters in Amazon Bedrock Guardrails, you can now detect and filter both text and image content across six protection categories: Hate, Insults, Sexual, Violence, Misconduct, and Prompt Attacks.

  1. In the Configure content filters section, for maximum protection, especially in sensitive sectors like healthcare in our example use case, set your confidence thresholds to High across all categories for both text and image content.
  2. Enable prompt attack protection to prevent system instruction tampering, and use input tagging to maintain accurate classification of system prompts, then choose Next.

AWS guardrail configuration interface for content filtering showing harmful content categories, threshold controls, and prompt attack prevention settings

Denied topics

In healthcare applications, we need clear boundaries around medical advice. Let’s configure Amazon Bedrock Guardrails to prevent users from attempting disease diagnosis, which should be handled by qualified healthcare professionals.

  1. In the Add denied topics section, create a new topic called Disease Diagnosis, add example phrases that represent diagnostic queries, and choose Confirm.

This setting helps make sure our application stays within appropriate boundaries for insurance-related queries while avoiding medical diagnosis discussions. For example, when users ask questions like “Do I have diabetes?” or “What’s causing my headache?”, the guardrail will detect these as diagnosis-related queries and block them with an appropriate response.

Amazon Bedrock Guardrails interface showing Disease Diagnosis denied topic setup with sample phrases

  1. After you set up your denied topics, choose Next to proceed with word filters.

Amazon Bedrock Guardrails configuration interface with Disease Diagnosis as denied topic

Word filters

Configuring word filters in Amazon Bedrock Guardrails helps keep our healthcare insurance application focused and professional. These filters help maintain conversation boundaries and make sure responses stay relevant to health insurance queries.

Let’s set up word filters for two key purposes:

  • Block inappropriate language to maintain professional discourse
  • Filter irrelevant topics that fall outside the healthcare insurance scope

To set them up, do the following:

  1. In the Add word filters section, add custom words or phrases to filter (in our example, we include off-topic terms like “stocks,” “investment strategies,” and “financial performance”), then choose Next.

Amazon Bedrock guardrail creation interface showing word filter configuration steps and options

Sensitive information filters

With sensitive information filters, you can configure filters to block email addresses, phone numbers, and other personally identifiable information (PII), as well as set up custom regex patterns for industry-specific data requirements. For example, healthcare providers use these filters to help maintain HIPAA compliance by automatically blocking the PII types they specify. This way, they can use AI capabilities while maintaining strict patient privacy standards.

  1. For our example, configure filters for blocking the email address and phone number of healthcare insurance users, then choose Next.

Amazon Bedrock interface for configuring sensitive information filters with PII and regex options
Contextual grounding checks

We use Amazon Bedrock Guardrails contextual grounding and relevance checks in our application to help validate model responses, detect hallucinations, and support alignment with reference sources.

  1. Set up the thresholds for contextual grounding and relevance checks (we set them to 0.7), then choose Next.

Amazon Bedrock guardrail configuration for contextual grounding and relevance checks

Automated Reasoning checks

Automated Reasoning checks help detect hallucinations and provide a verifiable proof that our application’s model (LLM) response is accurate.

The first step to incorporate Automated Reasoning checks for our application is to create an Automated Reasoning policy that is composed of a set of variables, defined with a name, type, and description, and the logical rules that operate on the variables. These rules are expressed in formal logic, but they’re translated to natural language to make it straightforward for a user without formal logic expertise to refine a model. Automated Reasoning checks use the variable descriptions to extract their values when validating a Q&A.

  1. To create an Automated Reasoning policy, choose the new Automated Reasoning menu option under Safeguards.
  2. Create a new policy and give it a name, then upload an existing document that defines the right solution space, such as an HR guideline or an operational manual. For this demo, we use an example healthcare insurance policy document that includes the insurance coverage policies applicable to insurance holders.

Automated Reasoning checks is in preview in Amazon Bedrock Guardrails in the US West (Oregon) AWS Region. To request to be considered for access to the preview today, contact your AWS account team.

  1. Define the policy’s intent and processing parameters and choose Create policy.

Amazon Bedrock interface showing HealthCareCoveragePolicy creation page with policy details, generation settings, and file upload

The system now initiates an automated process to create your Automated Reasoning policy. This process involves analyzing your document, identifying key concepts, breaking down the document into individual units, translating these natural language units into formal logic, validating the translations, and finally combining them into a comprehensive logical model. You can review the generated structure, including the rules and variables, and edit these for accuracy through the UI.

Amazon Bedrock policy editor displaying comprehensive healthcare coverage rules and variables with types, descriptions, and configuration options

  1. To attach the Automated Reasoning policy to your guardrail, turn on Enable Automated Reasoning policy, choose the policy and policy version you want to use, then choose Next.

Amazon Bedrock guardrail creation wizard on step 7, showing HealthCareCoveragePolicy Automated Reasoning configuration options

  1. Review the configurations set in the previous steps and choose Create guardrail.

Amazon Bedrock Guardrail 8-step configuration summary showing MyHealthCareGuardrail setup with safety measures and blocked response messages

Amazon Bedrock Guardrail content filter configuration showing harmful categories and denied topics

Amazon Bedrock Guardrail Steps 4-5 showing enabled profanity filter, word lists, and PII blocking settings

Amazon Bedrock Guardrail setup steps 6-7 with enabled grounding checks and HealthCareCoveragePolicy settings

Test your guardrail

We can now test our healthcare insurance call center application with different inputs and see how the configured guardrail intervenes for harmful and undesirable multimodal content.

  1. On the Amazon Bedrock console, on the guardrail details page, choose Select model in the Test panel.

Amazon Bedrock healthcare guardrail dashboard displaying overview, status, and test interface

  1. Choose your model, then choose Apply.

For our example, we use the Amazon Nova Lite FM, which is a low-cost multimodal model that is lightning fast for processing image, video, and text input. For your use case, you can use another model of your choice.

AWS Guardrail configuration interface showing model categories, providers, and inference options with Nova Lite selected

  1. Enter a query prompt with a denied topic.

For example, if we ask “I have cold and sore throat, do you think I have Covid, and if so please provide me information on what is the coverage,” the system recognizes this as a request for a disease diagnosis. Because Disease Diagnosis is configured as a denied topic in the guardrail settings, the system blocks the response.

Amazon Bedrock interface with Nova Lite model blocking COVID-19 related question

  1. Choose View trace to see the details of the intervention.

Amazon Bedrock Guardrails interface with Nova Lite model, blocked response for COVID-19 query
You can test with other queries. For example, if we ask “What is the financial performance of your insurance company in 2024?”, the word filter guardrail that we configured earlier intervenes. You can choose View trace to see that the word filter was invoked.

Amazon Bedrock interface showing blocked response due to guardrail word filter detection

Next, we use a prompt to validate if PII data in input can be blocked using the guardrail. We ask “Can you send my lab test report to abc@gmail.com?” Because the guardrail was set up to block email addresses, the trace shows an intervention due to PII detection in the input prompt.

Amazon Bedrock healthcare guardrail demonstration showing blocked response due to sensitive information filter detecting email

If we enter the prompt “I am frustrated on someone, and feel like hurting the person,” the text content filter is invoked for Violence because we set the Violence filter strength to High when creating the guardrail.

Amazon Bedrock guardrail test interface showing blocked response due to detected violence in prompt

If we provide an image file in the prompt that contains content of the category Violence, the image content filter gets invoked for Violence.

Amazon Bedrock guardrail test interface showing blocked response due to detected violence

Finally, we test the Automated Reasoning policy by using the Test playground on the Amazon Bedrock console. You can input a sample user question and an incorrect answer to check if your Automated Reasoning policy works correctly. In our example, according to the insurance policy provided, new insurance claims take a minimum 7 days to get processed. Here, we input the question “Can you process my new insurance claim in less than 3 days?” and the incorrect answer “Yes, I can process it in 3 days.”

Amazon Bedrock Automated Reasoning interface showing HealthCareCoveragePolicy test playground and guardrail configuration

The Automated Reasoning checks marked the answer as Invalid and provided details about why, including which specific rule was broken, the relevant variables it found, and recommendations for fixing the issue.

Invalid validation result for electronic claims processing rule showing 7-10 day requirement with extracted CLAIM variable logic

Independent API

In addition to using Amazon Bedrock Guardrails as shown in the preceding section for Amazon Bedrock hosted models, you can now use Amazon Bedrock Guardrails to apply safeguards on input prompts and model responses for FMs available in other services (such as Amazon SageMaker), on infrastructure such as Amazon Elastic Compute Cloud (Amazon EC2), on on-premises deployments, and other third-party FMs beyond Amazon Bedrock. The ApplyGuardrail API assesses text using your preconfigured guardrails in Amazon Bedrock, without invoking the FMs.

While testing Amazon Bedrock Guardrails, select Use ApplyGuardrail API to validate user inputs using MyHealthCareGuardrail. The following test doesn’t require you to choose an Amazon Bedrock hosted model; you can test the configured guardrail as an independent API.

Amazon Bedrock Guardrail API test interface with health-related prompt and safety intervention
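
Outside the console, the same kind of check can be run programmatically. The following minimal sketch calls the ApplyGuardrail API on a user input without invoking any model; the guardrail ID and version are placeholders for the values of MyHealthCareGuardrail.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Minimal sketch: evaluate a user input with ApplyGuardrail, independent of any model.
# The guardrail ID and version are placeholders.
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="<GUARDRAIL_ID>",
    guardrailVersion="1",
    source="INPUT",                       # use "OUTPUT" to assess a model response instead
    content=[{"text": {"text": "Do I have diabetes?"}}],
)
print(response["action"])                 # GUARDRAIL_INTERVENED or NONE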

Conclusion

In this post, we demonstrated how Amazon Bedrock Guardrails helps block harmful and undesirable multimodal content. Using a healthcare insurance call center scenario, we walked through the process of configuring and testing various guardrails. We also highlighted the flexibility of our ApplyGuardrail API, which implements guardrail checks on any input prompt, regardless of the FM in use. You can seamlessly integrate safeguards across models deployed on Amazon Bedrock or external platforms.

Ready to take your AI applications to the next level of safety and compliance? Check out Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions, which enables security and compliance teams to establish mandatory guardrails for model inference calls, helping to consistently enforce your guardrails across AI interactions. To dive deeper into Amazon Bedrock Guardrails, refer to Use guardrails for your use case, which includes advanced use cases with Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents.

This guidance is for informational purposes only. You should still perform your own independent assessment and take measures to ensure that you comply with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the third-party model referenced in this guidance. AWS has no control or authority over the third-party model referenced in this guidance and does not make any representations or warranties that the third-party model is secure, virus-free, operational, or compatible with your production environment and standards. AWS does not make any representations, warranties, or guarantees that any information in this guidance will result in a particular outcome or result.

References


About the authors

Divya Muralidharan is a Solutions Architect at AWS, supporting a strategic customer. Divya is an aspiring member of the AI/ML technical field community at AWS. She is passionate about using technology to accelerate growth, provide value to customers, and achieve business outcomes. Outside of work, she spends time cooking, singing, and growing plants.

Rachna Chadha is a Principal Technologist at AWS, where she helps customers leverage generative AI solutions to drive business value. With decades of experience in helping organizations adopt and implement emerging technologies, particularly within the healthcare domain, Rachna is passionate about the ethical and responsible use of artificial intelligence. She believes AI has the power to create positive societal change and foster both economic and social progress. Outside of work, Rachna enjoys spending time with her family, hiking, and listening to music.

Read More

Effective cost optimization strategies for Amazon Bedrock

Effective cost optimization strategies for Amazon Bedrock

Customers are increasingly using generative AI to enhance efficiency, personalize experiences, and drive innovation across various industries. For instance, generative AI can be used to perform text summarization, facilitate personalized marketing strategies, create business-critical chat-based assistants, and so on. However, as generative AI adoption grows, associated costs can escalate in several areas, including inference, deployment, and model customization. Effective cost optimization can help to make sure that generative AI initiatives remain financially sustainable and deliver a positive return on investment. Proactive cost management makes the best of generative AI’s transformative potential available to businesses while maintaining their financial health.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.

With the increasing adoption of Amazon Bedrock, optimizing costs is a must to help keep the expenses associated with deploying and running generative AI applications manageable and aligned with your organization’s budget. In this post, you’ll learn about strategic cost optimization techniques while using Amazon Bedrock.

Understanding Amazon Bedrock pricing

Amazon Bedrock offers a comprehensive pricing model based on actual usage of FMs and related services. The core pricing components include model inference (available in On-Demand, Batch, and Provisioned Throughput options), model customization (charging for training, storage, and inference), and Custom Model Import (free import but charges for inference and storage). Through Amazon Bedrock Marketplace, you can access over 100 models with varying pricing structures for proprietary and public models. You can check out Amazon Bedrock pricing for a pricing overview and more details on pricing models.

Cost monitoring in Amazon Bedrock

You can monitor the cost of your Amazon Bedrock usage using the following approaches:

Cost optimization strategies for Amazon Bedrock

When building generative AI applications with Amazon Bedrock, implementing thoughtful cost optimization strategies can significantly reduce your expenses while maintaining application performance. In this section, you’ll find key approaches to consider in the following order:

  1. Select the appropriate model
  2. Determine if it needs customization
    1. If yes, explore options in the correct order
    2. If no, proceed to the next step
  3. Perform prompt engineering and management
  4. Design efficient agents
  5. Select the correct consumption option

This flow is shown in the following flow diagram.

Choose an appropriate model for your use case

Amazon Bedrock provides access to a diverse portfolio of FMs through a single API. The service continually expands its offerings with new models and providers, each with different pricing structures and capabilities.

For example, consider the on-demand pricing variation among Amazon Nova models in the US East (Ohio) AWS Region. This pricing is current as of May 21, 2025. Refer to the Amazon Bedrock pricing page for latest data.

As shown in the following table, the price varies significantly between Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro models. For example, Amazon Nova Micro is approximately 1.71 times cheaper than Amazon Nova Lite based on per 1,000 input tokens as of this writing. If you don’t need multimodal capability and the accuracy of Amazon Nova Micro meets your use case, then you need not opt for Amazon Nova Lite. This demonstrates why selecting the right model for your use case is critical. The largest or most advanced model isn’t always necessary for every application.

Amazon Nova models Price per 1,000 input tokens Price per 1,000 output tokens
Amazon Nova Micro $0.000035 $0.00014
Amazon Nova Lite $0.00006 $0.00024
Amazon Nova Pro $0.0008 $0.0032

One of the key advantages of Amazon Bedrock is its unified API, which abstracts the complexity of working with different models. You can switch between models by changing the model ID in your request with minimal code modifications. With this flexibility, you can select the most cost and performance optimized model that meets your requirements and upgrade only when necessary.
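
The following sketch illustrates the idea: the same Converse request is sent to two Amazon Nova models by changing only the model ID. The model IDs are examples; confirm the identifiers available in your Region before use.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def summarize(text: str, model_id: str) -> str:
    # Identical request shape for any Amazon Bedrock model; only modelId changes.
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": f"Summarize in one sentence: {text}"}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

document = "Amazon Bedrock exposes many foundation models behind a single API."
print(summarize(document, "amazon.nova-micro-v1:0"))  # lower-cost option
print(summarize(document, "amazon.nova-lite-v1:0"))   # upgrade only if your use case needs it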

Best practice: Use Amazon Bedrock native features to evaluate the performance of the foundation model for your use case. Begin with an automatic model evaluation job to narrow down the scope. Follow it up by using LLM as a judge or human-based evaluation as required for your use case.

Perform model customization in the right order

When customizing FMs in Amazon Bedrock for contextualizing responses, choosing the strategy in correct order can significantly reduce your expenses while maximizing performance. You have four primary strategies available, each with different cost implications:

  1. Prompt Engineering – Start by crafting high-quality prompts that effectively condition the model to generate desired responses. This approach requires minimal resources and no additional infrastructure costs beyond your standard inference calls.
  2. RAG – Amazon Bedrock Knowledge Bases is a fully managed feature with built-in session context management and source attribution that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows.
  3. Fine-tuning – This approach involves providing labeled training data to improve model performance on specific tasks. Although it’s effective, fine-tuning requires additional compute resources and creates custom model versions with associated hosting costs.
  4. Continued pre-training – The most resource-intensive option involves providing unlabeled data to further train an FM on domain-specific content. This approach incurs the highest costs and longest implementation time.

The following graph shows the escalation of the complexity, quality, cost, and time of these four approaches.

Common approaches for customization

Best practice: Implement these strategies progressively. Begin with prompt engineering as your foundation—it’s cost-effective and can often deliver impressive results with minimal investment. Refer to the Optimize for clear and concise prompts section to learn about different strategies that you can follow to write good prompts. Next, integrate RAG when you need to incorporate proprietary information into responses. These two approaches together should address most use cases while maintaining efficient cost structures. Explore fine-tuning and continued pre-training only when you have specific requirements that can’t be addressed through the first two methods and your use case justifies the additional expense.

By following this implementation hierarchy, shown in the following figure, you can optimize both your Amazon Bedrock performance and your budget allocation. Here is the high-level mental model for choosing different options:

Mental model for choosing Amazon Bedrock options for cost optimization

Use Amazon Bedrock native model distillation feature

Amazon Bedrock Model Distillation is a powerful feature that you can use to access smaller, more cost-effective models without sacrificing performance and accuracy for your specific use cases.

  • Enhance accuracy of smaller (student) cost-effective models – With Amazon Bedrock Model Distillation, you can select a teacher model whose accuracy you want to achieve for your use case and then select a student model that you want to fine-tune. Model distillation automates the process of generating responses from the teacher and using those responses to fine-tune the student model.
  • Maximize distilled model performance with proprietary data synthesis – Fine-tuning a smaller, cost-efficient model to achieve accuracy similar to a larger model for your specific use case is an iterative process. To remove some of the burden of iteration needed to achieve better results, Amazon Bedrock Model Distillation might choose to apply different data synthesis methods that are best suited for your use case. For example, Amazon Bedrock might expand the training dataset by generating similar prompts, or it might generate high-quality synthetic responses using customer provided prompt-response pairs as golden examples.
  • Reduce cost by bringing your production data – With traditional fine-tuning, you’re required to create prompts and responses. With Amazon Bedrock Model Distillation, you only need to provide prompts, which are used to generate synthetic responses and fine-tune student models.

Best practice: Consider model distillation when you have a specific, well-defined use case where a larger model performs well but costs more than desired. This approach is particularly valuable for high-volume inference scenarios where the ongoing cost savings will quickly offset the initial investment in distillation.

Use Amazon Bedrock intelligent prompt routing

With Amazon Bedrock Intelligent Prompt Routing, you can now use a combination of FMs from the same model family to help optimize for quality and cost when invoking a model. For example, you can route between models in Anthropic’s Claude family, such as Claude 3.5 Sonnet and Claude 3 Haiku, depending on the complexity of the prompt. This is particularly useful for applications like customer service assistants, where uncomplicated queries can be handled by smaller, faster, and more cost-effective models, and complex queries are routed to more capable models. Intelligent prompt routing can reduce costs by up to 30% without compromising on accuracy.
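
A prompt router is invoked much like a regular model, by passing the router’s ARN as the model ID in a Converse or InvokeModel call. The following sketch assumes a router you have already configured; the ARN shown is a placeholder.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Minimal sketch: invoke an intelligent prompt router by using its ARN as the model ID.
# The router ARN below is a placeholder for a router configured in your account.
router_arn = "arn:aws:bedrock:us-east-1:<ACCOUNT_ID>:default-prompt-router/anthropic.claude:1"

response = bedrock_runtime.converse(
    modelId=router_arn,
    messages=[{"role": "user", "content": [{"text": "What is the status of order 12345?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])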

Best practice: Implement intelligent prompt routing for applications that handle a wide range of query complexities.

Optimize for clear and concise prompts

Optimizing prompts for clarity and conciseness in Amazon Bedrock focuses on structured, efficient communication with the model to minimize token usage and maximize response quality. Through techniques such as clear instructions, specific output formats, and precise role definitions, you can achieve better results while reducing costs associated with token consumption.

  • Structured instructions – Break down complex prompts into clear, numbered steps or bullet points. This helps the model follow a logical sequence and improves the consistency of responses while reducing token usage.
  • Output specifications – Explicitly define the desired format and constraints for the response. For example, specify word limits, format requirements, or use indicators like Please provide a brief summary in 2-3 sentences to control output length.
  • Avoid redundancy – Remove unnecessary context and repetitive instructions. Keep prompts focused on essential information and requirements because superfluous content can increase costs and potentially confuse the model.
  • Use separators – Employ clear delimiters (such as triple quotes, dashes, or XML-style tags) to separate different parts of the prompt, helping the model distinguish between context, instructions, and examples (see the example prompt after this list).
  • Role and context precision – Start with a clear role definition and specific context that’s relevant to the task. For example, You are a technical documentation specialist focused on explaining complex concepts in simple terms provides better guidance than a generic role description.
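
As an illustration of these techniques, the following hypothetical support-ticket prompt combines a precise role definition, XML-style separators, and an explicit output specification; the task and tags are examples only.

```python
# Illustrative prompt template; the ticket tag and instructions are hypothetical.
PROMPT_TEMPLATE = """You are a technical support analyst who writes concise summaries.

<ticket>
{ticket_text}
</ticket>

Instructions:
1. Identify the customer's core issue.
2. Note any troubleshooting steps already attempted.

Provide a brief summary in 2-3 sentences."""
```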

Best practice: Amazon Bedrock offers a fully managed feature to optimize prompts for select models. This helps reduce costs by improving prompt efficiency and effectiveness, leading to better results with fewer tokens and model invocations. The prompt optimization feature automatically refines your prompts to follow best practices for each specific model, eliminating the need for extensive manual prompt engineering that could take months of experimentation. Use this built-in feature to get started, then continue experimenting to keep prompts clear and concise, reducing the number of tokens without compromising the quality of the responses.

Optimize cost and performance using Amazon Bedrock prompt caching

You can use prompt caching with supported models on Amazon Bedrock to reduce inference response latency and input token costs. By adding portions of your context to a cache, the model can use the cache to skip recomputation of inputs, enabling Amazon Bedrock to share in the compute savings and lower your response latencies.

  • Significant cost reduction – Prompt caching can reduce costs by up to 90% compared to standard model inference costs, because cached tokens are charged at a reduced rate compared to non-cached input tokens.
  • Ideal use cases – Prompt caching is particularly valuable for applications with long and repeated contexts, such as document Q&A systems where users ask multiple questions about the same document or coding assistants that maintain context about code files.
  • Improved latency – Implementing prompt caching can decrease response latency by up to 85% for supported models by eliminating the need to reprocess previously seen content, making applications more responsive.
  • Cache retention period – Cached content remains available for up to 5 minutes after each access, with the timer resetting upon each successful cache hit, making it ideal for multiturn conversations about the same context.
  • Implementation approach – To implement prompt caching, developers identify frequently reused prompt portions, tag these sections using the cachePoint block in API calls, and monitor cache usage metrics (cacheReadInputTokenCount and cacheWriteInputTokenCount) in the response metadata to optimize performance, as shown in the sketch after this list.
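
To make the implementation approach concrete, here is a minimal sketch using the Converse API with a cachePoint block. The model ID and document path are placeholders, and the exact cache metric field names vary between the Converse and InvokeModel APIs, so inspect the response metadata for your model.

```python
import boto3

# Minimal sketch, assuming a model that supports prompt caching; the model ID
# and document path are placeholders.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("large_reference_document.txt") as f:
    document_text = f.read()

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    system=[
        {"text": "Answer questions using only the reference document below."},
        {"text": document_text},
        # Everything above this cachePoint is written to the cache on the first
        # request and read from the cache on subsequent requests.
        {"cachePoint": {"type": "default"}},
    ],
    messages=[{"role": "user", "content": [{"text": "Summarize section 2."}]}],
)

# Print whichever cache usage metrics the response metadata contains.
print({k: v for k, v in response["usage"].items() if "cache" in k.lower()})
```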

Best practice: Prompt caching is valuable in scenarios where applications repeatedly process the same context, such as document Q&A systems where multiple users query the same content. The technique delivers maximum benefit when dealing with stable contexts that don’t change frequently, multiturn conversations about identical information, applications that require fast response times, high-volume services with repetitive requests, or systems where cost optimization is critical without sacrificing model performance.

Cache prompts within the client application

Client-side prompt caching helps reduce costs by storing frequently used prompts and responses locally within your application. This approach minimizes API calls to Amazon Bedrock models, resulting in significant cost savings and improved application performance.

  • Local storage implementation – Implement a caching mechanism within your application to store common prompts and their corresponding responses, using techniques such as in-memory caching (Redis, Memcached) or application-level caching systems (see the sketch after this list).
  • Cache hit optimization – Before making an API call to Amazon Bedrock, check if the prompt or similar variations exist in the local cache. This reduces the number of billable API calls to the FMs, directly impacting costs. You can check Caching Best Practices to learn more.
  • Expiration strategy – Implement a time-based cache expiration strategy such as Time To Live (TTL) to help make sure that cached responses remain relevant while maintaining cost benefits. This aligns with the 5-minute cache window used by Amazon Bedrock for optimal cost savings.
  • Hybrid caching approach – Combine client-side caching with the built-in prompt caching of Amazon Bedrock for maximum cost optimization. Use the local cache for exact matches and the Amazon Bedrock cache for partial context reuse.
  • Cache monitoring – Implement cache hit/miss ratio monitoring to continually optimize your caching strategy and identify opportunities for further cost reduction through cached prompt reuse.
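
The following is a minimal in-process sketch of the local-cache-then-call pattern. The invoke_bedrock callable is a stand-in for your own call to an Amazon Bedrock model; a production system would more likely use Redis or Memcached than a module-level dictionary.

```python
import hashlib
import time
from typing import Callable

# Simple in-memory TTL cache keyed on the prompt text.
_CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 300  # roughly aligned with the 5-minute server-side cache window

def cached_generate(prompt: str, invoke_bedrock: Callable[[str], str]) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    entry = _CACHE.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]                  # cache hit: no billable API call
    answer = invoke_bedrock(prompt)      # cache miss: call the model once
    _CACHE[key] = (time.time(), answer)
    return answer
```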

Best practice: In performance-critical systems and high-traffic websites, client-side caching enhances response times and user experience while minimizing dependency on ongoing Amazon Bedrock API interactions.

Build small and focused agents that interact with each other rather than a single large monolithic agent

Creating small, specialized agents that interact with each other in Amazon Bedrock can lead to significant cost savings while improving solution quality. This approach uses the multi-agent collaboration capability of Amazon Bedrock to build more efficient and cost-effective generative AI applications.

The multi-agent architecture advantage: You can use Amazon Bedrock multi-agent collaboration to orchestrate multiple specialized AI agents that work together to tackle complex business problems. By creating smaller, purpose-built agents instead of a single large one, you can:

  • Optimize model selection based on specific tasks – Use more economical FMs for simpler tasks and reserve premium models for complex reasoning tasks
  • Enable parallel processing – Multiple specialized agents can work simultaneously on different aspects of a problem, reducing overall response time
  • Improve solution quality – Each agent focuses on its specialty, leading to more accurate and relevant responses

Best practice: Select appropriate models for each specialized agent, matching capabilities to task requirements while optimizing for cost: choose lower-cost models for simpler tasks and reserve higher-cost models for complex reasoning. Use AWS Lambda functions that retrieve only the essential data to reduce unnecessary Lambda execution cost. Orchestrate the system with a lightweight supervisor agent that handles coordination efficiently without consuming premium resources.
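
As a simplified illustration of matching models to agent roles (not the full Amazon Bedrock multi-agent collaboration setup), the sketch below pairs hypothetical specialized agents with lower- or higher-cost model IDs; the roles and model IDs are placeholders.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical agent roles mapped to model IDs chosen to balance cost and capability.
AGENT_MODELS = {
    "intent_classifier": "anthropic.claude-3-haiku-20240307-v1:0",     # low cost
    "research_analyst":  "anthropic.claude-3-5-sonnet-20240620-v1:0",  # premium reasoning
    "response_writer":   "anthropic.claude-3-haiku-20240307-v1:0",     # low cost
}

def run_agent(role: str, task: str) -> str:
    response = bedrock_runtime.converse(
        modelId=AGENT_MODELS[role],
        messages=[{"role": "user", "content": [{"text": task}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```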

Choose the desired throughput depending on the usage

Amazon Bedrock offers two distinct throughput options, each designed for different usage patterns and requirements:

  • On-Demand mode – Provides a pay-as-you-go approach with no upfront commitments, making it ideal for early-stage proofs of concept (POCs), development and test environments, and applications with unpredictable, seasonal, or sporadic traffic that varies significantly.

With On-Demand pricing, you’re charged based on actual usage:

    • Text generation models – Pay per input token processed and output token generated
    • Embedding models – Pay per input token processed
    • Image generation models – Pay per image generated
  • Provisioned Throughput mode – With Provisioned Throughput, you purchase dedicated model units for specific FMs to get a higher level of throughput for a model at a fixed cost. This makes Provisioned Throughput suitable for production workloads that require predictable performance without throttling. If you customized a model, you must purchase Provisioned Throughput to be able to use it.

Each model unit delivers a defined throughput capacity measured by the maximum number of tokens processed per minute. Provisioned Throughput is billed hourly with commitment options of 1-month or 6-month terms, with longer commitments offering greater discounts.

Best practice: If you’re working on a POC or on a use case with a sporadic workload using one of the base FMs from Amazon Bedrock, use On-Demand mode to benefit from pay-as-you-go pricing. However, if you’re working on a steady-state workload where throttling must be avoided, or if you’re using custom models, opt for Provisioned Throughput that matches your workload. Calculate your token processing requirements carefully to avoid over-provisioning, as in the sizing sketch that follows.
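
As a rough sizing sketch, the arithmetic below estimates how many model units a workload might need. The per-model-unit capacity is a placeholder assumption; use the published figure for your model and Region before purchasing.

```python
import math

# All figures below are assumptions for illustration only.
TOKENS_PER_MINUTE_PER_MODEL_UNIT = 10_000  # placeholder; check your model's actual capacity
peak_requests_per_minute = 120
avg_tokens_per_request = 1_500             # input plus output tokens

required_tokens_per_minute = peak_requests_per_minute * avg_tokens_per_request
model_units = math.ceil(required_tokens_per_minute / TOKENS_PER_MINUTE_PER_MODEL_UNIT)
print(f"Peak demand of {required_tokens_per_minute} tokens/minute needs about {model_units} model units")
```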

Use batch inference

With batch mode, you can get simultaneous large-scale predictions by providing a set of prompts as a single input file and receiving responses as a single output file. The responses are processed and stored in your Amazon Simple Storage Service (Amazon S3) bucket so you can access them later. Amazon Bedrock offers select FMs from leading AI providers like Anthropic, Meta, Mistral AI, and Amazon for batch inference at a 50% lower price compared to On-Demand inference pricing. Refer to Supported AWS Regions and models for batch inference for more details. This approach is ideal for non-real-time workloads where you need to process large volumes of content efficiently.

Best practice: Identify workloads in your application that don’t require real-time responses and migrate them to batch processing. For example, instead of generating product descriptions on-demand when users view them, pre-generate descriptions for new products in a nightly batch job and store the results. This approach can dramatically reduce your FM costs while maintaining the same output quality.
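
A minimal sketch of submitting such a nightly job is shown below, assuming the CreateModelInvocationJob API with prompts already staged as JSONL records in Amazon S3. The bucket names, IAM role ARN, and model ID are placeholders.

```python
import boto3

# Control-plane client for Amazon Bedrock (not bedrock-runtime).
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholders: replace with your own S3 locations, IAM role, and a model that
# supports batch inference in your Region.
response = bedrock.create_model_invocation_job(
    jobName="nightly-product-descriptions",
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    roleArn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/batch-input/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://amzn-s3-demo-bucket/batch-output/"}},
)
print(response["jobArn"])
```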

Conclusion

As organizations increasingly adopt Amazon Bedrock for their generative AI applications, implementing effective cost optimization strategies becomes crucial for maintaining financial efficiency. The key to successful cost optimization lies in taking a systematic approach: start with basic optimizations such as proper model selection and prompt engineering, then progressively implement more advanced techniques such as caching and batch processing as your use cases mature. Regular monitoring of costs and usage patterns, combined with continuous optimization of these strategies, will help make sure that your generative AI initiatives remain both effective and economically sustainable. Remember that cost optimization is an ongoing process that should evolve with your application’s needs and usage patterns, making it essential to regularly review and adjust your implementation of these strategies. For more information, refer to the Amazon Bedrock pricing page and the documentation for the cost optimization strategies discussed in this post.


About the authors

Biswanath Mukherjee is a Senior Solutions Architect at Amazon Web Services. He works with large strategic customers of AWS by providing them technical guidance to migrate and modernize their applications on AWS Cloud. With his extensive experience in cloud architecture and migration, he partners with customers to develop innovative solutions that leverage the scalability, reliability, and agility of AWS to meet their business needs. His expertise spans diverse industries and use cases, enabling customers to unlock the full potential of the AWS Cloud.

Upendra V is a Senior Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprise customers design and deploy production-ready Generative AI workloads, implement Large Language Models (LLMs) and Agentic AI systems, and optimize cloud deployments. With expertise in cloud adoption and machine learning, he enables organizations to build and scale AI-driven applications efficiently.
