An innovative financial services leader finds the right AI solution: Robinhood and Amazon Nova

This post is cowritten with Renyu Chen and Dev Tagare from Robinhood.

Robinhood has been a pioneer and disruptor in the once staid world of online brokerages. Founded in 2013, the company transformed an industry better known for gatekeeping into an open platform accessible to all. Robinhood pioneered commission-free trading, and harnessed the power of technology and intuitive design to create a seamless and engaging experience for modern investors. To this day, the company continues to disrupt the financial services industry by launching groundbreaking product innovations on AWS.

Such innovations have made Robinhood one of the fastest growing brokerages in history, with more than 25 million customers worldwide and a global reputation as an innovator and technology leader. Fueled by its mission of “democratizing finance for all,” the company’s focus on accessibility, particularly for first-time investors, has kept Robinhood among the top finance apps on the Apple App Store for more than a decade and earned it accolades such as a place on Fast Company magazine’s list of the World’s 50 Most Innovative Companies. This annual ranking highlights companies that are reshaping industries and culture through innovation.

Robinhood’s Chief Executive Officer, Vlad Tenev, explains why this focus is important to Robinhood:

“Our belief is, the more we lower the barriers to entry, the more we level the playing field and allow people to invest their money at a younger age, the better off our economy will be and the better off society will be.”

Built to operate in the cloud, Robinhood uses AWS to power its online business, deliver and update its mobile trading app, securely store information and data, and perform business analytics. Robinhood recently used AI to improve customer experience and expand accessibility. For example, in 2025, the company will launch Robinhood Cortex, an AI investment tool designed to provide real-time insights that help users better navigate markets, identify potential opportunities, and stay up to date on the latest market-moving news. Cortex is an exciting step forward, providing a level of premium investment and market digests that have historically been reserved for institutional investors and wealthy individuals.

As Robinhood customers are able to do more on the platform, the company is working with AWS to explore new generative AI solutions such as Amazon Nova, a family of foundation models (FMs) that make generative AI development faster and more efficient, with exceptional price performance. These new solutions will help the company accommodate rapid expansion of customer requirements.

In this post, we share how Robinhood delivers democratized finance and real-time market insights using generative AI and Amazon Nova.

An AI/ML journey built on customer obsession

Robinhood, like all financial services firms, operates in a highly regulated environment. Historically, the industry was seen as slow-moving and wary of new technologies. Robinhood’s founders put technology at the forefront by initially building a no-frills, no-fee app that, by design, would make investing accessible to everyone, not just the very wealthy. As Robinhood grew, it attracted a wider variety of customers who need the speed, reliability, security, and low cost the platform offers, but who also want a richer set of services for different and novel use cases.

Robinhood listens closely to these active traders. As Renyu Chen, staff machine learning (ML) engineer at Robinhood, explains,

“We wanted to create a seamless journey for AI/ML applications to go from experimentation to Robinhood scale. We looked to the AWS team to help meet the AI/ML needs of our developers while providing advanced ML tooling to serve our most sophisticated ‘active trader’ customers. This would also require a plug-and-play approach that could adopt the latest generative AI technologies from open source, model providers, and home-grown platform tooling.”

Robinhood explored various generative AI solutions during 2023, concluding that the best way to get to Robinhood scale was with Amazon Bedrock, a fully managed service that helps users build generative AI models. Amazon Bedrock offers an extensive selection of FMs from various providers, and allows a high level of customization and security through a single API.

According to Robinhood’s Renyu Chen,

“For us, the security of our customers’ data comes first. Nothing is more important. With Amazon Bedrock, data stays under our control. When we query a model, the input and output never leave our virtual private cloud. When we fine-tune a foundation model, it is based on a private copy of that model. This means our customers’ data is not shared with model providers, and is not used to improve the base models.”

To meet the needs of Robinhood’s ever-growing base of power users, Robinhood is exploring Amazon Nova, estimating that the price per token using Amazon Nova can be up to 80% lower than other models they have tested, which would make it cost-effective to power new high-demand use cases such as a fraud investigation assistant, enhanced document processing, and AI-created content generation.
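As a hedged illustration (not Robinhood's implementation), the following sketch shows how an application could call an Amazon Nova model through the Amazon Bedrock Converse API; the model ID, prompt, and inference settings are placeholder assumptions.

# Minimal sketch: calling an Amazon Nova model through the Amazon Bedrock Converse API.
# The model ID, prompt, and inference settings below are illustrative placeholders.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="amazon.nova-lite-v1:0",  # assumed Nova model ID; confirm model access in your account
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize today's market-moving news for a first-time investor."}],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])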

In addition, AWS generative AI solutions working through Amazon Nova can power new agentic workflows for Robinhood, in which autonomous AI agents can independently make decisions, adapt to changing situations, and execute actions.

“Robinhood offers its customers simplicity, speed, security, and cost savings. Working developer-to-developer with the Robinhood team and building together, we can design generative AI solutions that meet Robinhood’s priorities and customer-focused goals. For example, Amazon Nova models can be easily customized with Amazon Bedrock Model Distillation, which ‘distills’ knowledge from a larger, more capable ‘teacher’ model to a smaller, faster, and cost-efficient ‘student’ model. This solution can help Robinhood use models such as DeepSeek to explore exciting new use cases quickly, securely, and at a 75% lower cost than equivalent offerings from competitors.”

– Dushan Tharmal, Principal Product Manager, Amazon Artificial General Intelligence (AGI).

Amazon Nova: More services, greater value for Robinhood and its customers

Working with AWS on its ambitious AI journey, Robinhood is able to rapidly scale new services for customers without needing the costly structures, staff, and infrastructure found at traditional brokerages. With support from AWS, Robinhood is able to offer a richer customer experience while remaining true to its mission of simplicity, clarity, low cost, speed, security, and reliability.

“We see that Amazon Nova can be a great match for our mission. Amazon Nova offers the lowest latency responses at very low cost, and is accurate and lightning-fast across a wide range of interactive and high-volume Robinhood applications. And, consistent with Robinhood’s commitment to simplicity and low cost for its customers, using Amazon Nova models through Amazon Bedrock makes these large-scale tasks significantly easier, cheaper, and more cost-effective.”

– Dev Tagare, Robinhood’s head of AI.

Learn more about Amazon Nova and how it can deliver frontier intelligence and industry leading price-performance for your organization.


About the authors

Renyu Chen is a Staff AI Engineer at Robinhood Markets.

Dev Tagare is the Head of AI at Robinhood Markets.

Uchenna Egbe is a GenAI Solutions Architect at AWS FSI.

Trevor Spires is a GenAI Solutions Architect at AWS FinTech.

Read More

Build conversational interfaces for structured data using Amazon Bedrock Knowledge Bases

Organizations manage extensive structured data in databases and data warehouses. Large language models (LLMs) have transformed natural language processing (NLP), yet converting conversational queries into structured data analysis remains complex. Data analysts must translate business questions into SQL queries, creating workflow bottlenecks.

Amazon Bedrock Knowledge Bases enables direct natural language interactions with structured data sources. The system interprets database schemas and context, converting natural language questions into accurate queries while maintaining data reliability standards. You can chat with your structured data by setting up structured data ingestion from AWS Glue Data Catalog tables and Amazon Redshift clusters in a few steps, using the power of Amazon Bedrock Knowledge Bases structured data retrieval.

This post provides instructions to configure a structured data retrieval solution, with practical code examples and templates. It covers implementation samples and additional considerations, empowering you to quickly build and scale your conversational data interfaces. Through clear examples and proven methodologies, organizations can transform their data access capabilities and accelerate decision-making processes.

Solution overview

The solution demonstrates how to build a conversational application using Amazon Bedrock Knowledge Bases structured data retrieval. Developers often face challenges integrating structured data into generative AI applications. This includes difficulties training LLMs to convert natural language queries to SQL queries based on complex database schemas, as well as making sure appropriate data governance and security controls are in place. Amazon Bedrock Knowledge Bases alleviates these complexities by providing a managed natural language to SQL (NL2SQL) module. Amazon Bedrock Knowledge Bases offers an end-to-end managed workflow for you to build custom generative AI applications that can access and incorporate contextual information from a variety of structured and unstructured data sources. Using advanced NLP, Amazon Bedrock Knowledge Bases can transform natural language queries into SQL queries, so you can retrieve data directly from the source without the need to move or preprocess the data.

This solution includes Amazon Bedrock Knowledge Bases, Amazon Redshift, AWS Glue, and Amazon Simple Storage Service (Amazon S3). The solution architecture consists of two parts: a data ingestion pipeline, and a structured data retrieval application using Amazon Bedrock Knowledge Bases.

Amazon Bedrock Knowledge Bases structured data retrieval supports Amazon Redshift as the query engine and multiple data ingestion options. The data ingestion pipeline is a one-time setup, and supports multiple ingestion options. In this post, we discuss a common data ingestion use case using Amazon S3, AWS Glue, and Amazon Redshift.

You can configure Amazon Bedrock Knowledge Bases structured data retrieval to retrieve data from AWS Glue databases and S3 datasets. This setup uses automatic mounting of the Data Catalog in Amazon Redshift. With this ingestion option, you can seamlessly integrate existing S3 datasets and Data Catalog tables into your Retrieval Augmented Generation (RAG) applications with the access permissions configured through Lake Formation. The following diagram illustrates this pipeline.

Data flow from Dataset to Amazon S3, AWS Glue, and Amazon Redshift

The following screenshot shows the configuration options on the Amazon Bedrock console.

Storage metadata configuration interface with Redshift and Glue Data Catalog options

After the data ingestion is configured and the knowledge bases data source sync job is complete, users can ask natural language questions, and Amazon Bedrock Knowledge Bases will generate the SQL, execute the SQL against the query engine, and process it through the LLM to provide a user-friendly response. The following diagram illustrates a sample architecture of the structured data retrieval workflow.

Amazon Bedrock Knowledge Bases structured data retrieval conversational flow

The data retrieval workflow consists of the following steps:

  1. In a RAG application, the user can ask a natural language data analytics question through the chat interface, such as “What is the sales revenue for the Month of February 2025?”
  2. The natural language query is sent to Amazon Bedrock Knowledge Bases for data retrieval and processing.
  3. Amazon Bedrock Knowledge Bases generates a SQL query based on the underlying data schema configured during the knowledge base creation.
  4. The SQL query is executed against the query engine (Amazon Redshift) to retrieve data from a structured data store (AWS Glue tables). The query can include multiple joins and aggregation.
  5. The generated SQL response is sent to an LLM along with additional context to generate a response in natural language.
  6. The response is sent back to the user. The user can ask follow-up questions based on the retrieved response, such as “What is the product that generated highest revenue in this period?”

Amazon Bedrock Knowledge Bases structured data retrieval supports three different APIs to meet your data retrieval requirements (a minimal SDK sketch follows this list):

  • Retrieval and response generation – The retrieval and response generation API, similar to the solution workflow we’ve discussed, generates a SQL query, retrieves data through the query engine, and processes it through the LLM to generate a natural language response
  • Retrieval only – The retrieval only API generates a SQL query, retrieves data through the query engine, and returns the data without processing it through an LLM
  • Generate SQL queries – The generate SQL query API returns the raw SQL query that was generated by Amazon Bedrock Knowledge Bases, which can be used for review and further processing by applications
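As a minimal sketch (not production code), the following shows how these APIs can be called with the AWS SDK for Python (Boto3) through the bedrock-agent-runtime client. The knowledge base ID, model ARN, and account number are placeholders, and the exact parameter and response field names for the generate SQL call are assumptions to confirm against the current API reference.

import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
kb_id = "YOUR_KB_ID"  # placeholder knowledge base ID
model_arn = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0"  # placeholder

question = "What is the sales revenue for the month of February 2025?"

# 1) Retrieval and response generation: SQL is generated, executed, and the result is summarized by an LLM
rag = client.retrieve_and_generate(
    input={"text": question},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {"knowledgeBaseId": kb_id, "modelArn": model_arn},
    },
)
print(rag["output"]["text"])

# 2) Retrieval only: returns the query results without LLM post-processing
rows = client.retrieve(knowledgeBaseId=kb_id, retrievalQuery={"text": question})
print(rows["retrievalResults"])

# 3) Generate SQL queries: returns the generated SQL for review; parameter names here are assumptions
sql = client.generate_query(
    queryGenerationInput={"type": "TEXT", "text": question},
    transformationConfiguration={
        "mode": "TEXT_TO_SQL",
        "textToSqlConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseArn": f"arn:aws:bedrock:us-east-1:123456789012:knowledge-base/{kb_id}"
            },
        },
    },
)
print(sql.get("queries"))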

The following screenshot shows the configuration options on the Amazon Bedrock console.

Amazon Bedrock Knowledge Bases retrieval API

Code resources and templates

The solution uses the following notebooks:

  • Data ingestion notebook – Structured-rag-s3-glue-ingestion includes the step-by-step guide to ingest an open dataset to Amazon S3, configure AWS Glue tables using crawlers, and set up the Amazon Redshift Serverless query engine.
  • Structured data retrieval notebook – Structured-rag-s3-glue-retrieval walks through the implementation steps and provides sample code for configuring Amazon Bedrock Knowledge Bases structured data retrieval using Amazon S3, AWS Glue, and the Amazon Redshift query engine.

For more details, refer to the GitHub repo.

Prerequisites

To implement the solution provided in this post, you must have an AWS account. Additionally, access to the required foundation models must be enabled in Amazon Bedrock.

Set up the data ingestion pipeline

To set up the data ingestion pipeline, we load the sample dataset into an S3 bucket and configure AWS Glue as data storage and a Redshift Serverless workgroup as the query engine. Complete the following steps in the data ingestion notebook (a hedged boto3 sketch of this setup follows the list):

  1. For data ingestion, download the following sample ecommerce dataset, convert it to a pandas data frame, and upload it to an S3 bucket using Amazon SageMaker Data Wrangler.
  2. Create an AWS Glue database and table using an AWS Glue crawler by crawling the source S3 bucket with the dataset. You can update this step to crawl your own S3 bucket or use your existing Data Catalog tables as storage metadata.
  3. Use the data ingestion notebook to create a Redshift Serverless namespace and workgroup in the default VPC. If you plan to use your own Redshift Serverless workgroup or Amazon Redshift provisioned cluster, you can skip this step.
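The notebook automates these steps; the following is a hedged boto3 sketch of the same setup under assumed resource names (the bucket, IAM role, database, and workgroup names are placeholders, not values from the notebook).

import boto3

glue = boto3.client("glue", region_name="us-east-1")
redshift_serverless = boto3.client("redshift-serverless", region_name="us-east-1")

# Catalog the dataset uploaded to Amazon S3 into an AWS Glue database
glue.create_crawler(
    Name="structured-rag-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder crawler role
    DatabaseName="ecommerce_db",
    Targets={"S3Targets": [{"Path": "s3://your-bucket/ecommerce-dataset/"}]},
)
glue.start_crawler(Name="structured-rag-crawler")

# Create the Redshift Serverless namespace and workgroup used as the query engine
redshift_serverless.create_namespace(namespaceName="structured-rag-namespace")
redshift_serverless.create_workgroup(
    workgroupName="structured-rag-workgroup",
    namespaceName="structured-rag-namespace",
    baseCapacity=8,  # smallest RPU capacity; adjust for your workload
)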

Set up the structured data retrieval solution

In this section, we detail the steps to set up the structured data retrieval component of the solution.

Amazon Bedrock Knowledge Bases supports multiple data access patterns, including AWS Identity and Access Management (IAM), AWS Secrets Manager, and database users. For this post, we demonstrate the setup option with IAM access. You can use IAM access with the Redshift Serverless workgroup configured as part of the ingestion workflow, or with an existing Redshift Serverless or provisioned cluster, to complete these steps.

Complete the following steps in the structured data retrieval notebook (a hedged CreateKnowledgeBase sketch follows the list):

  1. Create an execution role with the necessary policies for accessing data from Amazon Redshift, AWS Glue, and the S3 bucket.
  2. Invoke the CreateKnowledgeBase API to create the knowledge base with the execution role and knowledge base configurations. In the knowledge base configuration, the AWS Glue database and tables are used as storage metadata with Amazon Redshift as the query engine.
  3. After you create the knowledge base, you must complete additional steps to make sure the IAM execution role has the necessary permissions to execute the query in Amazon Redshift and retrieve data from AWS Glue. The notebook includes the necessary instructions to create and grant database access to the execution role, and grant AWS Lake Formation permissions.
  4. The ingestion job will sync the data store schema metadata about AWS Glue database and tables with the NL2SQL module. This schema metadata will be used while generating the SQL query during structured data retrieval.
  5. After the knowledge base sync job is complete, you can use the three data retrieval APIs – retrieve and generate response, retrieval only, and generate SQL query – to query and validate the structured data retrieval solution.
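The following is a hedged sketch of the CreateKnowledgeBase call from step 2. The nested configuration reflects the SQL/Amazon Redshift knowledge base shape at a high level, but treat the exact field names and the ARNs as assumptions to confirm against the current Amazon Bedrock API reference.

import boto3

bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

response = bedrock_agent.create_knowledge_base(
    name="structured-rag-kb",
    roleArn="arn:aws:iam::123456789012:role/BedrockKbExecutionRole",  # execution role from step 1 (placeholder ARN)
    knowledgeBaseConfiguration={
        "type": "SQL",
        "sqlKnowledgeBaseConfiguration": {
            "type": "REDSHIFT",
            "redshiftConfiguration": {
                # Amazon Redshift Serverless workgroup as the query engine, authenticated with IAM
                "queryEngineConfiguration": {
                    "type": "SERVERLESS",
                    "serverlessConfiguration": {
                        "workgroupArn": "arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/structured-rag-workgroup",
                        "authConfiguration": {"type": "IAM"},
                    },
                },
                # AWS Glue Data Catalog tables as storage metadata (table name is a placeholder)
                "storageConfigurations": [
                    {
                        "type": "AWS_DATA_CATALOG",
                        "awsDataCatalogConfiguration": {"tableNames": ["ecommerce_db.orders"]},
                    }
                ],
            },
        },
    },
)
print(response["knowledgeBase"]["knowledgeBaseId"])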

For more details, refer to Create a knowledge base by connecting to a structured data store.

Clean up

We have included cleanup instructions in both the data ingestion and structured data retrieval notebooks to clean up resources after the end-to-end solution is implemented and validated.

Conclusion

Amazon Bedrock Knowledge Bases simplifies data analysis by converting natural language questions into SQL queries, eliminating the need for specialized database expertise. The service integrates with Amazon Redshift, AWS Glue, and Amazon S3, allowing business analysts, data scientists, and operations teams to query data directly using conversation-like questions. It maintains data security through built-in governance controls and access permissions. Customers can deploy this managed service to enable users to analyze data using natural language questions, while maintaining data integrity and security standards.

To learn more, refer to Build a knowledge base by connecting to a structured data store and Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the authors

George Belsian is a Senior Cloud Application Architect at Amazon Web Services, helping organizations navigate the complexities of cloud adoption, AI integration, and data-driven innovation. By transforming legacy systems into cloud-based platforms and incorporating AI/ML capabilities, he helps businesses create new opportunities for growth, optimize their processes, and deliver scalable solutions.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Gopikrishnan Anilkumar is a Principal Technical Product Manager in AWS Agentic AI organization. He has over 10 years of product management experience across a variety of domains and is passionate about AI/ML.

Read More

How Apollo Tyres is unlocking machine insights using agentic AI-powered Manufacturing Reasoner

This is a joint post co-authored with Harsh Vardhan, Global Head, Digital Innovation Hub, Apollo Tyres Ltd.

Apollo Tyres, headquartered in Gurgaon, India, is a prominent international tire manufacturer with production facilities in India and Europe. The company markets its products under its two global brands, Apollo and Vredestein, and its products are available in over 100 countries through a vast network of branded, exclusive, and multiproduct outlets. The company’s product portfolio includes the entire range of passenger car, SUV, MUV, light truck, truck-bus, two-wheeler, agriculture, industrial, specialty, bicycle, and off-the-road tires and retreading materials.

Apollo Tyres has started an ambitious digital transformation journey to streamline its entire business value process, including manufacturing. The company collaborated with Amazon Web Services (AWS) to implement a centralized data lake using AWS services. Additionally, Apollo Tyres enhanced its capabilities by unlocking insights from the data lake using generative AI powered by Amazon Bedrock across business values.

In this pursuit, they developed Manufacturing Reasoner, powered by Amazon Bedrock Agents, a custom solution that automates multistep tasks by seamlessly connecting with the company’s systems, APIs, and data sources. The solution has been developed, deployed, piloted, and scaled out to identify areas to improve, standardize, and benchmark the cycle time beyond the total effective equipment performance (TEEP) and overall equipment effectiveness (OEE) of highly automated curing presses. The curing machines are connected to the AWS Cloud through the industrial Internet of Things (IoT) and send real-time sensor, process, operational, event, and condition monitoring data to the AWS Cloud.

In this post, we share how Apollo Tyres used generative AI with Amazon Bedrock to harness the insights from their machine data in a natural language interaction mode to gain a comprehensive view of its manufacturing processes, enabling data-driven decision-making and optimizing operational efficiency.

The challenge: Reducing dry cycle time for highly automated curing presses and improving operational efficiency

Before the Manufacturing Reasoner solution, plant engineers conducted manual analysis to identify bottlenecks and focus areas using an industrial IoT descriptive dashboard for the dry cycle time (DCT) of curing presses across all machines, SKUs, cure mediums, suppliers, machine types, subelements, sub-subelements, and more. The analysis and identification of these focus areas across curing presses, among millions of parameters on real-time operations, used to take anywhere from an average of 2 elapsed hours to approximately 7 hours per issue. Additionally, subelemental level analysis (that is, bottleneck analysis of subelemental and sub-subelemental activities) wasn’t possible using traditional root cause analysis (RCA) tools. The analysis required subject matter experts (SMEs) from various departments such as manufacturing, technology, and industrial engineering to come together and perform RCA. As the insights were not generated in real time, corrective actions were delayed.

Solution impact

With the agentic AI Manufacturing Reasoner, the goal was to empower their plant engineers to perform corrective actions on accelerated RCA insights to reduce curing DCT. This agentic AI solution and virtual experts (agents) help plant engineers interact with industrial IoT connected to big data in natural language (English) to retrieve relevant insights and provide insightful recommendations for resolving operational issues in DCT processes. The RCA agent offers detailed insights and self-diagnosis or recommendations, identifying which of the over 25 automated subelements or activities should be focused on across more than 250 automated curing presses, more than 140 stock-keeping units (SKUs), three types of curing mediums, and two types of machine suppliers. The goal is to achieve the best possible reduction in DCT across three plants. Through this innovation, plant engineers now have a thorough understanding of their manufacturing bottlenecks. This comprehensive view supports data-driven decision-making and enhances operational efficiency. They realized an approximate 88% reduction in effort in assisting RCA for DCT through self-diagnosis of bottleneck areas on streaming and real-time data. The generative AI assistant reduces the DCT RCA from up to 7 hours per issue to less than 10 minutes per issue. Overall, the targeted benefit is expected to save approximately 15 million Indian rupees (INR) per year just in the passenger car radial (PCR) division across their three manufacturing plants.

This virtual reasoner also offers real-time triggers to highlight continuous anomalous shifts in DCT for mistake-proofing or error prevention in line with the Poka-yoke approach, leading to appropriate preventative actions. The following are additional benefits offered by the Manufacturing Reasoner:

  • Observability of elemental-wise cycle time along with graphs and statistical process control (SPC) charts, press-to-press direct comparison on the real-time streaming data
  • On-demand RCA on streaming data, along with daily alerts to manufacturing SMEs

“Imagine a world where business associates make real-time, data-driven decisions, and AI collaborates with humans. Our transformative generative AI solution is designed, developed, and deployed to make this vision a reality. This in-house Manufacturing Reasoner, powered by generative AI, is not about replacing human intelligence; it is about amplifying it.”

– Harsh Vardhan, Global Head, Digital Innovation Hub, Apollo Tyres Ltd.

Solution overview

By using Amazon Bedrock features, Apollo Tyres implemented an advanced auto-diagnosis Manufacturing Reasoner designed to streamline RCA and enhance decision-making. This tool uses a generative AI–based machine root cause reasoner that facilitated accurate analysis through natural language queries, provided predictive insights, and referenced a reliable Amazon Redshift database for actionable data. The system enabled proactive maintenance by predicting potential issues, optimizing cycle times, and reducing inefficiencies. Additionally, it supported staff with dynamic reporting and visualization capabilities, significantly improving overall productivity and operational efficiency.

The following diagram illustrates the multibranch workflow.

The following diagram illustrates the process flow.

To enable the workflow, Apollo Tyres followed these steps (a minimal agent invocation sketch follows the list):

  1. Users ask their questions in natural language through the UI, which is a Chainlit application hosted on Amazon Elastic Compute Cloud (Amazon EC2).
  2. The question is picked up by the primary AI agent, which classifies the complexity of the question and decides which agent to call for multistep reasoning, with the help of different AWS services.
  3. Amazon Bedrock Agents uses Amazon Bedrock Knowledge Bases and the vector database capabilities of Amazon OpenSearch Service to extract relevant context for the request:
    1. Complex transformation engine agent – This agent works as an on-demand and complex transformation engine for the context and specific question.
    2. RCA agent – This agent for Amazon Bedrock constructs a multistep, multi–large language model (LLM) workflow to perform detailed automated RCA, which is particularly useful for complex diagnostic scenarios.
  4. The primary agent calls the explainer agent and visualization agent concurrently using multiple threads:
    1. Explainer agent – This agent for Amazon Bedrock uses Anthropic’s Claude Haiku model to generate explanations in two parts:
      1. Evidence – Provides a step-by-step logical explanation of the executed query or CTE.
      2. Conclusion – Offers a brief answer to the question, referencing Amazon Redshift records.
    2. Visualization agent – This agent for Amazon Bedrock generates Plotly chart code for creating visual charts using Anthropic’s Claude Sonnet model.
  5. The primary agent combines the outputs (records, explanation, chart code) from both agents and streams them to the application.
  6. The UI renders the result to the user by dynamically displaying the statistical plots and formatting the records in a table.
  7. Amazon Bedrock Guardrails helped set up tailored filters and response limits, which made sure that interactions with machine data were not only secure but also relevant and compliant with established operational guidelines. The guardrails also helped prevent errors and inaccuracies by automatically verifying the validity of information, which was essential for accurately identifying the root causes of manufacturing problems.
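The following is a minimal sketch of how a UI layer can invoke a Bedrock agent such as the primary orchestrator through the bedrock-agent-runtime client; the agent IDs and the question are placeholders, not values from Apollo Tyres' deployment.

import uuid

import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = runtime.invoke_agent(
    agentId="AGENT_ID",             # primary agent (placeholder)
    agentAliasId="AGENT_ALIAS_ID",  # placeholder alias
    sessionId=str(uuid.uuid4()),    # keeps multi-turn context per user
    inputText="Which curing press subelement is driving the dry cycle time deviation today?",
)

# The completion is an event stream; concatenate the streamed chunks into the final answer
answer = ""
for event in response["completion"]:
    chunk = event.get("chunk")
    if chunk:
        answer += chunk["bytes"].decode("utf-8")
print(answer)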

The following screenshot shows an example of the Manufacturing Reasoner response.

The following diagram shows an example of the Manufacturing Reasoner dynamic chart visualization.

“As we integrate this generative AI solution, built on Amazon Bedrock, to automate RCA into our plant curing machines, we’ve seen a profound transformation in how we diagnose issues and optimize operations,” says Vardhan. “The precision of generative AI–driven insights has enabled plant engineers to not only accelerate problem finding from an average of 2 hours per scenario to less than 10 minutes now but also refine focus areas to make improvements in cycle time (beyond TEEP). Real-time alerts notify process SMEs to act on bottlenecks immediately and advanced diagnosis features of the solution provide subelement-level information about what’s causing deviations.”

Lessons learned

Apollo Tyres learned the following takeaways from this journey:

  • Applying generative AI to streaming real-time industrial IoT data requires extensive research due to the unique nature of each use case. To develop an effective manufacturing reasoner for automated RCA scenarios, Apollo Tyres explored several strategies from the prototype to the proof-of-concept stages.
  • In the beginning, the solution faced significant delays in response times when using Amazon Bedrock, particularly when multiple agents were involved. The initial response times exceeded 1 minute for data retrieval and processing by all three agents. To address this issue, efforts were made to optimize performance. By carefully selecting appropriate LLMs and small language models (SLMs) and disabling unused workflows within the agent, the response time was successfully reduced to approximately 30–40 seconds. These optimizations played a crucial role in boosting the solution’s efficiency and responsiveness, leading to smoother operations and an enhanced user experience across the system.
  • While using the capabilities of LLMs to generate code for visualizing data through charts, Apollo Tyres faced challenges when dealing with extensive datasets. Initially, the generated code often contained inaccuracies or failed to handle large volumes of data correctly. To address this issue, they embarked on a process of continuous refinement, iterating multiple times to enhance the code generation process. Their efforts focused on developing a dynamic approach that could accurately generate chart code capable of efficiently managing data within a data frame, regardless of the number of records involved. Through this iterative approach, they significantly improved the reliability and robustness of the chart generation process, making sure that it could handle substantial datasets without compromising accuracy or performance.
  • Consistency issues were effectively resolved by making sure the correct data format is ingested into the Amazon data lake for the knowledge base, structured as follows:
{
  "Question": <question in natural language>,
  "Query": <Complex Transformation Engine scripts>,
  "Metadata": <metadata>
}

Next steps

The Apollo Tyres team is scaling the successful solution from tire curing to various areas across different locations, advancing toward its Industry 5.0 goal. To achieve this, Amazon Bedrock will play a pivotal role in extending the multi-agentic Retrieval Augmented Generation (RAG) solution. This expansion involves using specialized agents, each dedicated to specific functionalities. By implementing agents with distinct roles, the team aims to enhance the solution’s capabilities across diverse operational domains.

Furthermore, the team is focused on benchmarking and optimizing the time required to deliver accurate responses to queries. This ongoing effort will streamline the process, providing faster and more efficient decision-making and problem-solving capabilities across the extended solution. Apollo Tyres is also exploring generative AI using Amazon Bedrock for its other manufacturing and nonmanufacturing processes.

Conclusion

In summary, Apollo Tyres used generative AI through Amazon Bedrock and Amazon Bedrock Agents to transform raw machine data into actionable insights, achieving a holistic view of their manufacturing operations. This enabled more informed, data-driven decision-making and enhanced operational efficiency. By integrating generative AI–based manufacturing reasoners and RCA agents, they developed a machine cycle time diagnosis assistant capable of pinpointing focus areas across more than 25 subprocesses, more than 250 automated curing presses, more than 140 SKUs, three curing mediums, and two machine suppliers. This solution helped drive targeted improvements in DCT across three plants, with targeted annualized savings of approximately INR 15 million within the PCR segment alone and achieving an approximate 88% reduction in manual effort for root cause analysis.

“By embracing this agentic AI-driven approach, Apollo Tyres is redefining operational excellence—unlocking hidden capacity through advanced ‘asset sweating’ while enabling our plant engineers to communicate with machines in natural language. These bold, in-house AI initiatives are not just optimizing today’s performance but actively building the firm foundation for intelligent factories of the future driven by data and human-machine collaboration.”

– Harsh Vardhan.

To learn more about Amazon Bedrock and getting started, refer to Getting started with Amazon Bedrock. If you have feedback about this post, leave a comment in the comments section.


About the authors

Harsh Vardhan is a distinguished global leader in business-first, AI-first digital transformation with over two decades of industry experience. As the Global Head of the Digital Innovation Hub at Apollo Tyres Limited, he leads the industrialisation of AI-led digital manufacturing and Industry 4.0/5.0 excellence, and fosters an enterprise-wide AI-first innovation culture. He is an A+ contributor in the field of advanced AI with an Arctic Code Vault badge, a Strategic Intelligence member at the World Economic Forum, and an executive member of the CII National Committee. He is an avid reader and loves to drive.

Gautam Kumar is a Solutions Architect at Amazon Web Services. He helps various enterprise customers design and architect innovative solutions on AWS. Outside work, he enjoys travelling and spending time with family.

Deepak Dixit is a Solutions Architect at Amazon Web Services, specializing in Generative AI and cloud solutions. He helps enterprises architect scalable AI/ML workloads, implement Large Language Models (LLMs), and optimize cloud-native applications.

Read More

Extend your Amazon Q Business with PagerDuty Advance data accessor

This blog post is co-written with Jacky Leybman from PagerDuty.

As organizations scale their digital operations, they face unprecedented challenges in managing and extracting value from their vast data ecosystems, particularly when it comes to data accessibility and quality. The complexity of modern IT operations demands solutions that can efficiently integrate, process, and deliver actionable insights.

In this post, we demonstrate how organizations can enhance their incident management capabilities by integrating PagerDuty Advance, an innovative set of agentic and generative AI capabilities that automate response workflows and provide real-time insights into operational health, with Amazon Q Business. We show how to configure PagerDuty Advance as a data accessor for Amazon Q indexes, so you can search and access enterprise knowledge across multiple systems during incident response. We also explore the key benefits of this integration, including improved search capabilities across connected platforms and enhanced data processing for faster incident resolution, supported by robust security features. The post includes a step-by-step implementation guide to help you set up this integration in your environment.

Understanding the components

PagerDuty, a leading digital operations management platform that helps organizations prevent and resolve business-impacting incidents, uses sophisticated ML and AI to automate response workflows and provide real-time insights into operational health. As the first incident management platform to integrate with Amazon Q Business, PagerDuty offers an enterprise-grade incident management and operational intelligence solution that you can interact with through your corporate communications tool and that can analyze data across multiple software as a service (SaaS) applications, breaking down the data silos that typically hinder AI’s potential to drive operational resilience. PagerDuty Advance is a comprehensive suite of generative and agentic AI capabilities for the PagerDuty platform, purpose-built to elevate operational efficiency with less effort and faster, automated actions supported by intelligent context at every step of the way.

Amazon Q index for independent software vendors (ISVs) is a capability that lets ISVs seamlessly integrate their generative AI applications with customers’ enterprise data and metadata through an Amazon Q index, so customers can search across their application data alongside other enterprise content. This integration capability makes sure that ISVs can offer their customers a unified search experience while maintaining strict security, access controls, and ownership over their data.

When combined with the intelligent search and insight derivation capabilities of Amazon Q index, organizations gain a complete solution that transforms how they handle operational data. The integration enables a variety of use cases that enhance operational efficiency and incident management across the enterprise, as demonstrated by PagerDuty Advance.

The integration creates a relationship where the refined indexing of data from Amazon Q can be combined with PagerDuty’s real-time incident data, creating a unified view of operational intelligence. Through the Amazon Q data accessor capability, PagerDuty Advance can securely access and analyze data from over 100 different SaaS applications typically used by businesses, making previously siloed data actionable and valuable for incident prevention and resolution.

The following video shows this solution in action, as an agent uses PagerDuty Advance to identify an incident cause and request troubleshooting advice.

Q index on PagerDuty Advance Demo

Benefits for enterprises

Enterprises often struggle with incident resolution, spending precious time searching through multiple systems for answers. Imagine a scenario where your team receives a critical alert—with the integration of PagerDuty Advance and Amazon Q index, you can quickly access relevant runbooks for resolution steps from Confluence or identify potentially related GitHub commits that might have triggered the issue. This seamless integration transforms the incident management experience:

  • Improved search capabilities – Amazon Q index augments the generative AI Q&A experience by providing semantically relevant enterprise content across connected systems, resulting in contextually appropriate and actionable results. Teams can quickly locate information across Confluence, GitHub, and other integrated platforms, significantly reducing search time.
  • Enhanced data processing – The system continuously ingests and analyzes operational data, automatically correlating incidents and identifying patterns across connected systems. Through intelligent parsing of documentation and code repositories, it creates automatic links between incidents, relevant documentation, and GitHub changes while providing a unified view of operational data. This analysis converts siloed raw data into actionable insights, enabling automated suggestions for resolution and trend analysis for proactive improvements.
  • Cost optimization – Organizations can achieve significant cost savings through reduced mean time to resolution (MTTR) and optimized resource allocation. By having immediate access to runbooks, past resolution information, and related code changes, teams can resolve incidents faster and more efficiently. The integration streamlines workflows and automates routine tasks, resulting in decreased operational overhead. Teams can accomplish more with existing resources, leading to improved return on investment (ROI) on technology investments.
  • Security benefits – Security is paramount in the integrated solution, with Amazon Q index implementing robust identity-aware access controls. Enterprise index data remains securely stored within the enterprise environment, and the PagerDuty data accessor capability only retrieves relevant content through the Search Relevant Content API—a specialized API designed for enterprise applications to securely search and retrieve contextually relevant information across their data sources—providing secure and reliable data access. Through this identity awareness API, the system authenticates and validates each user’s permissions before returning search results or document access. This means users will only see information from documents they have explicit permissions to access—if a user doesn’t have access to specific Confluence pages or GitHub repositories, those results will be automatically filtered out from their search results. The system features complete end-to-end encryption to protect sensitive operational data, and the role-based access control integrates with your existing identity management systems. This zero-trust security approach maintains compliance with industry standards and provides organizations with granular control over their sensitive operational data, reducing the risk of unauthorized access to confidential information.
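As a hedged sketch of the retrieval path described above, the following shows a Search Relevant Content call against an Amazon Q index using the AWS SDK for Python (Boto3). The application ID, retriever ID, and query are placeholders; in the actual data accessor integration, PagerDuty Advance obtains identity-aware, cross-account credentials before making this call on the user's behalf.

import boto3

# Assumes identity-aware credentials for the customer's Amazon Q Business application are already configured
qbusiness = boto3.client("qbusiness", region_name="us-east-1")

results = qbusiness.search_relevant_content(
    applicationId="APPLICATION_ID",  # customer's Amazon Q Business application (placeholder)
    contentSource={"retriever": {"retrieverId": "RETRIEVER_ID"}},  # placeholder retriever
    queryText="Runbook for checkout-service 5xx error spike",
    maxResults=5,
)

# Only documents the authenticated user is allowed to see are returned
for item in results.get("relevantContent", []):
    print(item.get("documentTitle"), item.get("documentUri"))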

Jacky Leybman, Principal Product Manager on PagerDuty, says,

“Our PagerDuty customers have asked for a one-stop shop from identifying critical issues to driving resolution. The integration of Amazon Q index with PagerDuty Advance represents a significant milestone, enabling us to provide customers with comprehensive insights, including runbooks and historical information stored in the enterprise environment, and help them resolve issues efficiently, resulting in up to 30% faster MTTR on average. Working with AWS to implement this integration has been a remarkably smooth experience, and we’re already seeing strong interest from numerous enterprise customers eager to test these capabilities. We are very excited to see how customers leverage these capabilities.”

Solution overview

The Amazon Q Business data accessor, a secure interface component that bridges enterprise applications with Amazon Q index, provides a simple and secure way for enterprises to allow PagerDuty Advance to access their Amazon Q index and return relevant answers to user queries.

The integration of PagerDuty Advance with Amazon Q index offers a robust incident management solution that uses enterprise data across multiple platforms. When a user requests information through Slack or Microsoft Teams, the PagerDuty Advance orchestrator processes the query, checking both the PagerDuty knowledge base for relevant incident data and the Amazon Q Business data accessor to search the Amazon Q index. The index can aggregate data from various enterprise systems like Slack, Salesforce, and Atlassian products using built-in Amazon Q Business connectors. The orchestrator uses generative AI to provide users with contextual, actionable insights directly within their communication platform. With this integration, teams can quickly access runbooks, ongoing issue details, and other critical information, enhancing incident response efficiency and reducing resolution times.

The following diagram depicts the overall solution integrating PagerDuty Advance and Amazon Q index.

Prerequisites

Before enabling the Amazon Q index integration on PagerDuty Advance, you need to have the following components and requirements in place:

  • Amazon Q Business set up with AWS IAM Identity Center for user authentication
  • Access to PagerDuty Advance
  • A valid AWS account with appropriate service access

With the Amazon Q Business data accessor, PagerDuty Advance seamlessly integrates with your Amazon Q index. Simply complete the basic configuration steps on both the Amazon Q Business and PagerDuty consoles to get started. For more information on how to set up an Amazon Q Business application, see the Amazon Q Business Activation Day workshop.

Add PagerDuty Advance as a data accessor

After creating an Amazon Q Business application with IAM Identity Center, administrators can configure PagerDuty as a data accessor through the Amazon Q Business console. Complete the following steps:

  1. On the Amazon Q Business console, choose Data accessors in the navigation pane.
  2. Choose Add data accessor.
  3. Choose PagerDuty Advance as your data accessor.
  4. For Accessor name, enter a name for your data accessor.
  5. For Data source access, configure your level of access.
    • You can select specific data sources from your Amazon Q index to be available through the data accessor. This makes it possible to control which content is surfaced in the ISV environment. You can use Amazon Q Business pre-built connectors to synchronize content from various systems. For more information, refer to Supported connectors.
  6.  For User access, specify which users or groups can access the Amazon Q index through the data accessor.
    • This option enables you to configure granular permissions for data accessor accessibility and manage organizational access controls.

For more information about data access, refer to Accessing a customer’s Amazon Q index as a data accessor using cross-account access.

After you have added the data accessor, the Amazon Q Business console displays configuration details that you need to share with PagerDuty Advance to complete the setup. Note down this information for the next step. Also, you can always come back to retrieve these values on the data accessor’s details page.

Configure Amazon Q for PagerDuty Advance

After PagerDuty has been configured as a data accessor, administrators can enable Amazon Q Business assistance on PagerDuty Advance. The following steps describe how to do it:

  1. On your PagerDuty page, go to Account Settings, then choose PagerDuty Advance.
    Amazon Q configuration on PagerDuty Advance
  2. Turn on Enable Amazon Q Business.
    Enable Amazon Q Business on PagerDuty Advance
  3. Choose Edit configuration values and enter the values you copied when enabling the data accessor in the previous step.
    Data accessor configuration values

Your setup is now complete!

Now you can go to the communication tool where PagerDuty Advance is available and start asking questions. For example, on Slack, you can use /pd amazonq <user query>, as shown in the following screenshot.

PDAdvance Q index demo screen

Clean up

When you’re done using this solution in a given environment, clean up the resources you created.

  1. On your PagerDuty page, go to Account Settings, then choose PagerDuty Advance, and turn off Enable Amazon Q Business.
  2. On the Amazon Q Business console, delete the PagerDuty data accessor from the Data accessors page. Deleting this data accessor deletes permissions and access to the data accessor for all users.
  3. Delete the Amazon Q Business application that you created as a prerequisite.
    • Navigate to the Amazon Q Business console.
    • Choose Applications on the left menu.
    • Select the application you created.
    • Choose Delete from under Actions to delete the application.

Deleting the Amazon Q Business application will remove the associated index and data source connectors, and prevent incurring additional costs.

Conclusion

The combination of PagerDuty Advance and Amazon Q index offers businesses an improved way to handle daily operations more effectively. By bringing together PagerDuty’s enterprise-grade incident management solutions with the smart search features of Amazon Q index, companies can now safely get specific answers and find relevant information that was previously scattered across different systems, while maintaining data ownership. This means faster problem-solving and better teamwork across the organization.

In this post, we explored how enterprises can use the integration between PagerDuty Advance and Amazon Q Business, allowing users to streamline their incident management processes and unlock valuable operational gains and insights. We demonstrated how organizations can set up this integration using an Amazon Q data accessor, so teams can access critical information across multiple systems securely and in a cost-effective manner.

Ready to level up your incident management and operational efficiency? Unlock the full potential of your enterprise’s operational intelligence today with the Amazon Q Business console, PagerDuty Advance documentation, and the integration implementation guide.


About the Authors

Jacky Leybman is a Principal Product Manager at PagerDuty, leading the development of PagerDuty Advance and AI Agents. With over 19 years of experience in technology and product management, Jacky specializes in leading Agile cross-functional teams to develop and launch innovative digital products. Based in Miami, Florida, Jacky brings extensive expertise in product strategy, team leadership, and artificial intelligence implementations.

Takeshi Kobayashi is a Senior AI/ML Solutions Architect within the Amazon Q Business team, responsible for developing advanced AI/ML solutions for enterprise customers. With over 14 years of experience at Amazon in AWS, AI/ML, and technology, Takeshi is dedicated to leveraging generative AI and AWS services to build innovative solutions that address customer needs. Based in Seattle, WA, Takeshi is passionate about pushing the boundaries of artificial intelligence and machine learning technologies.

Daniel Lopes is a Solutions Architect at AWS, where he partners with ISVs to architect solutions that align with their strategic objectives. He specializes in leveraging AWS services to help ISVs transform their product vision into reality, with particular expertise in event-driven architectures, serverless computing, and generative AI. Outside work, Daniel mentors his kids in video games and pop culture.

Read More

Innovate business logic by implementing return of control in Amazon Bedrock Agents

In the context of distributed systems and microservices architecture, orchestrating communication between diverse components presents significant challenges. However, with the launch of Amazon Bedrock Agents, the landscape is evolving, offering a simplified approach to agent creation and seamless integration of the return of control capability. In this post, we explore how Amazon Bedrock Agents revolutionizes agent creation and demonstrates the efficacy of the return of control capability in orchestrating complex interactions between multiple systems.

Amazon Bedrock Agents simplifies the creation, deployment, and management of agents in distributed systems. By using the power of AWS Lambda and AWS Step Functions, Amazon Bedrock Agents abstracts away the complexities of agent implementation, which means developers can focus on building robust and scalable applications without worrying about infrastructure management.

You can use agents in Amazon Bedrock in various scenarios where you need to handle the return of control to the user or the system. Use cases include conversational assistants, task automation, decision support systems, interactive tutorials and walkthroughs, and virtual assistants. In these use cases, the key aspect of the agents is their ability to handle the return of control to the user or the system. This allows for a more natural and responsive interaction, where the user feels in control of the process while still benefiting from the agent’s guidance and automation capabilities.

Solution overview

In this post, we demonstrate an automated personalized investment portfolio solution using Amazon Bedrock Agents. The solution calls a third-party API to fetch a user’s current investment portfolio. These holdings are then analyzed using foundation models (FMs) available on Amazon Bedrock to produce recommendations in line with the inputs provided by the end user, showcasing a return of control capability integrated with Amazon Bedrock Agents.

This solution uses a combination of synchronous data retrieval and generative AI to provide tailored investment recommendations that align with users’ specific financial goals and risk tolerance. By incorporating machine learning (ML) and simulation techniques, the system can generate personalized portfolios and assess their potential performance, making sure the recommended solutions are optimized for individual needs.

With Amazon Bedrock Agents, the return of control capability lets the application invoking the agent handle external functions and business logic at the application level instead of in a Lambda function. This way, an application can manage external interactions and return the response while the agent continues its orchestration. This is illustrated in the following diagram.

Illustration Diagram

The option to return control is particularly useful in two main scenarios:

  1. Calling an API from an existing application rather than building a new Lambda function with the required authentication and networking configurations
  2. Handling tasks that might run longer than 15 minutes and can’t be accommodated through a Lambda function, instead requiring containers, virtual servers, or workflow orchestration tools such as AWS Step Functions

The following sample code uses Amazon Bedrock Agents and handles return of control in the application code. With this feature, you can manage return of control in your backend services and simplify application integrations. To demonstrate this, we have the following four code snippets: external-bedrock-agent-api.py, streamlit-app-portfolio-recommender.py, Portfolio-Recommender-CFN-Template.yaml, and requirements.txt, along with detailed steps to replicate the scenario.

The external-bedrock-agent-api code implements a portfolio recommendation system using Amazon Bedrock Agents and Flask. Here’s a high-level overview of the functions used:

  • fetch_user_data: Processes user profile information such as risk tolerance or investment goals
  • generate_portfolios: Creates sample investment portfolios with different risk levels
  • fetch_custom_portfolio: Combines user data and portfolio generation
  • send_custom_portfolio_as_email: Sends portfolio recommendations by email using an Amazon Simple Email Service (Amazon SES) verified email identity
  • /sns-handler endpoint: This API endpoint receives POST requests with user investment preferences, processes the message containing user preference details, invokes the Amazon Bedrock agent to generate recommendations, and handles email communication of the recommendations

The streamlit-app-portfolio-recommender code is a Streamlit web application for investment portfolio recommendations. The code sets up the webpage with a title and configuration. The app collects several pieces of information through form elements:

  • Email address – Text input
  • Financial goal – Dropdown with options for retirement, wealth accumulation, and passive income
  • Risk tolerance – Dropdown with options for low, medium, and high
  • Investment horizon – Dropdown with options for short-term and long-term
  • Environmental, social, and governance (ESG) preference – Checkbox for applying ESG preferences
  • Email preference – Checkbox for receiving recommendations by email

The system operates through a portfolio generation function that sends POST requests to a local API endpoint. This function transforms user preferences into JSON data and returns either the API response or an error message to the user.

The results display process begins when the user clicks the Submit button, which triggers the custom_portfolio function with their specific inputs. The system then displays the portfolio recommendation in a text area for successful executions and immediately alerts the user with an error message if any issues occur during the process.
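The following is a minimal sketch of that pattern, not the exact contents of streamlit-app-portfolio-recommender.py; the endpoint path, port, and field names are assumptions for illustration.

import requests
import streamlit as st

st.title("Portfolio Recommender")

with st.form("preferences"):
    email = st.text_input("Email address")
    goal = st.selectbox("Financial goal", ["Retirement", "Wealth accumulation", "Passive income"])
    risk = st.selectbox("Risk tolerance", ["Low", "Medium", "High"])
    horizon = st.selectbox("Investment horizon", ["Short-term", "Long-term"])
    esg = st.checkbox("Apply ESG preferences")
    by_email = st.checkbox("Send recommendations by email")
    submitted = st.form_submit_button("Submit")

if submitted:
    payload = {
        "email": email,
        "financial_goal": goal,
        "risk_tolerance": risk,
        "investment_horizon": horizon,
        "esg_preference": esg,
        "send_email": by_email,
    }
    try:
        # Post the preferences to the local API started by external-bedrock-agent-api.py
        resp = requests.post("http://localhost:5000/sns-handler", json=payload, timeout=120)
        resp.raise_for_status()
        st.text_area("Portfolio recommendation", resp.text, height=300)
    except requests.RequestException as err:
        st.error(f"Request failed: {err}")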

Solution walkthrough

Follow these steps to set up the environment and test the application in the US East (N. Virginia) us-east-1 Region.

To enable Anthropic’s Claude model on Amazon Bedrock in your AWS account:

  1. On the Amazon Bedrock console, in the left navigation pane under Amazon Bedrock configurations, select Model access
  2. Select Claude 3 Sonnet, as shown in the following screenshot

  1. To create the Amazon Bedrock agents, related action groups, Amazon SageMaker AI domain, sample user profile, and JupyterLab space, launch the provided CloudFormation stack (Portfolio-Recommender-CFN-Template.yaml), then follow these steps:

  1. Select the checkbox to acknowledge that the template contains AWS Identity and Access Management (IAM) resources, as shown in the following screenshot

  1. Monitor AWS CloudFormation until it completes the resource creation process. You can verify the successful deployment on the stack’s Outputs tab, which will display the AgentId and AgentAliasId values, as shown in the following screenshot.

You will receive an email address verification request from AWS in the US East (N. Virginia) Region. Select the link in the email to verify your email address.

After creating your CloudFormation resources, follow these steps to access Amazon SageMaker Studio:

  1. On the Amazon SageMaker AI console, under Admin configurations in the left navigation pane, select Domains
  2. Select the bedrock-return-of-control-demo domain created by the CloudFormation template, as shown in the following screenshot

  1. Select the User profiles tab
  2. To open the SageMaker Studio environment, under User profiles, next to the sagemakeruser profile on the right, select Launch. From the dropdown menu, choose Studio, as shown in the following screenshot

You should now see the SageMaker Studio home page. This environment is where you will run the Python scripts to set up your application.

To access the JupyterLab environment for this lab, follow these steps:

  1. On the SageMaker Studio console, in the left navigation pane under Applications, select JupyterLab
  2. You’ll find the bedrock-agent-space application that has been preprovisioned for this lab. Its Status should be Stopped. On the right side, under Action, choose Run
  3. Within 30–40 seconds, the JupyterLab application status will change from Starting to Running

  1. When it’s running, under Action, choose Open, as shown in the following screenshot

Three required files are copied under the /home/sagemaker-user/scripts directory: two Python files (external-bedrock-agent-api and streamlit-app-portfolio-recommender) and one requirements.txt file, as shown in the following screenshot. The JupyterLab application environment is under the default directory.

  1. In the File menu, select New. In the dropdown menu, select Terminal to open a new terminal window, as shown in the following screenshot.
  2. In the terminal, go to the scripts directory that contains the required files and enter:
    pip install -r requirements.txt

  3. Enter the following command on the terminal:
    python3 external-bedrock-agent-api.py

  4. Open a new terminal and go to the /home/sagemaker-user/scripts directory and enter:
    streamlit run streamlit-app-portfolio-recommender.py

  5. From the command execution in the terminal, note the port number (8501) and the Studio URL from the browser. The URL will be in the format https://{domainid}.studio.{region}.sagemaker.aws/jupyterlab/default/lab/tree/scripts
  6. To access the Streamlit app, modify the Studio URL, replacing everything after default/ (that is, lab/tree/scripts) with proxy/[PORT NUMBER]/. The modified Streamlit UI URL will look like this: https://{domainid}.studio.{region}.sagemaker.aws/jupyterlab/default/proxy/8501/
  7. Select all appropriate inputs for generating your custom portfolio recommendation, providing the same email address that was verified earlier in this walkthrough. Choose whether you prefer to receive email notifications or inline recommendations through the application interface by checking the corresponding box, and then choose Submit.

The sample output and email response are shown in the following demo screenshot.

Cleanup

When you’re done, delete resources you no longer need to avoid ongoing costs. Follow these steps:

  1. Go to the SageMaker AI JupyterLab environment and stop the Amazon SageMaker Studio application or running instance
  2. Delete the resources created by deleting the CloudFormation stack.

The following screenshot demonstrates how to view and stop running instances in the SageMaker AI JupyterLab environment. For more information, refer to Delete a stack from the CloudFormation console.

Amazon Bedrock Agents return of control considerations

When implementing return of control, consider the following:

  • Return of control performance considerations – When implementing return of control, developers should focus on optimizing action execution times and response handling. Each action should be designed to complete within reasonable timeframes to maintain conversation flow. Consider implementing caching mechanisms for frequently accessed data and facilitate efficient state management between return of control cycles. The application should be designed to handle concurrent user sessions effectively while maintaining responsiveness.
  • Return of control limitations – Actions must be defined with clear input and output schemas. Each action should be atomic and focused on a specific task to maintain simplicity and reliability. Consider payload sizes for requests and responses because there might be size limitations. Actions execute sequentially, and the system needs to maintain conversation context throughout the interaction cycle.
  • Security recommendations – Security implementation requires proper authentication and authorization mechanisms for all actions, following the principle of least privilege when defining permissions. Input parameters must be validated before processing, with comprehensive error handling in place. Rate limiting and request validation should be implemented to prevent abuse, and sensitive data handling must comply with security requirements and include proper logging mechanisms for audit trails. Additionally, implement input filtering to prevent prompt injection attacks, configure response filters to protect sensitive information, and set up content scanning for both input and output. Deploy regex-based response filtering to help prevent personally identifiable information (PII) exposure and establish content moderation filters to block inappropriate content.
  • Monitoring and observability – Implement comprehensive logging for all action executions and responses. Monitor key metrics such as action execution times, success rates, and error rates. Set up alerts for abnormal patterns or failures. Use Amazon CloudWatch for monitoring system health and performance. Consider implementing tracing to track request flow through different components of your system. Regular review of metrics and logs helps identify potential issues and optimization opportunities.

Conclusion

In this post, we’ve demonstrated how Amazon Bedrock Agents simplifies agent creation and streamlines the orchestration of complex interactions between microservices using the return of control capability. By abstracting away infrastructure management and providing seamless integration with your application, Amazon Bedrock Agents empowers developers to build resilient and scalable applications with ease. As organizations embrace microservices architecture and distributed systems, tools such as Amazon Bedrock Agents play a pivotal role in accelerating innovation and driving digital transformation.


About the Authors


Vishwanatha Handadi
is a Sr. Solutions Architect within the Global Financial Services vertical at Amazon Web Services (AWS). He has been with AWS for over 2 years and has over 22 years of experience in the IT industry, primarily in data and analytics. At AWS, he drives customers through their cloud transformation journeys by converting complex challenges into actionable roadmaps for both technical and business audiences. He is based out of Bangalore, India.


Mohammed Asadulla Baig
is a Sr. Technical Account Manager with Amazon Web Services (AWS) Enterprise Support. Asad helps customers architect scalable, resilient, and secure solutions. With a keen eye for innovation and a passion for delivering customer success, Asad has established himself as a thought leader in the industry, helping enterprises navigate their cloud transformation journeys with confidence and ease.

Read More

Deploy Qwen models with Amazon Bedrock Custom Model Import

We’re excited to announce that Amazon Bedrock Custom Model Import now supports Qwen models. You can now import custom weights for Qwen2, Qwen2_VL, and Qwen2_5_VL architectures, including models like Qwen 2, Qwen 2.5 Coder, Qwen 2.5 VL, and QwQ 32B. You can bring your own customized Qwen models into Amazon Bedrock and deploy them in a fully managed, serverless environment—without having to manage infrastructure or model serving.

In this post, we cover how to deploy Qwen 2.5 models with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the AWS infrastructure at an effective cost.

Overview of Qwen models

Qwen 2 and 2.5 are families of large language models, available in a wide range of sizes and specialized variants to suit diverse needs:

  • General language models: Models ranging from 0.5B to 72B parameters, with both base and instruct versions for general-purpose tasks
  • Qwen 2.5-Coder: Specialized for code generation and completion
  • Qwen 2.5-Math: Focused on advanced mathematical reasoning
  • Qwen 2.5-VL (vision-language): Image and video processing capabilities, enabling multimodal applications

Overview of Amazon Bedrock Custom Model Import

Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing foundation models (FMs) through a single serverless, unified API. You can access your imported custom models on demand and without the need to manage the underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Amazon Bedrock tools and features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. Amazon Bedrock Custom Model Import is generally available in the US East (N. Virginia), US West (Oregon), and Europe (Frankfurt) AWS Regions.

Now, we’ll explore how you can use Qwen 2.5 models for two common use cases: as a coding assistant and for image understanding. Qwen2.5-Coder is a state-of-the-art code model, matching the capabilities of proprietary models like GPT-4o. It supports over 90 programming languages and excels at code generation, debugging, and reasoning. Qwen 2.5-VL brings advanced multimodal capabilities. According to Qwen, Qwen 2.5-VL is not only proficient at recognizing objects such as flowers and animals, but also at analyzing charts, extracting text from images, interpreting document layouts, and processing long videos.

Prerequisites

Before importing the Qwen model with Amazon Bedrock Custom Model Import, make sure that you have the following in place:

  1. An active AWS account
  2. An Amazon Simple Storage Service (Amazon S3) bucket to store the Qwen model files
  3. Sufficient permissions to create Amazon Bedrock model import jobs
  4. Verification that your Region supports Amazon Bedrock Custom Model Import

Use case 1: Qwen coding assistant

In this example, we demonstrate how to build a coding assistant using the Qwen2.5-Coder-7B-Instruct model.

  1. Go to Hugging Face, then search for and copy the Model ID Qwen/Qwen2.5-Coder-7B-Instruct:

You will use Qwen/Qwen2.5-Coder-7B-Instruct for the rest of the walkthrough. We don’t demonstrate fine-tuning steps, but you can also fine-tune before importing.

  1. Use the following command to download a snapshot of the model locally. The Python library for Hugging Face provides a utility called snapshot_download for this:
from huggingface_hub import snapshot_download

# Download the model weights and tokenizer files to a local directory
snapshot_download(repo_id="Qwen/Qwen2.5-Coder-7B-Instruct",
                  local_dir="./extractedmodel/")

Depending on your model size, this could take a few minutes. When completed, your Qwen Coder 7B model folder will contain the following files.

  • Configuration files: Including config.json, generation_config.json, tokenizer_config.json, tokenizer.json, and vocab.json
  • Model files: Four safetensor files and model.safetensors.index.json
  • Documentation: LICENSE, README.md, and merges.txt

  1. Upload the model to Amazon S3, using boto3 or the command line:

aws s3 cp ./extractedmodel s3://yourbucket/path/ --recursive

  1. Start the import model job using the following API call:
import boto3

bedrock_client = boto3.client("bedrock")  # control plane client for model import jobs

response = bedrock_client.create_model_import_job(
    jobName="uniquejobname",
    importedModelName="uniquemodelname",
    roleArn="fullrolearn",
    modelDataSource={
        's3DataSource': {
            's3Uri': "s3://yourbucket/path/"
        }
    }
)

You can also do this using the AWS Management Console for Amazon Bedrock.

  1. In the Amazon Bedrock console, choose Imported models in the navigation pane.
  2. Choose Import a model.

  1. Enter the details, including a Model name, Import job name, and model S3 location.

  1. Create a new service role or use an existing service role. Then choose Import model

  1. After you choose Import model on the console, you should see the status as Importing while the model is being imported:

If you’re using your own role, make sure you add the required trust relationship as described in Create a service role for model import.

After your model is imported, wait for model inference to be ready, and then chat with the model on the playground or through the API. In the following example, we prompt the model to directly output Python code that lists items in an S3 bucket. Remember to use the right chat template to input prompts in the format required. For example, you can get the right chat template for any compatible model on Hugging Face using the following code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# We don't invoke the model here; the tokenizer is only used to build the prompt
# tokenizer.apply_chat_template() formats the messages into the prompt string the model expects
prompt = "Write sample boto3 python code to list files in a bucket stored in the variable `my_bucket`"
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Note that when using the invoke_model API, you must use the full Amazon Resource Name (ARN) of the imported model. You can find the model ARN in the Amazon Bedrock console by navigating to the Imported models section and then viewing the Model details page, as shown in the following figure.

After the model is ready for inference, you can use the Chat playground in the Amazon Bedrock console or the APIs to invoke the model.
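The following sketch shows what such an API invocation can look like for the imported model, reusing the chat-templated prompt built earlier; the model ARN is a placeholder, and the request body fields mirror the ones used for the vision model later in this post.

import json

import boto3

client = boto3.client("bedrock-runtime")
model_arn = "<your-imported-model-arn>"  # copied from the Model details page

response = client.invoke_model(
    modelId=model_arn,
    body=json.dumps({
        "prompt": text,        # the string produced by tokenizer.apply_chat_template above
        "max_gen_len": 512,
        "temperature": 0.1,
        "top_p": 0.9,
    }),
    accept="application/json",
    contentType="application/json",
)
print(json.loads(response["body"].read()))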

Use case 2: Qwen 2.5 VL image understanding

Qwen2.5-VL-* offers multimodal capabilities, combining vision and language understanding in a single model. This section demonstrates how to deploy Qwen2.5-VL using Amazon Bedrock Custom Model Import and test its image understanding capabilities.

Import Qwen2.5-VL-7B to Amazon Bedrock

Download the model from Hugging Face and upload it to Amazon S3:

import os

from huggingface_hub import snapshot_download

hf_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
local_directory = "qwen2-5-vl-7b-instruct"  # local folder for the downloaded files

# Enable faster downloads (requires the hf_transfer package)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download the model locally, then upload the folder to Amazon S3 as before,
# for example: aws s3 cp ./qwen2-5-vl-7b-instruct s3://yourbucket/path/ --recursive
snapshot_download(repo_id=hf_model_id, local_dir=f"./{local_directory}")

Next, import the model to Amazon Bedrock (either via Console or API):

import boto3

bedrock = boto3.client("bedrock")  # control plane client for model import jobs

response = bedrock.create_model_import_job(
    jobName=job_name,                       # unique job name
    importedModelName=imported_model_name,  # name for the imported model in Amazon Bedrock
    roleArn=role_arn,                       # IAM role with access to the S3 location
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_uri                 # S3 location of the uploaded model files
        }
    }
)

Test the vision capabilities

After the import is complete, test the model with an image input. The Qwen2.5-VL-* model requires proper formatting of multimodal inputs:

import json

import boto3
from transformers import AutoProcessor

# Runtime client and the full ARN of the imported model (placeholder value)
client = boto3.client("bedrock-runtime")
model_id = "<your-imported-model-arn>"

def generate_vl(messages, image_base64, temperature=0.3, max_tokens=4096, top_p=0.9):
    # Build the prompt with the processor's chat template
    processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({
            'prompt': prompt,
            'temperature': temperature,
            'max_gen_len': max_tokens,
            'top_p': top_p,
            'images': [image_base64]
        }),
        accept='application/json',
        contentType='application/json'
    )
    
    return json.loads(response['body'].read().decode('utf-8'))

# Using the model with an image; image_to_base64 is assumed to be a small helper
# that reads the file and returns its base64-encoded contents
file_path = "cat_image.jpg"
base64_data = image_to_base64(file_path)

messages = [
    {
        "role": "user",
        "content": [
            {"image": base64_data},
            {"text": "Describe this image."}
        ]
    }
]

response = generate_vl(messages, base64_data)

# Print response
print("Model Response:")
if 'choices' in response:
    print(response['choices'][0]['text'])
elif 'outputs' in response:
    print(response['outputs'][0]['text'])
else:
    print(response)
    

When provided with an example image of a cat (such as the following image), the model accurately describes key features such as the cat’s position, fur color, eye color, and general appearance. This demonstrates the Qwen2.5-VL model’s ability to process visual information and generate relevant text descriptions.

The model’s response:

This image features a close-up of a cat lying down on a soft, textured surface, likely a couch or a bed. The cat has a tabby coat with a mix of dark and light brown fur, and its eyes are a striking green with vertical pupils, giving it a captivating look. The cat's whiskers are prominent and extend outward from its face, adding to the detailed texture of the image. The background is softly blurred, suggesting a cozy indoor setting with some furniture and possibly a window letting in natural light. The overall atmosphere of the image is warm and serene, highlighting the cat's relaxed and content demeanor. 

Pricing

You can use Amazon Bedrock Custom Model Import to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn’t charge for model import. You are charged for inference based on two factors: the number of active model copies and their duration of activity. Billing occurs in 5-minute increments, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The custom model units required for hosting depend on the model’s architecture, parameter count, and context length.

Amazon Bedrock automatically manages scaling based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales back up when needed, though this might involve cold-start latency of up to a minute. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.

For more information, see Amazon Bedrock pricing.

Clean up

To avoid ongoing charges after completing the experiments:

  1. Delete your imported Qwen models from Amazon Bedrock Custom Model Import using the console or the API.
  2. Optionally, delete the model files from your S3 bucket if you no longer need them.

Remember that while Amazon Bedrock Custom Model Import doesn’t charge for the import process itself, you are billed for model inference usage and storage.

Conclusion

Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like Qwen 2.5, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of Qwen 2.5’s advanced AI capabilities and Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.

For more information, refer to the Amazon Bedrock User Guide.


About the Authors

Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in Product Management, Engineering, and Go-To-Market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing Generative AI technologies and driving real-world impact with Generative AI.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Dharinee Gupta is an Engineering Manager at AWS Bedrock, where she focuses on enabling customers to seamlessly utilize open source models through serverless solutions. Her team specializes in optimizing these models to deliver the best cost-performance balance for customers. Prior to her current role, she gained extensive experience in authentication and authorization systems at Amazon, developing secure access solutions for Amazon offerings. Dharinee is passionate about making advanced AI technologies accessible and efficient for AWS customers.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

June Won is a Principal Product Manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Read More

Build generative AI solutions with Amazon Bedrock

Generative AI is revolutionizing how businesses operate, interact with customers, and innovate. If you’re embarking on the journey to build a generative AI-powered solution, you might wonder how to navigate the complexities involved from selecting the right models to managing prompts and enforcing data privacy.

In this post, we show you how to build generative AI applications on Amazon Web Services (AWS) using the capabilities of Amazon Bedrock, highlighting how Amazon Bedrock can be used at each step of your generative AI journey. This guide is valuable for both experienced AI engineers and newcomers to the generative AI space, helping you use Amazon Bedrock to its fullest potential.

Amazon Bedrock is a fully managed service that provides a unified API to access a wide range of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Mistral AI, AI21 Labs, Stability AI, and Amazon. It offers a robust set of tools and features designed to help you build generative AI applications efficiently while adhering to best practices in security, privacy, and responsible AI.

Calling an LLM with an API

You want to integrate a generative AI feature into your application through a straightforward, single-turn interaction with a large language model (LLM). Perhaps you need to generate text, answer a question, or provide a summary based on user input. Amazon Bedrock simplifies generative AI application development and scaling through a unified API for accessing diverse, leading FMs. With support for Amazon models and leading AI providers, you have the freedom to experiment without being locked into a single model or provider. With the rapid pace of development in AI, you can seamlessly switch models for optimized performance with no application rewrite required.

Beyond direct model access, Amazon Bedrock expands your options with the Amazon Bedrock Marketplace. This marketplace gives you access to over 100 specialized FMs; you can discover, test, and integrate new capabilities all through fully managed endpoints. Whether you need the latest innovation in text generation, image synthesis, or domain-specific AI, Amazon Bedrock provides the flexibility to adapt and scale your solution with ease.

With one API, you stay agile and can effortlessly switch between models, upgrade to the latest versions, and future-proof your generative AI applications with minimal code changes. To summarize, Amazon Bedrock offers the following benefits:

  • Simplicity: No need to manage infrastructure or deal with multiple APIs
  • Flexibility: Experiment with different models to find the best fit
  • Scalability: Scale your application without worrying about underlying resources

To get started, use the Chat or Text playground to experiment with different FMs, and use the Converse API to integrate FMs into your application.
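As a quick illustration, a single-turn Converse API call looks roughly like the following; the model ID is one possible choice, and some models require an inference profile ID instead.

import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="amazon.nova-micro-v1:0",  # swap in any supported FM or inference profile ID
    messages=[{"role": "user", "content": [{"text": "Summarize the benefits of a unified model API in two sentences."}]}],
    inferenceConfig={"maxTokens": 300, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])

Because the request shape stays the same across models, switching providers is typically just a change to the modelId value.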

After you’ve integrated a basic LLM feature, the next step is optimizing the performance and making sure you’re using the right model for your requirements. This brings us to the importance of evaluating and comparing models.

Choosing the right model for your use case

Selecting the right FM for your use case is crucial, but with so many options available, how do you know which one will give you the best performance for your application? Whether it’s for generating more relevant responses, summarizing information, or handling nuanced queries, choosing the best model is key to providing optimal performance.

You can use Amazon Bedrock model evaluation to rigorously test different FMs to find the one that delivers the best results for your use case. Whether you’re in the early stages of development or preparing for launch, selecting the right model can make a significant difference in the effectiveness of your generative AI solutions.

The model evaluation process consists of the following components:

  • Automatic and human evaluation: Begin by experimenting with different models using automated evaluation metrics like accuracy, robustness, or toxicity. You can also bring in human evaluators to measure more subjective factors, such as friendliness, style, or how well the model aligns with your brand voice.
  • Custom datasets and metrics: Evaluate the performance of models using your own datasets or pre-built options. Customize the metrics that matter most for your project, making sure the selected model aligns with your business or operational goals.
  • Iterative feedback: Throughout the development process, run evaluations iteratively, allowing for faster refinement. This helps you compare models side by side, so you can make a data-driven decision when selecting the FM that fits your use case.

Imagine you’re building a customer support AI assistant for an ecommerce service. You can use model evaluation to test multiple FMs with real customer queries, evaluating which model provides the most accurate, friendly, and contextually appropriate responses. By comparing models side by side, you can choose the model that will deliver the best possible user experience for your customers.

After you’ve evaluated and selected the ideal model, the next step is making sure it aligns with your business needs. Off-the-shelf models might perform well, but for a truly tailored experience, you need more customization. This leads to the next important step in your generative AI journey: personalizing models to reflect your business context. You need to make sure the model generates the most accurate and contextually relevant responses. Even the best FMs won’t have access to the latest or domain-specific information critical to your business. To solve this, the model needs to use your proprietary data sources, making sure its outputs reflect the most up-to-date and relevant information. This is where you can use Retrieval Augmented Generation (RAG) to enrich the model’s responses by incorporating your organization’s unique knowledge base.

Enriching model responses with your proprietary data

A publicly available LLM might perform well on general knowledge tasks, but struggle with outdated information or lack context from your organization’s proprietary data. You need a way to provide the model with the most relevant, up-to-date insights to provide accuracy and contextual depth. There are two key approaches that you can use to enrich model responses:

  • RAG: Use RAG to dynamically retrieve relevant information at query time, enriching model responses without requiring retraining
  • Fine-tuning: Use fine-tuning to customize your chosen model by training it on proprietary data, improving its ability to handle organization-specific tasks or domain knowledge

We recommend starting with RAG because it is flexible and straightforward to implement. You can then fine-tune the model for deeper domain adaptation if needed. RAG dynamically retrieves relevant information at query time, making sure model responses stay accurate and context aware. In this approach, data is first processed and indexed in a vector database or similar retrieval system. When a user submits a query, Amazon Bedrock searches this indexed data to find relevant context, which is injected into the prompt. The model then generates a response based on both the original query and the retrieved insights, without requiring additional training.

Amazon Bedrock Knowledge Bases automates the RAG pipeline—including data ingestion, retrieval, prompt augmentation, and citations—reducing the complexity of setting up custom integrations. By seamlessly integrating proprietary data, you can make sure that the models generate accurate, contextually rich, and continuously updated responses.
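As a rough sketch of that flow, a single RetrieveAndGenerate call handles retrieval, prompt augmentation, and generation; the knowledge base ID and model ARN below are placeholders.

import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What is our refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<your-knowledge-base-id>",
            "modelArn": "<generation-model-arn>",
        },
    },
)
# Grounded answer; response["citations"] carries the source attributions
print(response["output"]["text"])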

Bedrock Knowledge Bases supports various data types to tailor AI-generated responses to business-specific needs:

  • Unstructured data: Extract insights from text-heavy sources like documents, PDFs, and emails
  • Structured data: Enable natural language queries on databases, data lakes, and warehouses without moving or preprocessing data
  • Multimodal data: Process both text and visual elements in documents and images using Amazon Bedrock Data Automation
  • GraphRAG: Enhance knowledge retrieval with graph-based relationships, enabling AI to understand entity connections for more context-aware responses

With these capabilities, Amazon Bedrock reduces data silos, making it straightforward to enrich AI applications with both real-time and historical knowledge. Whether working with text, images, structured datasets, or interconnected knowledge graphs, Amazon Bedrock provides a fully managed, scalable solution without the need for complex infrastructure. To summarize, using RAG with Amazon Bedrock offers the following benefits:

  • Up-to-date information: Responses include the latest data from your knowledge bases
  • Accuracy: Reduces the risk of incorrect or irrelevant answers
  • No extra infrastructure: You can avoid setting up and managing your own vector databases or custom integrations

When your model is pulling from the most accurate and relevant data, you might find that its general behavior still needs some refinement, perhaps in its tone, style, or understanding of industry-specific language. This is where you can further fine-tune the model to align it even more closely with your business needs.

Tailoring models to your business needs

Out-of-the-box FMs provide a strong starting point, but they often lack the precision, brand voice, or industry-specific expertise required for real-world applications. Maybe the language doesn’t align with your brand, or the model struggles with specialized terminology. You might have experimented with prompt engineering and RAG to enhance responses with additional context. Although these techniques help, they have limitations (for example, longer prompts can increase latency and cost), and models might still lack deep domain expertise needed for domain-specific tasks. To fully harness generative AI, businesses need a way to securely adapt models, making sure AI-generated responses are not only accurate but also relevant, reliable, and aligned with business goals.

Amazon Bedrock simplifies model customization, enabling businesses to fine-tune FMs with proprietary data without building models from scratch or managing complex infrastructure.

Rather than retraining an entire model, Amazon Bedrock provides a fully managed fine-tuning process that creates a private copy of the base FM. This makes sure your proprietary data remains confidential and isn’t used to train the original model. Amazon Bedrock offers two powerful techniques to help businesses refine models efficiently:

  • Fine-tuning: You can train an FM with labeled datasets to improve accuracy in industry-specific terminology, brand voice, and company workflows. This allows the model to generate more precise, context-aware responses without relying on complex prompts.
  • Continued pre-training: If you have unlabeled domain-specific data, you can use continued pre-training to further train an FM on specialized industry knowledge without manual labeling. This approach is especially useful for regulatory compliance, domain-specific jargon, or evolving business operations.
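Both techniques are submitted through the same customization API. The following is a minimal sketch under assumed names: the role ARN, base model identifier, S3 paths, and hyperparameter values are placeholders, and the available hyperparameters vary by model.

import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_model_customization_job(
    jobName="brand-voice-finetune-001",
    customModelName="my-brand-voice-model",
    roleArn="<service-role-arn>",
    baseModelIdentifier="<base-model-id>",
    customizationType="FINE_TUNING",  # or "CONTINUED_PRE_TRAINING" for unlabeled data
    trainingDataConfig={"s3Uri": "s3://<your-bucket>/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://<your-bucket>/customization-output/"},
    hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
)
print(response["jobArn"])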

By combining fine-tuning for core domain expertise with RAG for real-time knowledge retrieval, businesses can create highly specialized AI models that stay accurate and adaptable, and make sure the style of responses align with business goals. To summarize, Amazon Bedrock offers the following benefits:

  • Privacy-preserved customization: Fine-tune models securely while making sure that your proprietary data remains private
  • Efficiency: Achieve high accuracy and domain relevance without the complexity of building models from scratch

As your project evolves, managing and optimizing prompts becomes critical, especially when dealing with different iterations or testing multiple prompt versions. The next step is refining your prompts to maximize model performance.

Managing and optimizing prompts

As your AI projects scale, managing multiple prompts efficiently becomes a growing challenge. Tracking versions, collaborating with teams, and testing variations can quickly become complex. Without a structured approach, prompt management can slow down innovation, increase costs, and make iteration cumbersome. Optimizing a prompt for one FM doesn’t always translate well to another. A prompt that performs well with one FM might produce inconsistent or suboptimal outputs with another, requiring significant rework. This makes switching between models time-consuming and inefficient, limiting your ability to experiment with different AI capabilities effectively. Without a centralized way to manage, test, and refine prompts, AI development becomes slower, more costly, and less adaptable to evolving business needs.

Amazon Bedrock simplifies prompt engineering with Amazon Bedrock Prompt Management, an integrated system that helps teams create, refine, version, and share prompts effortlessly. Instead of manually adjusting prompts for months, Amazon Bedrock accelerates experimentation and enhances response quality without additional code. Bedrock Prompt Management introduces the following capabilities:

  • Versioning and collaboration: Manage prompt iterations in a shared workspace, so teams can track changes and reuse optimized prompts.
  • Side-by-side testing: Compare up to two prompt variations simultaneously to analyze model behavior and identify the most effective format.
  • Automated prompt optimization: Fine-tune and rewrite prompts based on the selected FM to improve response quality. You can select a model, apply optimization, and generate a more accurate, contextually relevant prompt.

Bedrock Prompt Management offers the following benefits:

  • Efficiency: Quickly iterate and optimize prompts without writing additional code
  • Teamwork: Enhance collaboration with shared access and version control
  • Insightful testing: Identify which prompts perform best for your use case

After you’ve optimized your prompts for the best results, the next challenge is optimizing your application for cost and latency by choosing the most appropriate model within a family for a given task. This is where intelligent prompt routing can help.

Optimizing efficiency with intelligent model selection

Not all prompts require the same level of AI processing. Some are straightforward and need fast responses, whereas others require deeper reasoning and more computational power. Using high-performance models for every request increases costs and latency, even when a lighter, faster model could generate an equally effective response. At the same time, relying solely on smaller models might reduce accuracy for complex queries. Without an automated approach, businesses must manually determine which model to use for each request, leading to higher costs, inefficiencies, and slower development cycles.

Amazon Bedrock Intelligent Prompt Routing optimizes AI performance and cost by dynamically selecting the most appropriate FM for each request. Instead of manually choosing a model, Amazon Bedrock automates model selection within a model family, making sure that each prompt is routed to the best-performing model for its complexity. Bedrock Intelligent Prompt Routing offers the following capabilities:

  • Adaptive model routing: Automatically directs simple prompts to lightweight models and complex queries to more advanced models, providing the right balance between speed and efficiency
  • Performance balance: Makes sure that you use high-performance models only when necessary, reducing AI inference costs by up to 30%
  • Effortless integration: Automatically selects the right model within a family, simplifying deployment
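In practice, using Intelligent Prompt Routing is the same Converse (or InvokeModel) call with a prompt router ARN in place of the model ID; the ARN below is a placeholder.

import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    # A prompt router ARN is passed exactly where a model ID would go
    modelId="arn:aws:bedrock:<region>:<account-id>:default-prompt-router/<router-name>",
    messages=[{"role": "user", "content": [{"text": "Give a one-line definition of compound interest."}]}],
)
print(response["output"]["message"]["content"][0]["text"])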

By automating model selection, Amazon Bedrock removes the need for manual decision-making, reduces operational overhead, and makes sure AI applications run efficiently at scale. With Amazon Bedrock Intelligent Prompt Routing, each query is processed by the most efficient model, delivering speed, cost savings, and high-quality responses. The next step in optimizing AI efficiency is reducing redundant computations in frequently used prompts. Many AI applications require maintaining context across multiple interactions, which can lead to performance bottlenecks, increased costs, and unnecessary processing overhead.

Reducing redundant processing for faster responses

As your generative AI applications scale, efficiency becomes just as critical as accuracy. Applications that repeatedly use the same context—such as document Q&A systems (where users ask multiple questions about the same document) or coding assistants that maintain context about code files—often face performance bottlenecks and rising costs because of redundant processing. Each time a query includes long, static context, models reprocess unchanged information, leading to increased latency as models repeatedly analyze the same content and unnecessary token usage inflates compute expenses. To keep AI applications fast, cost-effective, and scalable, optimizing how prompts are reused and processed is essential.

Amazon Bedrock Prompt Caching enhances efficiency by storing frequently used portions of prompts—reducing redundant computations and improving response times. It offers the following benefits:

  • Faster processing: Skips unnecessary recomputation of cached prompt prefixes, boosting overall throughput
  • Lower latency: Reduces processing time for long, repetitive prompts, delivering a smoother user experience, and reducing latency by up to 85% for supported models
  • Cost-efficiency: Minimizes compute resource usage by avoiding repeated token processing, reducing costs by up to 90%

With prompt caching, AI applications respond faster, reduce operational costs, and scale efficiently while maintaining high performance. With Bedrock Prompt Caching providing faster responses and cost-efficiency, the next step is enabling AI applications to move beyond static prompt-response interactions. This is where agentic AI comes in, empowering applications to dynamically orchestrate multistep processes, automate decision-making, and drive intelligent workflows.

Automating multistep tasks with agentic AI

As AI applications grow more sophisticated, automating complex, multistep tasks becomes essential. You need a solution that can interact with internal systems, APIs, and databases to execute intricate workflows autonomously. The goal is to reduce manual intervention, improve efficiency, and create more dynamic, intelligent applications. Traditional AI models are reactive; they generate responses based on inputs but lack the ability to plan and execute multistep tasks. Agentic AI refers to AI systems that act with autonomy, breaking down complex tasks into logical steps, making decisions, and executing actions without constant human input. Unlike traditional models that only respond to prompts, agentic AI models have the following capabilities:

  • Autonomous planning and execution: Breaks complex tasks into smaller steps, makes decisions, and plans actions to complete the workflow
  • Chaining capabilities: Handles sequences of actions based on a single request, enabling the AI to manage intricate tasks that would otherwise require manual intervention or multiple interactions
  • Interaction with APIs and systems: Connects to your enterprise systems and automatically invokes necessary APIs or databases to fetch or update data

Amazon Bedrock Agents enables AI-powered task automation by using FMs to plan, orchestrate, and execute workflows. With a fully managed orchestration layer, Amazon Bedrock simplifies the process of deploying, scaling, and managing AI agents. Bedrock Agents offers the following benefits:

  • Task orchestration: Uses FMs’ reasoning capabilities to break down tasks, plan execution, and manage dependencies
  • API integration: Automatically calls APIs within enterprise systems to interact with business applications
  • Memory retention: Maintains context across interactions, allowing agents to remember previous steps, providing a seamless user experience

When a task requires multiple specialized agents, Amazon Bedrock supports multi-agent collaboration, making sure agents work together efficiently while alleviating manual orchestration overhead. This unlocks the following capabilities:

  • Supervisor-agent coordination: A supervisor agent delegates tasks to specialized subagents, providing optimal distribution of workloads
  • Efficient task execution: Supports parallel task execution, enabling faster processing and improved accuracy
  • Flexible collaboration modes: You can choose between the following modes:
    • Fully orchestrated supervisor mode: A central agent manages the full workflow, providing seamless coordination
    • Routing mode: Basic tasks bypass the supervisor and go directly to subagents, reducing unnecessary orchestration
  • Seamless integration: Works with enterprise APIs and internal knowledge bases, making it straightforward to automate business operations across multiple domains

By using multi-agent collaboration, you can increase task success rates, reduce execution time, and improve accuracy, making AI-driven automation more effective for real-world, complex workflows. To summarize, agentic AI offers the following benefits:

  • Automation: Reduces manual intervention in complex processes
  • Flexibility: Agents can adapt to changing requirements or gather additional information as needed
  • Transparency: You can use the trace capability to debug and optimize agent behavior

Although automating tasks with agents can streamline operations, handling sensitive information and enforcing privacy is paramount, especially when interacting with user data and internal systems. As your application grows more sophisticated, so do the security and compliance challenges.

Maintaining security, privacy, and responsible AI practices

As you integrate generative AI into your business, security, privacy, and compliance become critical concerns. AI-generated responses must be safe, reliable, and aligned with your organization’s policies to help avoid violating brand guidelines or regulatory requirements, and they must not include inaccurate or misleading content.

Amazon Bedrock Guardrails provides a comprehensive framework to enhance security, privacy, and accuracy in AI-generated outputs. With built-in safeguards, you can enforce policies, filter content, and improve trustworthiness in AI interactions. Bedrock Guardrails offers the following capabilities:

  • Content filtering: Block undesirable topics and harmful content in user inputs and model responses.
  • Privacy protection: Detect and redact sensitive information like personally identifiable information (PII) and confidential data to help prevent data leaks.
  • Custom policies: Define organization-specific rules to make sure AI-generated content aligns with internal policies and brand guidelines.
  • Hallucination detection: Identify and filter out responses not grounded in your data sources through the following capabilities:
    • Contextual grounding checks: Make sure model responses are factually correct and relevant by validating them against enterprise data sources. Detect hallucinations when outputs contain unverified or irrelevant information.
    • Automated reasoning for accuracy: Moves AI outputs beyond “trust me” to “prove it” by applying mathematically sound logic and structured reasoning to verify factual correctness.
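A guardrail created in the console or through the API can then be attached to inference calls. The following is a minimal sketch; the guardrail identifier, version, and model ID are placeholders.

import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="amazon.nova-micro-v1:0",
    messages=[{"role": "user", "content": [{"text": "Share the customer's full credit card number."}]}],
    guardrailConfig={
        "guardrailIdentifier": "<your-guardrail-id>",
        "guardrailVersion": "1",
    },
)
# If the guardrail intervenes, stopReason is "guardrail_intervened" and the
# configured blocked message is returned instead of the model output
print(response["stopReason"])
print(response["output"]["message"]["content"][0]["text"])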

With security and privacy measures in place, your AI solution is not only powerful but also responsible. However, if you’ve already made significant investments in custom models, the next step is to integrate them seamlessly into Amazon Bedrock.

Using existing custom models with Amazon Bedrock Custom Model Import

Use Amazon Bedrock Custom Model Import if you’ve already invested in custom models developed outside of Amazon Bedrock and want to integrate them into your new generative AI solution without managing additional infrastructure.

Bedrock Custom Model Import includes the following capabilities:

  • Seamless integration: Import your custom models into Amazon Bedrock
  • Unified API access: Interact with models—both base and custom—through the same API
  • Operational efficiency: Let Amazon Bedrock handle the model lifecycle and infrastructure management

Bedrock Custom Model Import offers the following benefits:

  • Cost savings: Maximize the value of your existing models
  • Simplified management: Reduce overhead by consolidating model operations
  • Consistency: Maintain a unified development experience across models

By importing custom models, you can use your prior investments. To truly unlock the potential of your models and prompt structures, you can automate more complex workflows, combining multiple prompts and integrating with other AWS services.

Automating workflows with Amazon Bedrock Flows

You need to build complex workflows that involve multiple prompts and integrate with other AWS services or business logic, but you want to avoid extensive coding.

Amazon Bedrock Flows has the following capabilities:

  • Visual builder: Drag-and-drop components to create workflows
  • Workflow automation: Link prompts with AWS services and automate sequences
  • Testing and versioning: Test flows directly in the console and manage versions

Amazon Bedrock Flows offers the following benefits:

  • No-code solution: Build workflows without writing code
  • Speed: Accelerate development and deployment of complex applications
  • Collaboration: Share and manage workflows within your team

With workflows now automated and optimized, you’re nearly ready to deploy your generative AI-powered solution. The final stage is making sure that your generative AI solution can scale efficiently and maintain high performance as demand grows.

Monitoring and logging to close the loop on AI operations

As you prepare to move your generative AI application into production, it’s critical to implement robust logging and observability to monitor system health, verify compliance, and quickly troubleshoot issues. Amazon Bedrock offers built-in observability capabilities that integrate seamlessly with AWS monitoring tools, enabling teams to track performance, understand usage patterns, and maintain operational control:

  • Model invocation logging: You can enable detailed logging of model invocations, capturing input prompts and output responses. These logs can be streamed to Amazon CloudWatch or Amazon Simple Storage Service (Amazon S3) for real-time monitoring or long-term analysis. Logging is configurable through the AWS Management Console or the API (see the sketch at the end of this section).
  • CloudWatch metrics: Amazon Bedrock provides rich operational metrics out-of-the-box, including:
    • Invocation count
    • Token usage (input/output)
    • Response latency
    • Error rates (for example, invalid input and model failures)

These capabilities are essential for running generative AI solutions at scale with confidence. By using CloudWatch, you gain visibility across the full AI pipeline, from input prompts to model behavior, making it straightforward to maintain uptime, performance, and compliance as your application grows.
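As one example, model invocation logging can be enabled programmatically as well as in the console; the log group, role, and bucket names below are placeholders.

import boto3

bedrock = boto3.client("bedrock")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/model-invocations",
            "roleArn": "<role-that-can-write-logs>",
        },
        "s3Config": {
            "bucketName": "<your-logging-bucket>",
            "keyPrefix": "bedrock-invocation-logs/",
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)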

Finalizing and scaling your generative AI solution

You’re ready to deploy your generative AI application and need to scale it efficiently while providing reliable performance. Whether you’re handling unpredictable workloads, enhancing resilience, or needing consistent throughput, you must choose the right scaling approach. Amazon Bedrock offers three flexible scaling options that you can use to tailor your infrastructure to your workload needs:

  • On-demand: Start with the flexibility of on-demand scaling, where you pay only for what you use. This option is ideal for early-stage deployments or applications with variable or unpredictable traffic. It offers the following benefits:
    • No commitments.
    • Pay only for tokens processed (input/output).
    • Great for dynamic or fluctuating workloads.
  • Cross-Region inference: When your traffic grows or becomes unpredictable, you can use cross-Region inference to handle bursts by distributing compute across multiple AWS Regions, enhancing availability without additional cost. It offers the following benefits:
    • Up to two times larger burst capacity.
    • Improved resilience and availability.
    • No additional charges; you pay the same price as in your primary Region.
  • Provisioned Throughput: For large, consistent workloads, Provisioned Throughput maintains a fixed level of performance. This option is perfect when you need predictable throughput, particularly for custom models. It offers the following benefits:
    • Consistent performance for high-demand applications.
    • Required for custom models.
    • Flexible commitment terms (1 month or 6 months).

Conclusion

Building generative AI solutions is a multifaceted process that requires careful consideration at every stage. Amazon Bedrock simplifies this journey by providing a unified service that supports each phase, from model selection and customization to deployment and compliance.

Amazon Bedrock offers a comprehensive suite of features that you can use to streamline and enhance your generative AI development process. By using its unified tools and APIs, you can significantly reduce complexity, enabling accelerated development and smoother workflows. Collaboration becomes more efficient because team members can work seamlessly across different stages, fostering a more cohesive and productive environment. Additionally, Amazon Bedrock integrates robust security and privacy measures, helping to ensure that your solutions meet industry and organization requirements. Finally, you can use its scalable infrastructure to bring your generative AI solutions to production faster while minimizing overhead.

Amazon Bedrock stands out as a one-stop solution that you can use to build sophisticated, secure, and scalable generative AI applications. Its extensive capabilities alleviate the need for multiple vendors and tools, streamlining your workflow and enhancing productivity.

Explore Amazon Bedrock and discover how you can use its features to support your needs at every stage of generative AI development. To learn more, see the Amazon Bedrock User Guide.


About the authors

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services, driving AI-led transformation across North America’s FinTech sector. He partners with organizations to design and execute cloud and AI strategies that speed up innovation and deliver measurable business impact. His work has consistently translated into millions in value through enhanced efficiency and additional revenue streams. With deep expertise in AI/ML, Generative AI, and cloud-native architectures, Sajjan enables financial institutions to achieve scalable, data-driven outcomes. When not architecting the future of finance, he enjoys traveling and spending time with family. Connect with him on LinkedIn.

Axel Larsson is a Principal Solutions Architect at AWS based in the greater New York City area. He supports FinTech customers and is passionate about helping them transform their business through cloud and AI technology. Outside of work, he is an avid tinkerer and enjoys experimenting with home automation.

Read More

How Netsertive built a scalable AI assistant to extract meaningful insights from real-time data using Amazon Bedrock and Amazon Nova

This post was co-written with Herb Brittner from Netsertive.

Netsertive is a leading digital marketing solutions provider for multi-location brands and franchises, helping businesses maximize local advertising, improve engagement, and gain deep customer insights.

With growing demand for more actionable insights from their customer call tracking data, Netsertive needed a solution that could unlock business intelligence from every call, making it easier for franchises to improve customer service and boost conversion rates. The team was looking for a single, flexible system that could do several things:

  • Understand phone calls – Automatically create summaries of what was discussed
  • Gauge customer feelings – Determine if the caller was happy, upset, or neutral
  • Identify important topics – Pull out keywords related to frequent services, questions, problems, and mentions of competitors
  • Improve agent performance – Offer advice and suggestions for coaching
  • Track performance over time – Generate reports on trends for individual locations, regions, and the entire country

Crucially, this new system needed to work smoothly with their existing Multi-Location Experience (MLX) platform. The MLX platform is specifically designed for businesses with many locations and helps them manage both national and local marketing. It allows them to run campaigns across various online channels, including search engines, social media, display ads, videos, connected TVs, and online reviews, as well as manage SEO, business listings, reviews, social media posting, and individual location web pages.

In this post, we show how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova, to bring their next generation of the platform to life.

Solution overview

Operating a comprehensive digital marketing solution, Netsertive handles campaign execution while providing key success metrics through their Insights Manager product. The platform features location-specific content management capabilities and robust lead capture functionality, collecting data from multiple sources, including paid campaigns, organic website traffic, and attribution forms. With CRM integration and call tracking features, MLX creates a seamless flow of customer data and marketing insights. This combination of managed services, automated tools, and analytics makes MLX a single source of truth for businesses seeking to optimize their digital marketing efforts while taking advantage of Netsertive’s expertise in campaign management.

To provide more actionable insights on the platform from customer call tracking data, Netsertive considered various solutions. After evaluating different tools and models, they decided to use Amazon Bedrock and the Amazon Nova Micro model. This choice was driven by the API-driven approach of Amazon Bedrock, its wide selection of large language models (LLMs), and the performance of the Amazon Nova Micro model specifically. They selected Amazon Nova Micro based on its ability to deliver fast response times at a low cost, while providing consistent and intelligent insights—key factors for Netsertive. With its generation speed of over 200 tokens per second and highly performant language understanding skills, this text-only model proved ideal for Netsertive. The following diagram shows how the MLX platform receives phone calls and uses Amazon Nova Micro in Amazon Bedrock to process them in real time.

AWS architecture for Netsertive showcasing EKS, Aurora, Bedrock integration with insights management and call reporting workflow

The real-time call processing flow consists of the following steps:

  1. When a call comes in, it’s immediately routed to the Lead API. This process captures both the live call transcript and important metadata about the caller. This system continuously processes new calls as they arrive, facilitating real-time handling of incoming communications.
  2. The captured transcript is forwarded to Amazon Bedrock for analysis. The system currently uses a standardized base prompt for all customers, and the architecture is designed to allow for customer-specific prompt customization as an added layer of context.
  3. Amazon Nova Micro processes the transcript and returns a structured JSON response. This response includes multiple analysis components: sentiment analysis of the conversation, a concise call summary, identified key terms, overall call theme classification, and specific coaching suggestions for improvement (see the sketch after these steps).
  4. All analysis results are systematically stored in an Amazon Aurora database with their associated key metrics. This makes sure the processed data is properly indexed and readily available for both immediate access and future analysis.
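
As a hedged illustration of step 3, the following sketch shows one way to send a call transcript to Amazon Nova Micro through the Amazon Bedrock Converse API and parse a structured JSON response. The prompt wording, output field names, and inference settings are assumptions for this example, not Netsertive's production configuration.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

ANALYSIS_PROMPT = """Analyze the following call transcript. Respond with JSON only, using the keys:
sentiment, summary, key_terms, call_theme, coaching_suggestions.

Transcript:
{transcript}"""

def analyze_transcript(transcript: str) -> dict:
    # Invoke Amazon Nova Micro with the transcript embedded in the prompt
    response = bedrock_runtime.converse(
        modelId="amazon.nova-micro-v1:0",
        messages=[{"role": "user", "content": [{"text": ANALYSIS_PROMPT.format(transcript=transcript)}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    # The model is asked to return JSON only, so parse the first content block
    return json.loads(response["output"]["message"]["content"][0]["text"])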

The aggregate report schedule flow consists of the following steps:

  1. The aggregate analysis process automatically initiates on both weekly and monthly schedules. During each run, the system gathers call data that falls within the specified time period.
  2. This aggregate analysis also uses Amazon Nova Micro on Amazon Bedrock, applying a specialized prompt designed specifically for trend analysis. This prompt differs from the real-time analysis prompt because it focuses on identifying patterns and insights across multiple calls.

The processed aggregate data from both workflows is transformed into comprehensive reports displaying trend analysis and comparative metrics through the UI. This provides stakeholders with valuable insights into performance patterns and trends over time while allowing the user to dive deeper into specific metrics.

Results

The implementation of generative AI to create a real-time call data analysis solution has been a transformative journey for Netsertive. Their new Call Insights AI feature, using Amazon Nova Micro on Amazon Bedrock, only takes minutes to create actionable insights, compared to their previous manual call review processes, which took hours or even days for customers with high call volumes.

Netsertive chose Amazon Bedrock and Amazon Nova Micro for their solution after a swift evaluation period of approximately 1 week of testing different tools and models. Their development approach was methodical and customer-focused. The Call Insights AI feature was added to their platform’s roadmap based on direct customer feedback and internal marketing expertise. The entire development process, from creating and testing their Amazon Nova Micro prompts to integrating Amazon Bedrock with their MLX platform, was completed within approximately 30 days before launching in beta.

The transformation of real-time call data analysis isn’t just about processing more calls—it’s about creating a more comprehensive understanding of customer interactions. By implementing Amazon Bedrock and Amazon Nova Micro, Netsertive is able to better understand call purposes and value, enhance measurement capabilities, and progress towards more automated and efficient analysis systems. This evolution can not only streamline operations but also provide customers with more actionable insights about their digital marketing performance.

Conclusion

In this post, we shared how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova. This solution helped scale their MLX platform to provide their customers with instant, actionable insights, creating a more engaging and informative user experience. By using the advanced natural language processing capabilities of Amazon Bedrock and the high-performance, low-latency Amazon Nova Micro model, Netsertive was able to build a comprehensive call intelligence system that goes beyond just transcription and sentiment analysis.

The success of this project has demonstrated the transformative potential of generative AI in driving business intelligence and operational efficiency. To learn more about building powerful, generative AI assistants and applications using Amazon Bedrock and Amazon Nova, see Generative AI on AWS.


About the authors

Nicholas Switzer is an AI/ML Specialist Solutions Architect at Amazon Web Services. He joined AWS in 2022 and specializes in AI/ML, generative AI, IoT, and edge AI. He is based in the US and enjoys building intelligent products that improve everyday life.

Jane Ridge is Senior Solutions Architect at Amazon Web Services with over 20 years of technology experience. She joined AWS in 2020 and is based in the US. She is passionate around enabling growth of her customers through innovative solutions combined with her deep technical expertise in the AWS ecosystem. She is known for her ability to guide customers through all stages of their cloud journey and deliver impactful solutions.

Herb Brittner is the Vice President of Product & Engineering at Netsertive, where he leads the development of AI-driven digital marketing solutions for multi-location brands and franchises. With a strong background in product innovation and scalable engineering, he specializes in using machine learning and cloud technologies to drive business insights and customer engagement. Herb is passionate about building data-driven platforms that enhance marketing performance and operational efficiency.

Read More

Make videos accessible with automated audio descriptions using Amazon Nova

According to the World Health Organization, more than 2.2 billion people globally have vision impairment. For compliance with disability legislation, such as the Americans with Disabilities Act (ADA) in the United States, media in visual formats like television shows or movies are required to provide accessibility to visually impaired people. This often comes in the form of audio description tracks that narrate the visual elements of the film or show. According to the International Documentary Association, creating audio descriptions can cost $25 per minute (or more) when using third parties. For building audio descriptions internally, the effort for businesses in the media industry can be significant: according to the American Council of the Blind (ACB), it requires content creators, audio description writers, description narrators, audio engineers, delivery vendors, and more. This leads to a natural question: can you automate this process with the help of generative AI offerings in Amazon Web Services (AWS)?

Newly announced in December at re:Invent 2024, the Amazon Nova family of foundation models (FMs) is available through Amazon Bedrock and includes three multimodal models:

  • Amazon Nova Lite (GA) – A low-cost multimodal model that’s lightning-fast for processing image, video, and text inputs
  • Amazon Nova Pro (GA) – A highly capable multimodal model with a balanced combination of accuracy, speed, and cost for a wide range of tasks
  • Amazon Nova Premier (GA) – Our most capable model for complex tasks and a teacher for model distillation

In this post, we demonstrate how you can use services like Amazon Nova, Amazon Rekognition, and Amazon Polly to automate the creation of accessible audio descriptions for video content. This approach can significantly reduce the time and cost required to make videos accessible for visually impaired audiences. However, this post doesn’t provide a complete, deployment-ready solution. We share pseudocode snippets and guidance in sequential order, along with detailed explanations and links to resources. For a complete script, you can use additional resources, such as Amazon Q Developer, to build a fully functional system.

The automated workflow described in this post involves analyzing video content, generating text descriptions, and narrating them using AI voice generation. While powerful, this workflow requires careful integration and testing to deploy effectively. By the end of this post, you’ll understand the key steps, but some additional work is needed to create a production-ready solution for your specific use case.

Solution overview

The following architecture diagram demonstrates the end-to-end workflow of the proposed solution. We describe each component in depth in later sections of this post, but note that you can define the logic within a single script. You can then run your script on an Amazon Elastic Compute Cloud (Amazon EC2) instance or on your local computer. For this post, we assume that you will run the script on an Amazon SageMaker notebook.

End-to-end AWS workflow demonstrating video content analysis using AI services to generate text descriptions and audio narration

Services used

The services shown in the architecture diagram include:

  1. Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that provides scalable, durable, and highly available storage. In this example, we use Amazon S3 to store the input video files and the output scene descriptions (text files) and audio descriptions (MP3 files) generated by the solution. The script starts by fetching the source video from an S3 bucket.
  2. Amazon Rekognition – Amazon Rekognition is a computer vision service that can detect and extract video segments or scenes by identifying technical cues such as shot boundaries, black frames, and other visual elements. To yield higher accuracy for the generated video descriptions, you use Amazon Rekognition to segment the source video into smaller chunks before passing it to Amazon Nova. These video segments can be stored in a temporary directory on your compute machine.
  3. Amazon Bedrock – Amazon Bedrock is a managed service that provides access to large, pre-trained AI models such as the Amazon Nova Pro model, which is used in this solution to analyze the content of each video segment and generate detailed scene descriptions. You can store these text descriptions in a text file (for example, video_analysis.txt).
  4. Amazon Polly – Amazon Polly is a text-to-speech service that is used to convert the text descriptions generated by the Amazon Nova Pro model into high-quality audio, made available using an MP3 file.

Prerequisites

To follow along with the solution outlined in this post, you should have the following in place:

You can use AWS SDK to create, configure, and manage AWS services. For Boto3, you can include it at the top of your script using: import boto3

Additionally, you need a mechanism to split videos. If you’re using Python, we recommend the moviepy library.
import moviepy # pip install moviepy

Solution walkthrough

The solution includes the following basic steps, which you can use as a basic structure and customize or expand to fit your use case.

  1. Define the requirements for the AWS environment, including defining the use of the Amazon Nova Pro model for its visual support and the AWS Region you’re working in. For optimal throughput, we recommend using inference profiles when configuring Amazon Bedrock to invoke the Amazon Nova Pro model. Initialize a client for Amazon Rekognition, which you use for its support of segmentation.
CLASS VideoAnalyzer:
	FUNCTION initialize():
 		Set AWS_REGION to "us-east-1"
 		Set MODEL_ID to "amazon.nova-pro-v1:0"
 		Set chunk_delay to 20
 		Initialize AWS clients (Bedrock and Rekognition)
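
A minimal, runnable version of this initialization might look like the following Python sketch. The Region, delay value, and client choices mirror the pseudocode above, and the sketch assumes Boto3 is installed and configured with credentials.

import boto3

class VideoAnalyzer:
    def __init__(self):
        self.aws_region = "us-east-1"
        self.model_id = "amazon.nova-pro-v1:0"   # Amazon Nova Pro for multimodal (video) analysis
        self.chunk_delay = 20                    # seconds to pause between chunk analyses
        # Runtime client for model inference and a Rekognition client for segmentation
        self.bedrock_runtime = boto3.client("bedrock-runtime", region_name=self.aws_region)
        self.rekognition = boto3.client("rekognition", region_name=self.aws_region)
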
  2. Define a function for detecting segments in the video. Amazon Rekognition supports segmentation, which means users have the option to detect and extract different segments or scenes within a video. By using the Amazon Rekognition Segment API, you can perform the following:
    1. Detect technical cues such as black frames, color bars, opening and end credits, and studio logos in a video.
    2. Detect shot boundaries to identify the start, end, and duration of individual shots within the video.

The solution uses Amazon Rekognition to partition the video into multiple segments and perform Amazon Nova Pro-based inference on each segment. Finally, you can piece together each segment’s inference output to return a comprehensive audio description for the entire video.

FUNCTION get_segment_results(job_id):
 	TRY:
 	   Initialize empty segments list 
 	   WHILE more results exist:
 	         Get segment detection results 
                Add segments to list 
                IF no more results THEN break
          RETURN segments 
       CATCH any errors and return null 

FUNCTION extract_scene(video_path, start_time, end_time):
       TRY: 
           Load video file 
           Validate time range
           Create temporary directory 
           Extract video segment 
           Save segment to file 
           RETURN path to saved segment 
       CATCH any errors and return null
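
The extract_scene step could be implemented with moviepy along the following lines. This is a sketch that assumes the moviepy 1.x API (VideoFileClip.subclip) and writes segments to a temporary directory; adapt the paths and codecs for your environment.

import os
import tempfile
from moviepy.editor import VideoFileClip   # moviepy 1.x import path

def extract_scene(video_path, start_time, end_time):
    try:
        clip = VideoFileClip(video_path)
        # Clamp the requested range to the actual clip duration
        end_time = min(end_time, clip.duration)
        if start_time >= end_time:
            return None
        segment = clip.subclip(start_time, end_time)
        out_dir = tempfile.mkdtemp(prefix="scenes_")
        out_path = os.path.join(out_dir, f"segment_{start_time:.3f}_{end_time:.3f}.mp4")
        segment.write_videofile(out_path, audio_codec="aac", logger=None)
        return out_path
    except Exception as exc:
        print(f"Failed to extract scene: {exc}")
        return None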

Three coffee cups on checkered tablecloth and close-up of coffee grounds in cup

In the preceding image, there are two scenes: a screenshot of one scene on the left followed by the scene that immediately follows it on the right. With the Amazon Rekognition segmentation API, you can identify that the scene has changed—that the content that is displayed on screen is different—and therefore you need to generate a new scene description.

  3. Create the segmentation job:
    • Upload the video file for which you want to create an audio description to Amazon S3.
    • Start the job using that video.

Setting SegmentTypes=['SHOT'] identifies the start, end, and duration of a scene. Additionally, MinSegmentConfidence sets the minimum confidence Amazon Rekognition must have to return a detected segment, with 0 being lowest confidence and 100 being highest.
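
A sketch of starting the segmentation job and collecting its shots might look like the following. The bucket, key, and confidence threshold are placeholders, and a production script would typically poll less aggressively or use an Amazon SNS notification channel instead of a fixed sleep.

import time
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

def detect_shots(bucket, key, min_confidence=80.0):
    start = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        SegmentTypes=["SHOT"],
        Filters={"ShotFilter": {"MinSegmentConfidence": min_confidence}},
    )
    job_id = start["JobId"]

    # Poll until the job finishes, then page through all detected segments
    segments, next_token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        result = rekognition.get_segment_detection(**kwargs)
        status = result["JobStatus"]
        if status == "IN_PROGRESS":
            time.sleep(10)
            continue
        if status == "FAILED":
            raise RuntimeError(result.get("StatusMessage", "Segment detection failed"))
        segments.extend(result["Segments"])
        next_token = result.get("NextToken")
        if not next_token:
            return segments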

  4. Use the analyze_chunk function. This function defines the main logic of the audio description solution. Some items to note about analyze_chunk:
    • For this example, we sent a video scene to Amazon Nova Pro for an analysis of the contents using the prompt Describe what is happening in this video in detail. This prompt is relatively straightforward and experimentation or customization for your use case is encouraged. Amazon Nova Pro then returned the text description for our video scene.
    • For longer videos with many scenes, you might encounter throttling. This is resolved by implementing a retry mechanism. For details on throttling and quotas for Amazon Bedrock, see Quotas for Amazon Bedrock.
FUNCTION analyze_chunk(chunk_path): 
     TRY: 
        Convert video chunk to base64 
        Create request body for Bedrock 
        Set max_retries and backoff_time 

        WHILE retry_count < max_retries:
          TRY:
             Send InvokeModel request to Bedrock
             RETURN analysis results 
          CATCH throttling: 
              Wait and retry with exponential backoff 
          CATCH other errors: 
              Return null 
     CATCH any errors:
         Return null
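
A hedged Python sketch of analyze_chunk follows. It assumes the Amazon Nova messages-v1 request schema for InvokeModel and a simple exponential backoff on throttling; adjust the prompt, token limits, and retry policy for your use case.

import base64
import json
import time

import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "amazon.nova-pro-v1:0"

def analyze_chunk(chunk_path, max_retries=5):
    with open(chunk_path, "rb") as f:
        video_b64 = base64.b64encode(f.read()).decode("utf-8")

    body = {
        "schemaVersion": "messages-v1",   # Amazon Nova request schema (assumed current version)
        "messages": [{
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_b64}}},
                {"text": "Describe what is happening in this video in detail."},
            ],
        }],
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.2},
    }

    backoff = 5
    for attempt in range(max_retries):
        try:
            response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
            result = json.loads(response["body"].read())
            return result["output"]["message"]["content"][0]["text"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "ThrottlingException":
                time.sleep(backoff)
                backoff *= 2          # exponential backoff on throttling
            else:
                print(f"InvokeModel failed: {err}")
                return None
    return None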

In effect, the raw scenes are converted into rich, descriptive text. Using this text, you can generate a complete scene-by-scene walkthrough of the video and send it to Amazon Polly for audio.

  5. Use the following code to orchestrate the process:
    1. Initiate the detection of the various segments by using Amazon Rekognition.
    2. Each segment is processed through a flow of:
      1. Extraction.
      2. Analysis using Amazon Nova Pro.
      3. Compiling the analysis into a video_analysis.txt file.
  6. The analyze_video function brings together all the components and produces a text file that contains the complete, scene-by-scene analysis of the video contents, with timestamps.
FUNCTION analyze_video(video_path, bucket): 
     TRY: 
         Start segment detection 
         Wait for job completion 
         Get segments 
         FOR each segment: 
             Extract scene 
             Analyze chunk 
             Save analysis results 
         Write results to file 
      CATCH any errors

If you refer back to the previous screenshot, the output—without any additional refinement—will look similar to the following image.

Three coffee cups on checkered tablecloth and close-up of coffee grounds in cup

“Segment 103.136-126.026 seconds:
[{'text': 'The video shows a close-up of a coffee cup with steam rising from it, followed by three cups of coffee on a table with milk and sugar jars. A person then picks up a bunch of coffee beans from a plant.'}]
Segment 126.059-133.566 seconds:
[{'text': "The video starts with a person's hand, covered in dirt and holding a branch with green leaves and berries. The person then picks up some berries. The video then shows a man standing in a field with trees and plants. He is holding a bunch of red fruits in his right hand and looking at them. He is wearing a shirt and has a mustache. He seems to be picking the fruits. The fruits are probably coffee beans. The area is surrounded by green plants and trees."}]” 

The following screenshot shows a more extensive look at the video_analysis.txt file for the coffee.mp4 video:

Detailed video analysis text file displaying 12 chronological segments with timestamps, describing a day's journey from waking up to coffee cultivation and brewing.

  7. Send the contents of the text file to Amazon Polly. Amazon Polly adds a voice to the text file, completing the workflow of the audio description solution.
FUNCTION generate_audio(text_file, output_audio_file):
     TRY:
        Read analysis text
        Set max_retries and backoff_time

        WHILE retry_count < max_retries:
           TRY:
              Initialize Polly client
              Convert text to speech
              Save audio file
              RETURN success
           CATCH throttling:
              Wait with exponential backoff
              retry_count += 1
           CATCH other errors:
              retry_count += 1
              Continue or Break based on error type
     CATCH any errors:
         RETURN error
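
A minimal sketch of the Amazon Polly step is shown below. Note that SynthesizeSpeech caps input length at a few thousand characters, so a long analysis file may need to be split into chunks or submitted with StartSpeechSynthesisTask instead, and the voice choice here is only an example.

import boto3

polly = boto3.client("polly", region_name="us-east-1")

def generate_audio(text_file, output_audio_file, voice_id="Joanna"):
    with open(text_file, "r", encoding="utf-8") as f:
        text = f.read()

    # SynthesizeSpeech limits input size; chunk the text or use
    # start_speech_synthesis_task for long scripts
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine="neural",
    )
    with open(output_audio_file, "wb") as out:
        out.write(response["AudioStream"].read())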

For a list of different voices that you can use in Amazon Polly, see Available voices in the Amazon Polly Developer Guide.

Your final output with Polly should sound something like this:

Clean up

It’s a best practice to delete the resources you provisioned for this solution. If you used an EC2 instance or a SageMaker notebook instance, stop or terminate it. Remember to delete unused files from your S3 bucket (for example, video_analysis.txt and video_analysis.mp3).

Conclusion

Recapping the solution at a high level, in this post, you used:

  • Amazon S3 to store the original video, intermediate data, and the final audio description artifacts
  • Amazon Rekognition to partition the video file into time-stamped scenes
  • Computer vision capabilities from Amazon Nova Pro (available through Amazon Bedrock) to analyze the contents of each scene

We showed you how to use Amazon Polly to create an MP3 audio file from the final scene description text file, which is what will be consumed by the audience members. The solution outlined in this post demonstrates how to fully automate the process of creating audio descriptions for video content to improve accessibility. By using Amazon Rekognition for video segmentation, the Amazon Nova Pro model for scene analysis, and Amazon Polly for text-to-speech, you can generate a comprehensive audio description track that narrates the key visual elements of a video. This end-to-end automation can significantly reduce the time and cost required to make video content accessible for visually impaired audiences, helping businesses and organizations meet their accessibility goals. With the power of AWS AI services, this solution provides a scalable and efficient way to improve accessibility and inclusion for video-based media.

This solution isn’t limited to using it for TV shows and movies. Any visual media that requires accessibility can be a candidate! For more information about the new Amazon Nova model family and the amazing things these models can do, see Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance.

In addition to the steps described in this post, additional actions you might need to take include:

  • Removing introductory text from the Amazon Nova responses. When Amazon Nova returns a response, it might begin with something like “In this video…” or similar. You probably want just the video description itself, because if introductory text remains in your scene descriptions, Amazon Polly will speak it aloud and reduce the quality of your audio descriptions. You can account for this in a few ways.
    • For example, prior to sending it to Amazon Polly, you can modify the generated scene descriptions by programmatically removing that type of text from them.
    • Alternatively, you can use prompt engineering to request that Amazon Bedrock return only the scene descriptions in a structured format or without any additional commentary.
    • The third option is to define and use a tool when performing inference on Amazon Bedrock. This can be a more comprehensive technique for defining the format of the output that you want Amazon Bedrock to return. Using tools to shape model output is known as function calling. For more information, see Use a tool to complete an Amazon Bedrock model response.
  • You should also be mindful of the architectural components of the solution. In a production environment, being mindful of any potential scaling, security, and storage elements is important because the architecture might begin to resemble something more complex than the basic solution architecture diagram that this post began with.

About the Authors

Dylan Martin is an AWS Solutions Architect, working primarily in the generative AI space helping AWS Technical Field teams build AI/ML workloads on AWS. He brings his experience as both a security solutions architect and software engineer. Outside of work he enjoys motorcycling, the French Riviera and studying languages.

Ankit Patel is an AWS Solutions Developer, part of the Prototyping And Customer Engineering (PACE) team. Ankit helps customers bring their innovative ideas to life by rapid prototyping; using the AWS platform to build, orchestrate, and manage custom applications.

Read More

Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.

The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

This post is organized as follows:

  • Overview of Llama 3.3 Swallow
  • Architecture for Llama 3.3 Swallow training
  • Software stack and optimizations employed in Llama 3.3 Swallow training
  • Experiment management

We discuss topics relevant to machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. We welcome readers who understand model parallelism and optimization techniques, especially those interested in continuous pre-training and supervised fine-tuning approaches.

Overview of Llama 3.3 Swallow

Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta’s Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).

The model is available in two variants on Hugging Face: a base model and an instruction-tuned model. Both variants are accessible through the tokyotech-llm organization on Hugging Face, providing researchers and developers with flexible options for different application needs.

Training methodology

The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data primarily consisted of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for base model training, totaling approximately 314 billion tokens. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (H100, 80 GB, 256 GPUs total) for continual pre-training over 16 days and 6 hours.

Training Data | Number of Training Tokens
Japanese Swallow Corpus v2 | 210 billion
Japanese Wikipedia | 5.3 billion
English Wikipedia | 6.9 billion
English Cosmopedia | 19.5 billion
English DCLM baseline | 12.8 billion
Laboro ParaCorpus | 1.4 billion
Code Swallow-Code | 50.2 billion
Math Finemath-4+ | 7.85 billion

For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created through supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the previous Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.

Training Data | Number of Training Samples
Gemma-2-LMSYS-Chat-1M-Synth | 240,000
Swallow-Magpie-Ultra-v0.1 | 42,000
Swallow-Gemma-Magpie-v0.1 | 99,000
Swallow-Code-v0.3-Instruct-style | 380,000

Performance and benchmarks

The base model has demonstrated remarkable performance in Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI’s GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model’s enhanced ability to understand and generate Japanese text. The following graph illustrates the base model performance comparison across these different benchmarks (original image).

The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics (original image).

Licensing and usage

The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.

Training infrastructure architecture

The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store lifecycle scripts, which are used to configure the SageMaker HyperPod cluster.

Before deploying this infrastructure, it’s essential to make sure the AWS account has the appropriate service quotas. The deployment of SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quota against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.

The following diagram illustrates the high-level architecture of the training infrastructure.

Compute and network configuration

The compute infrastructure is based on SageMaker HyperPod using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single spine configuration to provide minimal latency between instances. All communication between GPUs is handled through NCCL over an Elastic Fabric Adapter (EFA), providing high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.

Storage architecture

The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, crucial for handling the massive datasets required for the 70-billion-parameter model.

The following diagram illustrates the storage hierarchy implementation.

The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:


aws fsx create-data-repository-association \
    --file-system-id ${FSX_ID} \
    --file-system-path "/hsmtest" \
    --data-repository-path s3://${BUCKET_NAME_DATA} \
    --s3 'AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}' \
    --batch-import-meta-data-on-create \
    --region ${AWS_REGION}

Observability stack

The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for any anomalies through Grafana Dashboards. The following screenshot shows an example of a GPU health dashboard.

Software stack and training optimizations

The training environment is built on SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu base Amazon Machine Image (AMI) with essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.

Distributed training implementation

The training implementation uses Megatron-LM’s advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, along with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.

Advanced parallelism and communication

The team used a comprehensive 4D parallelism strategy in Megatron-LM that maximizes GPU utilization through careful optimization of communication patterns across four dimensions: data, tensor, pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension—together enabling efficient training of massive models.

The implementation overlaps communication across data parallelism, tensor parallelism, and pipeline parallelism domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training (original image).

Megatron-LM enables fine-grained communication overlapping through multiple configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.

Checkpointing strategy

The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by tensor and pipeline parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, enables the system to:

  • Save checkpoints up to 10 times faster compared to synchronous approaches
  • Minimize training interruption by offloading I/O operations
  • Scale checkpoint performance with the total number of GPUs
  • Maintain consistency through coordinated distributed saves

The checkpointing system automatically saves model states to the FSx Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3 storage.

For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).
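
For reference, asynchronous checkpointing with PyTorch DCP can be sketched as follows. This is a simplified illustration that assumes PyTorch 2.3 or later; it is not the exact Megatron-LM integration used for Llama 3.3 Swallow, and the directory layout is a placeholder.

import torch.distributed.checkpoint as dcp

def save_checkpoint_async(model, optimizer, step, checkpoint_dir):
    """Start an asynchronous distributed checkpoint save and return a future."""
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    # async_save offloads serialization and I/O so training iterations can continue
    # while the checkpoint is written to the FSx for Lustre file system.
    future = dcp.async_save(state_dict, checkpoint_id=f"{checkpoint_dir}/step_{step}")
    return future  # call future.result() later to confirm the save completed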

Experiment management

In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all possible 4D parallelism configurations. Based on proven algorithmic research, this tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.

The following screenshot shows an example of the memory consumption prediction tool interface (original image).

Training pipeline management

The success of the training process heavily relied on maintaining high-quality data pipelines. The team implemented rigorous data curation processes and robust cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains.

For experiment planning, version control was critical. The team first fixed the versions of the pre-training and instruction tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to provide reproducibility. After the versions were locked, the team conducted short-duration training runs to:

  • Measure throughput with different numbers of GPU nodes
  • Search for optimal configurations among distributed training settings identified by the memory prediction library
  • Establish accurate training time estimates for scheduling

The following screenshot shows an example experiment schedule showing GPU node allocation, expected training duration, and key milestones across different training phases (original image).

To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:

# Preload data to Lustre filesystem
find <data/path> -type f -print0 | xargs -0 -n 1 -P 8 sudo lfs hsm_restore

Monitoring and performance management

The team implemented a comprehensive monitoring system focused on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a set of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack (original image).

The monitoring infrastructure excels at identifying both job failures and performance bottlenecks like stragglers. The following figure presents an example of straggler detection showing training throughput degradation.
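
As a complement to these dashboards, the following sketch illustrates the kind of automated alerting described above using the Weights & Biases Python client. The throughput threshold and message text are hypothetical, and the actual Llama 3.3 Swallow pipeline may wire this up differently.

import wandb

THROUGHPUT_FLOOR_TPS = 1.0e5   # hypothetical tokens/sec floor for this cluster size

def log_step(step, tokens_per_second):
    # Assumes wandb.init(...) was called at the start of the training job
    wandb.log({"throughput/tokens_per_second": tokens_per_second}, step=step)
    # Fire an alert (routed to Slack via W&B alert settings) if throughput degrades,
    # a typical symptom of a straggling node or a network issue.
    if tokens_per_second < THROUGHPUT_FLOOR_TPS:
        wandb.alert(
            title="Training throughput degradation",
            text=f"Step {step}: throughput dropped to {tokens_per_second:,.0f} tokens/sec",
            level=wandb.AlertLevel.WARN,
        )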

Conclusion

The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs using cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.

The project’s success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy implementation has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model.

Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tools, so other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, along with continued enhancement of Japanese language capabilities. The project’s success also paves the way for expanded model applications across various domains.

Resources and references

This section provides key resources and references for understanding and replicating the work described in this post. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.

Documentation

The following resources provide detailed information about the technologies and frameworks used in this project:

Model resources

For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:


About the Authors

Kazuki Fujii graduated with a bachelor’s degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master’s student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model’s performance and infrastructure efficiency.

Daisuke Miyamato is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He is mainly supporting HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.

Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development initiatives and advanced computing projects.

Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Read More