Scalable intelligent document processing using Amazon Bedrock Data Automation

Intelligent document processing (IDP) is a technology to automate the extraction, analysis, and interpretation of critical information from a wide range of documents. By using advanced machine learning (ML) and natural language processing algorithms, IDP solutions can efficiently extract and process structured data from unstructured text, streamlining document-centric workflows.

When enhanced with generative AI capabilities, IDP enables organizations to transform document workflows through advanced understanding, structured data extraction, and automated classification. Generative AI-powered IDP solutions can better handle the variety of documents that traditional ML models might not have seen before. This technology combination is impactful across multiple industries, including child support services, insurance, healthcare, financial services, and the public sector. Traditional manual processing creates bottlenecks and increases error risk, but by implementing these advanced solutions, organizations can dramatically enhance their document workflow efficiency and information retrieval capabilities. AI-enhanced IDP solutions improve service delivery while reducing administrative burden across diverse document processing scenarios.

This approach to document processing provides scalable, efficient, and high-value document processing that leads to improved productivity, reduced costs, and enhanced decision-making. Enterprises that embrace the power of IDP augmented with generative AI can benefit from increased efficiency, enhanced customer experiences, and accelerated growth.

In the blog post Scalable intelligent document processing using Amazon Bedrock, we demonstrated how to build a scalable IDP pipeline using Anthropic foundation models on Amazon Bedrock. Although that approach delivered robust performance, the introduction of Amazon Bedrock Data Automation brings a new level of efficiency and flexibility to IDP solutions. This post explores how Amazon Bedrock Data Automation enhances document processing capabilities and streamlines the automation journey.

Benefits of Amazon Bedrock Data Automation

Amazon Bedrock Data Automation introduces several features that significantly improve the scalability and accuracy of IDP solutions:

  • Confidence scores and bounding box data – Amazon Bedrock Data Automation provides confidence scores and bounding box data, enhancing data explainability and transparency. With these features, you can assess the reliability of extracted information, resulting in more informed decision-making. For instance, low confidence scores can signal the need for additional human review or verification of specific data fields.
  • Blueprints for rapid development – Amazon Bedrock Data Automation provides pre-built blueprints that simplify the creation of document processing pipelines, helping you develop and deploy solutions quickly. Amazon Bedrock Data Automation provides flexible output configurations to meet diverse document processing requirements. For simple extraction use cases (OCR and layout) or for a linearized output of the text in documents, you can use standard output. For customized output, you can start from scratch to design a unique extraction schema, or use preconfigured blueprints from our catalog as a starting point. You can customize your blueprint based on your specific document types and business requirements for more targeted and accurate information retrieval.
  • Automatic classification support – Amazon Bedrock Data Automation splits and matches documents to appropriate blueprints, resulting in precise document categorization. This intelligent routing alleviates the need for manual document sorting, drastically reducing human intervention and accelerating processing time.
  • Normalization – Amazon Bedrock Data Automation addresses a common IDP challenge through its comprehensive normalization framework, which handles both key normalization (mapping various field labels to standardized names) and value normalization (converting extracted data into consistent formats, units, and data types). This normalization approach helps reduce data processing complexities, so organizations can automatically transform raw document extractions into standardized data that integrates more smoothly with their existing systems and workflows.
  • Transformation – The Amazon Bedrock Data Automation transformation feature converts complex document fields into structured, business-ready data by automatically splitting combined information (such as addresses or names) into discrete, meaningful components. This capability simplifies how organizations handle varied document formats, helping teams define custom data types and field relationships that match their existing database schemas and business applications.
  • Validation – Amazon Bedrock Data Automation enhances document processing accuracy by using automated validation rules for extracted data, supporting numeric ranges, date formats, string patterns, and cross-field checks. This validation framework helps organizations automatically identify data quality issues, trigger human reviews when needed, and make sure extracted information meets specific business rules and compliance requirements before entering downstream systems.

Solution overview

The following diagram shows a fully serverless architecture that uses Amazon Bedrock Data Automation along with AWS Step Functions and Amazon Augmented AI (Amazon A2I) to provide cost-effective scaling for document processing workloads of different sizes.

AWS Architecture Diagram Showing Document Processing using Amazon Bedrock Data Automation and Human in the Loop

The Step Functions workflow processes multiple document types including multipage PDFs and images using Amazon Bedrock Data Automation. It uses various Amazon Bedrock Data Automation blueprints (both standard and custom) within a single project to enable processing of diverse document types such as immunization documents, conveyance tax certificates, child support services enrollment forms, and driver licenses.

The workflow processes a file (PDF, JPG, PNG, TIFF, DOC, DOCX) containing a single document or multiple documents through the following steps:

  1. For multi-page documents, splits along logical document boundaries
  2. Matches each document to the appropriate blueprint
  3. Applies the blueprint’s specific extraction instructions to retrieve information from each document
  4. Performs normalization, transformation, and validation on the extracted data according to the instructions specified in the blueprint (see the sketch following these steps)
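
The heavy lifting in these steps is performed by the Amazon Bedrock Data Automation runtime API. The following is a minimal sketch of how a workflow task might submit a file for processing and poll for the result using boto3; the project ARN, profile ARN, S3 locations, and status values are placeholder assumptions, so verify the parameter names against the current SDK documentation.

import time

import boto3

bda_runtime = boto3.client("bedrock-data-automation-runtime")

def process_document(input_s3_uri: str, output_s3_uri: str) -> str:
    """Submit a document to Amazon Bedrock Data Automation and wait for completion."""
    # Start an asynchronous job against the project that holds the custom blueprints
    # (child support enrollment form, driver's license, and so on).
    response = bda_runtime.invoke_data_automation_async(
        inputConfiguration={"s3Uri": input_s3_uri},
        outputConfiguration={"s3Uri": output_s3_uri},
        dataAutomationConfiguration={
            "dataAutomationProjectArn": "arn:aws:bedrock:us-east-1:111122223333:data-automation-project/EXAMPLE",
            "stage": "LIVE",
        },
        dataAutomationProfileArn="arn:aws:bedrock:us-east-1:111122223333:data-automation-profile/us.data-automation-v1",
    )
    invocation_arn = response["invocationArn"]

    # Poll until the job finishes; in the deployed solution, Step Functions manages
    # this wait instead of a client-side loop. Status values are assumptions here.
    while True:
        status = bda_runtime.get_data_automation_status(invocationArn=invocation_arn)
        if status["status"] in ("Success", "ServiceError", "ClientError"):
            return status["status"]
        time.sleep(10)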

The Step Functions Map state is used to process each document. If a document meets the confidence threshold, the output is sent to an Amazon Simple Storage Service (Amazon S3) bucket. If any extracted data falls below the confidence threshold, the document is sent to Amazon A2I for human review. Reviewers use the Amazon A2I UI with bounding box highlighting for selected fields to verify the extraction results. When the human review is complete, the callback task token is used to resume the state machine, and the human-reviewed output is sent to an S3 bucket.
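
The routing decision can be made by a lightweight check over the Bedrock Data Automation output. The following sketch illustrates the idea; the field names (inference_result, explainability_info, confidence) and the threshold value are assumptions for this example rather than a definitive schema.

CONFIDENCE_THRESHOLD = 0.80  # assumed business threshold for auto-approval

def needs_human_review(custom_output: dict) -> bool:
    """Return True if any extracted field falls below the confidence threshold."""
    # Per-field confidence scores are assumed to live under 'explainability_info',
    # alongside the extracted values in 'inference_result'.
    for field_group in custom_output.get("explainability_info", []):
        for field_name, details in field_group.items():
            confidence = details.get("confidence") if isinstance(details, dict) else None
            if confidence is not None and confidence < CONFIDENCE_THRESHOLD:
                return True
    return False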

To deploy this solution in an AWS account, follow the steps provided in the accompanying GitHub repository.

In the following sections, we review the specific Amazon Bedrock Data Automation features deployed using this solution, using the example of a child support enrollment form.

Automated classification

In our implementation, we define the document class name for each custom blueprint created, as illustrated in the following screenshot. When processing multiple document types, such as driver’s licenses and child support enrollment forms, the system automatically applies the appropriate blueprint based on content analysis, making sure the correct extraction logic is used for each document type.

Bedrock Data Automation interface showing Child Support Form classification detail

Data normalization

We use data normalization to make sure downstream systems receive uniformly formatted data. We use both explicit extractions (for clearly stated information visible in the document) and implicit extractions (for information that needs transformation). For example, as shown in the following screenshot, dates of birth are standardized to YYYY-MM-DD format.

Bedrock Data Automation interface displaying extracted and normalized Date of Birth data

Similarly, Social Security Numbers are standardized to the XXX-XX-XXXX format.
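
Downstream consumers can then rely on these normalized formats. As a quick illustration (not part of the deployed solution), a loader might assert the expected formats before writing records to a database; the field names here are placeholders.

import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # YYYY-MM-DD
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")    # XXX-XX-XXXX

def is_normalized(record: dict) -> bool:
    """Check that normalized fields match the formats produced by the blueprint."""
    return bool(
        DATE_PATTERN.match(record.get("date_of_birth", ""))
        and SSN_PATTERN.match(record.get("social_security_number", ""))
    )

# Example record shaped like the blueprint output (values are illustrative).
print(is_normalized({"date_of_birth": "2009-04-17", "social_security_number": "123-45-6789"}))  # True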

Data transformation

For the child support enrollment application, we’ve implemented custom data transformations to align extracted data with specific requirements. One example is our custom data type for addresses, which breaks down single-line addresses into structured fields (Street, City, State, ZipCode). These structured fields are reused across different address fields in the enrollment form (employer address, home address, other parent address), resulting in consistent formatting and straightforward integration with existing systems.

Amazon Bedrock Data Automation interface displaying custom address type configuration with explicit field mappings
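
To show what this transformed output looks like to a consuming application, the following sketch models the structured address produced by the custom data type; the field names mirror the configuration above, and the mapping helper is purely illustrative.

from dataclasses import dataclass

@dataclass
class Address:
    """Structured address fields produced by the custom address data type."""
    street: str
    city: str
    state: str
    zip_code: str

def to_address(fields: dict) -> Address:
    """Map a transformed address object from the extraction output into a typed record."""
    return Address(
        street=fields.get("Street", ""),
        city=fields.get("City", ""),
        state=fields.get("State", ""),
        zip_code=fields.get("ZipCode", ""),
    )

# The same type backs the employer, home, and other parent address fields in the enrollment form.
home = to_address({"Street": "123 Main St", "City": "Dallas", "State": "TX", "ZipCode": "75201"})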

Data validation

Our implementation includes validation rules to maintain data accuracy and compliance. For our example use case, we’ve implemented two validations: verifying the presence of the enrollee’s signature, and verifying that the signed date isn’t in the future.

Bedrock extraction interface showing signature and date validation configurations

The following screenshot shows the result of the preceding validation rules applied to the document.

Amazon Bedrock-powered document automation showing form field validation, signature verification, and confidence scoring
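
Equivalent checks can also be repeated on the consuming side before records enter downstream systems. The following sketch restates the two rules in plain Python; the field names are placeholders for this example.

from datetime import date

def validate_enrollment(form: dict) -> list:
    """Re-apply the two blueprint validation rules to an extracted enrollment record."""
    errors = []

    # Rule 1: the enrollee's signature must be present.
    if not form.get("enrollee_signature_present", False):
        errors.append("Missing enrollee signature")

    # Rule 2: the signed date must not be in the future.
    signed_date = date.fromisoformat(form.get("signed_date", "1900-01-01"))
    if signed_date > date.today():
        errors.append("Signed date is in the future")

    return errors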

Human-in-the-loop validation

The following screenshot illustrates the extraction process, which includes a confidence score and is integrated with a human-in-the-loop process. It also shows normalization applied to the date of birth format.

Amazon Bedrock Data Automation human-in-the-loop review interface
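
When a reviewer completes the Amazon A2I task, the workflow resumes through the callback task token described earlier. The following is a minimal sketch of a Lambda handler that could perform that hand-off; the event shape and output fields are assumptions, while send_task_success is the standard Step Functions callback API.

import json

import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    """Resume the Step Functions execution once the A2I human review loop completes."""
    # The task token was stored when the review was started (assumed event shape).
    task_token = event["taskToken"]
    reviewed_output = event["humanAnswers"]

    # Return the human-reviewed result to the waiting state machine.
    sfn.send_task_success(
        taskToken=task_token,
        output=json.dumps({"reviewedOutput": reviewed_output}),
    )
    return {"statusCode": 200}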

Conclusion

Amazon Bedrock Data Automation significantly advances IDP by introducing confidence scoring, bounding box data, automatic classification, and rapid development through blueprints. In this post, we demonstrated how to take advantage of its advanced capabilities for data normalization, transformation, and validation. By upgrading to Amazon Bedrock Data Automation, organizations can significantly reduce development time, improve data quality, and create more robust, scalable IDP solutions that integrate with human review processes.

Follow the AWS Machine Learning Blog to keep up to date with new capabilities and use cases for Amazon Bedrock.


About the authors

Abdul Navaz is a Senior Solutions Architect in the Amazon Web Services (AWS) Health and Human Services team, based in Dallas, Texas. With over 10 years of experience at AWS, he focuses on modernization solutions for child support and child welfare agencies using AWS services. Prior to his role as a Solutions Architect, Navaz worked as a Senior Cloud Support Engineer, specializing in networking solutions.

Venkata Kampana is a senior solutions architect in the Amazon Web Services (AWS) Health and Human Services team and is based in Sacramento, Calif. In this role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.

Sanjeev Pulapaka is principal solutions architect and AI lead for public sector. Sanjeev is a published author with several blogs and a book on generative AI. He is also a well-known speaker at several events including re:Invent and Summit. Sanjeev has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.

Whiteboard to cloud in minutes using Amazon Q, Amazon Bedrock Data Automation, and Model Context Protocol

Upgrading legacy systems has become increasingly important to stay competitive in today’s market as outdated infrastructure can cost organizations time, money, and market position. However, modernization efforts face challenges like time-consuming architecture reviews, complex migrations, and fragmented systems. These delays not only impact engineering teams but have broader impacts including lost market opportunities, reduced competitiveness, and higher operational costs. With Amazon Q Developer, Amazon Bedrock Data Automation (Bedrock Data Automation), and Anthropic’s Model Context Protocol (MCP), developers can now go from whiteboard sketches and team discussions to fully deployed, secure, and scalable cloud architectures in a matter of minutes, not months.

We’re excited to share the Amazon Bedrock Data Automation Model Context Protocol (MCP) server, for seamless integration between Amazon Q and your enterprise data. With this new capability, developers can use the features of Amazon Q while maintaining secure access to their organization’s data through standardized MCP interactions. In this post, you will learn how to use the Amazon Bedrock Data Automation MCP server to securely integrate with AWS Services, use Bedrock Data Automation operations as callable MCP tools, and build a conversational development experience with Amazon Q.

The problem: Five systems, lack of agility

Engineers looked at a whiteboard, eyeing a complex web of arrows, legacy system names, and integration points that had long stopped making sense. The diagram represented multiple disconnected systems held together by brittle scripts, fragile batch jobs, and a patchwork of manual workarounds, as shown in the following illustration.

Collaborative AWS solution design meeting with whiteboard diagrams showing cloud services integration and data flow

The meeting audio was synthesized using Amazon Polly to bring the conversation to life for this post.

“We need to stop patching and start transforming,” Alex said, pointing at the tangled mess. The team nodded, weary from another outage that left the finance team reconciling thousands of transactions by hand. Feature development had slowed to a crawl, infrastructure costs were unpredictable, and any change risked breaking something downstream. Migration felt inevitable but overwhelming. The question wasn’t whether to modernize – it was how to begin without burning months in planning and coordination. That’s when they turned to the new pattern.

The breakthrough

Just a few months ago, building a working prototype from a whiteboard session like this would have taken months, if not longer. The engineers would have started by manually transcribing the meeting, converting rough ideas into action items, cleaning up architecture diagrams, aligning teams across operations and security, and drafting infrastructure templates by hand. Every step would have required coordination, and each change made would have invited risk to the system. Even a proof-of-concept would have demanded hours of YAML, command line interface (CLI) commands, policy definitions, and trial-and-error troubleshooting. Now the engineers need only ask, and what used to take months happens in minutes.

With Amazon Q CLI, the team initiates a conversation. Behind the scenes, Amazon Q CLI invokes the MCP server and extracts information from multimodal content using Bedrock Data Automation. The meeting recording and the draft architecture diagram are also analyzed using Bedrock Data Automation. Amazon Q uses the extracted content from Bedrock Data Automation to generate the AWS CloudFormation template. It even deploys it to the AWS Cloud when asked. There is no manual translation, no brittle scripting, and no dependency mapping across systems. The result is a fully deployable, secure AWS architecture generated and provisioned in minutes. What once required cross-functional coordination and prolonged development cycles now starts and completes with a chat.

Understanding the Model Context Protocol

The Model Context Protocol (MCP) is an open standard developed by Anthropic to facilitate secure, two-way connections between AI models and multiple data sources, including content repositories, business tools, and development environments. By standardizing these interactions, MCP enables AI systems to access the data they need to provide more relevant and accurate responses.

MCP operates on a client-server architecture, where developers can either expose their data through MCP servers or build AI applications (MCP clients) that connect to these servers. This setup allows for a more streamlined and scalable integration process, replacing the need for custom connectors for each data source.

Enhancing Amazon Q with Amazon Bedrock Data Automation and MCP server

Bedrock Data Automation complements MCP by providing a robust suite of tools that automate the extraction, transformation, and loading (ETL) of enterprise data into AI workflows at scale and with minimal manual intervention. With Bedrock Data Automation, customers can:

  • Extract unstructured data from diverse sources such as document, image, audio, and video files.
  • Transform and validate data with schema-driven extraction using blueprints, confidence scoring, and responsible AI practices to maintain accuracy, completeness, and consistency.
  • Load ready-to-use data into AI models for real-time, context-aware reasoning across the business.

This deep integration makes sure that AI models are not just connected to data but grounded in clean, validated, and context-rich information. As a result, intelligent agents deliver more accurate, relevant, and reliable outputs that drive faster decisions and richer insights across the enterprise.

Amazon Q Developer is a generative AI-powered conversational assistant from AWS designed to help software developers and IT professionals build, operate, and transform software with greater speed, security, and efficiency. It acts as an intelligent coding companion and productivity tool, integrated with the AWS environment and available in popular code editors, the AWS Management Console, and collaboration tools such as Microsoft Teams and Slack. As shown in the following figure, the Bedrock Data Automation MCP server works in the following way:

  1. The User sends a “Request action” to the MCP Host.
  2. The MCP Host processes the request with an LLM.
  3. The MCP Host then requests a tool execution to the MCP Client.
  4. The MCP Client makes a tool call request to the MCP Server.
  5. The MCP Server makes an API request to Bedrock Data Automation.
  6. Bedrock Data Automation sends back an API response to the MCP Server.
  7. The MCP Server returns the tool result to the MCP Client.
  8. The MCP Client sends the result back to the MCP Host.
  9. The MCP Host again processes the result with the LLM.
  10. The MCP Host sends a final response to the User.

End-to-end request flow diagram showing MCP Host/Client/Server interaction with AWS Bedrock and LLM processing steps
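
In this solution, Amazon Q CLI plays the MCP host and client roles for you. For readers who want to exercise the server directly, the following standalone sketch uses the open source MCP Python SDK to launch the Bedrock Data Automation MCP server over stdio and list its tools; the environment values are placeholders, and the SDK usage should be checked against the current mcp package documentation.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="uvx",
    args=["awslabs.aws-bedrock-data-automation-mcp-server@latest"],
    env={
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "your-aws-region",
        "AWS_BUCKET_NAME": "amzn-s3-demo-bucket",
    },
)

async def main():
    # Start the server as a subprocess and open an MCP session over stdio.
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())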

Step-by-step guide

If this is your first time using AWS MCP servers, visit the Installation and Setup guide in the AWS Labs GitHub repository for installation instructions. After installation, add the MCP server configuration shown in the following section to your local setup.

Prerequisites

Set up MCP

Install Amazon Q for command line and add the configuration to ~/.aws/amazonq/mcp.json. If you’re already an Amazon Q CLI user, add only the configuration.

{
  "mcpServers": {
    "bedrock-data-automation-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.aws-bedrock-data-automation-mcp-server@latest"
      ],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "your-aws-region",
        "AWS_BUCKET_NAME": "amzn-s3-demo-bucket"
      }
    }
  }
}

To confirm the setup was successful, open a terminal and run q chat to start a chat session with Amazon Q.

Need to know what tools are at your disposal? Enter: "Tell me the tools I have access to"

If MCP has been properly configured, as shown in the following screenshot, you will see aws_bedrock_data_automation with getprojects, getprojectdetails, and analyzeasset as its three tools. This helps you quickly verify access and make sure that the necessary components are properly set up.

Interactive AWS terminal interface showing Q CLI, MCP and Bedrock Data Automation project management and analysis capabilities

Now, you can ask Amazon Q to use Bedrock Data Automation as a tool and extract the transcript from the meeting stored in the .mp3 file and refer to the updated architecture diagram, as shown in the following screenshot.

can you extract the meeting recording from <your-location> and refer to the updated architecture diagram from <your-location> using Bedrock Data Automation 

Terminal interface initiating AWS Bedrock analysis of MP3 and PNG files with project listing request

You can seamlessly continue a natural language conversation with Amazon Q to generate an AWS CloudFormation template, write prototype code, or even implement monitoring solutions. The potential applications are virtually endless.

Clean up

When you’re done working with the Amazon Bedrock Data Automation MCP server, follow these steps to clean up:

  1. Empty and delete the S3 buckets used for Bedrock Data Automation.
     aws s3 rm s3://amzn-s3-demo-bucket --recursive
     aws s3 rb s3://amzn-s3-demo-bucket
  2. Remove the configuration added to ~/.aws/amazonq/mcp.json for bedrock-data-automation-mcp-server.

Conclusion

With MCP and Bedrock Data Automation, Amazon Q Developer can turn messy ideas into working cloud architectures in record time. No whiteboards are left behind.

Are you ready to build smarter, faster, and more context-aware applications? Explore Amazon Q Developer and see how MCP and Amazon Bedrock Data Automation can help your team turn ideas into reality faster than ever before.


About the authors

Wrick Talukdar is a Tech Lead and Senior Generative AI Specialist at Amazon Web Services, driving innovation through multimodal AI, generative models, computer vision, and natural language processing. He is also the author of the bestselling book “Building Agentic AI Systems”. He is a keynote speaker and often presents his innovations and solutions at leading global forums, including AWS re:Invent, ICCE, Global Consumer Technology conference, and major industry events such as CERAWeek and ADIPEC. In his free time, he enjoys writing and birding photography.

Ayush Goyal is a Senior Software Engineer at Amazon Bedrock, where he focuses on designing and scaling AI-powered distributed systems. He’s also passionate about contributing to open-source projects. When he’s not writing code, Ayush enjoys speed cubing, exploring global cuisines, and discovering new parks—both in the real world and through open-world games.

Himanshu Sah is an Associate Delivery Consultant in AWS Professional Services, specialising in Application Development and Generative AI solutions. Based in India, he helps customers architect and implement cutting-edge applications leveraging AWS services and generative AI capabilities. Working closely with cross-functional teams, he focuses on delivering best-practice implementations while ensuring optimal performance and cost-effectiveness. Outside of work, he is passionate about exploring new technologies and contributing to the tech community.

Bringing agentic Retrieval Augmented Generation to Amazon Q Business

Amazon Q Business is a generative AI-powered enterprise assistant that helps organizations unlock value from their data. By connecting to enterprise data sources, employees can use Amazon Q Business to quickly find answers, generate content, and automate tasks—from accessing HR policies to streamlining IT support workflows, all while respecting existing permissions and providing clear citations. At the heart of systems like Amazon Q Business lies Retrieval Augmented Generation (RAG), which enables AI models to ground their responses in an organization’s enterprise data.

The evolution of RAG

Traditional RAG implementations typically follow a straightforward approach: retrieve relevant documents or passages based on a user query, then generate a response using these documents or passages as context for the large language model (LLM) to respond. While this methodology works well for basic, factual queries, enterprise environments present uniquely complex challenges that expose the limitations of this single-shot retrieval approach.

Consider an employee asking about the differences between two benefits packages or requesting a comparison of project outcomes across multiple quarters. These queries require synthesizing information from various sources, understanding company-specific context, and often need multiple retrieval steps to gather comprehensive information around each aspect of the query.

Traditional RAG systems struggle with such complexity, often providing incomplete answers or failing to adapt their retrieval strategy when initial results are insufficient. When processing these more involved queries, users are left waiting without visibility into the system’s progress, leading to an opaque experience.

Bringing agency to Amazon Q Business

Bringing agency to Amazon Q Business introduces a new paradigm for handling sophisticated enterprise queries through intelligent, agent-based retrieval. By introducing AI agents that dynamically plan and execute retrieval strategies with a suite of data navigation tools, Agentic RAG represents a significant evolution in how AI assistants interact with enterprise data, delivering more accurate and comprehensive responses while maintaining the speed users expect.

With Agentic RAG in Amazon Q Business, you have several new capabilities, including query decomposition and transparent events, agentic retrieval tool use, improved conversational capabilities, and agentic response optimization. Let’s dive deeper into what each of these means.

Query decomposition and transparent response events

Traditional RAG systems often face significant challenges when processing complex enterprise queries, particularly those involving multiple steps, composite elements, or comparative analysis. With this release of Agentic RAG in Amazon Q Business, we aim to solve this problem through sophisticated query decomposition techniques, where AI agents intelligently break down complex questions into discrete, manageable components.

When an employee asks Please compare the vacation policies of Washington and California?, the question is decomposed into two queries on Washington and California policies: the first decomposed query is washington state vacation policies and the second is california state vacation policies.

Because Agentic RAG involves a series of parallel steps to explore the data source and collect thorough information for more accurate query resolution, we now provide real-time visibility into its processing steps, which are displayed on the screen as data is retrieved to generate the response. After the response is generated, the steps are collapsed and the response is streamed. In the following image, we see how the decomposed queries are displayed and the relevant data retrieved for response generation.

This allows users to see meaningful updates to the system’s operations, including query decomposition patterns, document retrieval paths, and response generation workflows. This granular visibility into the system’s decision-making process enhances user confidence and provides valuable insights into the sophisticated mechanisms driving accurate response generation.

This agentic solution facilitates comprehensive data collection and enables more accurate, nuanced responses. The result is enhanced responses that maintain both granular precision and holistic understanding of complex, multi-faceted business questions, while relying on the LLM to synthesize the information retrieved. As shown in the following image, the information fetched individually for California and Washington vacation policies were synthesized by the LLM and presented in a rich markdown format.

Agentic tool use

The designed RAG agents can intelligently deploy various data exploration tools and retrieval methods in optimal strategies by thinking about the retrieval plan while maintaining context over multiple turns of the conversations. These retrieval tools include tools built within Amazon Q Business such as tabular search, allowing intelligent retrieval of data through either code generation or tabular linearization across small and large tables embedded in documents (such as DOCX, PPTX, PDF, and so on) or stored in CSV or XLSX files. Another retrieval tool includes long context retrieval, which determines when the full context of a document is required for retrieval. An example of long context retrieval: if a user asks a query such as Summarize the 10K of Company X, the agent could identify the query’s intent as a summarization query that requires document-level context and, as a result, deploy the long context retrieval tool that fetches the complete document—the 10K of Company X—as part of the context for the LLM to generate a response (as shown in the following figure). This intelligent tool selection and deployment represents a significant advancement over traditional RAG systems, which often rely on fragmented passage retrieval that can compromise the coherence and completeness of complex document analysis for question answering.

Improved conversational capabilities

Agentic RAG introduces multi-turn query capabilities that elevate the conversational capabilities of Amazon Q Business into dynamic, context-aware dialogues. The agent maintains conversational context across interactions by storing short-term memory, enabling natural follow-up questions without requiring users to restate previous context. Additionally, when the agent encounters multiple possible answers based on your enterprise data, it asks clarifying questions to disambiguate the query to better understand what you’re looking for to provide more accurate responses. For instance, Q refers to any of the many implementations of Amazon Q. The system handles semantic ambiguity gracefully by recognizing multiple potential interpretations of what Q could be and asks for clarifications in its responses to verify accuracy and relevance. This sophisticated approach to dialogue management makes complex tasks like policy interpretation or technical troubleshooting more efficient, because the system can progressively refine its understanding through targeted clarification and follow-up exchanges.

In the following image, the user asks tell me about Q with the system providing a high-level overview of the various implementations and asking a follow-up question to disambiguate the user’s search intent.

Upon successful disambiguation, the system persists both the conversation state and previously retrieved contextual data in memory, enabling the generation of precisely targeted responses that align with the user’s clarified intent and are therefore more accurate, relevant, and complete.

Agentic response optimization

Agentic RAG introduces dynamic response optimization where AI agents actively evaluate and refine their responses. Unlike traditional systems that provide answers even when the context is insufficient, these agents continuously assess response quality and iteratively plan out new actions to improve information completeness. They can recognize when initial retrievals miss crucial information and autonomously initiate additional searches or alternative retrieval strategies. This means when discussing complex topics like compliance policies, the system captures all relevant updates, exceptions, and interdependencies while maintaining context across multiple turns of the conversation. The following diagram shows how Agentic RAG handles the conversation history across multiple turns of the conversation. The agent plans and reasons across the retrieval tool use and response generation process. Based on the initial retrieval, while taking into account the conversation state and history, the agent re-plans the process as needed to generate the most complete and accurate response for the user’s query.

Using the Agentic RAG feature

Getting started with Agentic RAG’s advanced capabilities in Amazon Q Business is straightforward and can immediately improve how your organization interacts with your enterprise data. To begin, in the Amazon Q Business web interface, you can switch on the Advanced Search toggle to enable Agentic RAG, as shown in the following image.

After advanced search is enabled, users can experience richer and more complete responses from Amazon Q Business. Agentic RAG particularly shines when handling complex business scenarios based on your enterprise data—imagine asking about cross-AWS Region performance comparisons, investigating policy implications across departments, or analyzing historical trends in project deliveries. The system excels at breaking down these complex queries into manageable search tasks while maintaining context throughout the conversation.

For the best experience, users should feel confident in asking detailed, multi-part questions. Unlike traditional search systems, Agentic RAG handles nuanced queries like

How have our metrics changed across the southeast and northeast regions in 2024?

The system will work through such questions methodically, showing its progress as it analyzes and breaks the query down into composite parts to fetch sufficient context and generate a complete and accurate response.

Conclusion

Agentic RAG represents a significant leap forward for Amazon Q Business, transforming how organizations use their enterprise data while maintaining the robust security and compliance that they expect with AWS services. Through its sophisticated query processing and contextual understanding, the system enables deeper, more nuanced interactions with enterprise data—from comparative and multi-step queries to interactive multi-turn chat experiences. All of this occurs within a secure framework that respects existing permissions and access controls, making sure that users receive only authorized information while maintaining the rich, contextual responses needed for meaningful insights.

By combining advanced retrieval capabilities with intelligent, conversation-aware interactions, Agentic RAG allows organizations to unlock the full potential of their data while maintaining the highest standards of data governance. The result is an improved chat experience and a more capable query answering engine that maximizes the value of your data assets.

Try out Amazon Q Business for your organization with your data and share your thoughts in the comments.


About the authors

Sanjit Misra is a technical product leader at Amazon Web Services, driving innovation on Amazon Q Business, Amazon’s generative AI product. He leads product development for core Agentic AI features that enhance accuracy and retrieval — including Agentic RAG, conversational disambiguation, tabular search, and long-context retrieval. With over 15 years of experience across product and engineering roles in data, analytics, and AI/ML, Sanjit combines deep technical expertise with a track record of delivering business outcomes. He is based in New York City.

Venky Nagapudi is a Senior Manager of Product Management for Amazon Q Business. His focus areas include RAG features, accuracy evaluation and enhancement, user identity management and user subscriptions.

Yi-An Lai is a Senior Applied Scientist with the Amazon Q Business team at Amazon Web Services in Seattle, WA. His expertise spans agentic information retrieval, conversational AI systems, LLM tool orchestration, and advanced natural language processing. With over a decade of experience in ML/AI, he has been enthusiastic about developing sophisticated AI solutions that bridge state-of-the-art research and practical enterprise applications.

Yumo Xu is an Applied Scientist at AWS, where he focuses on building helpful and responsible AI systems for enterprises. His primary research interests are centered on the foundational challenges of machine reasoning and agentic AI. Prior to AWS, Yumo received his PhD in Natural Language Processing from the University of Edinburgh.

Danilo Neves Ribeiro is an Applied Scientist on the Q Business team based in Santa Clara, CA. He is currently working on designing innovative solutions for information retrieval, reasoning, language model agents, and conversational experience for enterprise use cases within AWS. He holds a Ph.D. in Computer Science from Northwestern University (2023) and has over three years of experience working as an AI/ML scientist.

Kapil Badesara is a Senior Machine Learning Engineer on AWS Q Business, focusing on optimizing RAG systems for accuracy and efficiency. Kapil is based out of Seattle and has more than 10 years of building large scale AI/ML services.

Sunil Singh is an Engineering Manager on the Amazon Q Business team, where he leads the development of next-generation agentic AI solutions designed to enhance Retrieval-Augmented Generation (RAG) systems for greater accuracy and efficiency. Sunil is based out of Seattle and has more than 10 years of experience in architecting secure, scalable AI/ML services for enterprise-grade applications.

Empowering students with disabilities: University Startups’ generative AI solution for personalized student pathways

This post was co-authored with Laura Lee Williams and John Jabara from University Startups.

University Startups, headquartered in Bethesda, MD, was founded in 2020 to empower high school students to expand their education beyond a traditional curriculum. University Startups is focused on special education and related services in school districts throughout the US.

After students graduate from high school, they don’t often know what they want to pursue for education and a career. University Startups and AWS have designed a unique program that helps students with disabilities solve that challenge. The program creates a personalized transition plan for each student and can be used in schools across the country. The program can provide specific guidance for each student to help them create their own path for success after high school. Specifically, University Startups uses Amazon Bedrock to create this customized experience without increasing workload for educators. The impact of this initiative has been successfully demonstrated during a pilot test program, where dozens of students and teachers used University Startups’ tool to explore different career paths and suggestions on how to pursue their goals.

In this post, we explain how University Startups uses generative AI technology on AWS to enable students to design a specific plan for their future either in education or the work force.

Challenges in special education

Special education includes services and programs to meet the individual needs of a student with a disability. There are more than 7.5 million K-12 students with disabilities in the US, and this number is growing. Students with disabilities who require special education services have Individualized Education Programs (IEPs). The IEP is a legally mandated document developed by a student’s parents, teachers, and specialists to make sure the student’s educational needs are met. This includes access to the general education curriculum together with general education peers, and as close to their home as possible.

An important component of the IEP document is a transition plan. A transition plan is a document that articulates objectives and development goals meant to prepare students with disabilities for adult life. This can include planning for postsecondary education and career goals, gaining work experience, independent living, or whichever option is appropriate for the student. This plan is highly individualized to a student’s interests, preferences, skills, and needs. According to federal law, each student must have a transition plan in place by age 16.

However, schools in under-resourced school districts often have limited capacity to provide ample one-on-one attention from teachers and counselors to create an effective transition plan for every student. Additionally, because these transition plans are audited by federal and state governments, these school districts can face legal and financial consequences if plans are not in compliance. These challenges highlight the need for a personalized and robust solution to empower students with disabilities and their support team of parents, teachers, and specialists. With this context in mind, let’s explore how University Startups and AWS collaborated to develop a solution using Amazon Bedrock to enhance the transition planning process.

Solution overview

To address this challenge, University Startups created Trinity, a transition planning AI assistant to support students with disabilities in uncovering their interests to create an effective and personalized transition plan. The following diagram shows how you can use Amazon Bedrock Agents in a workflow, alongside other tools, to provide context when gathering information to create a transition plan.

At a high level, the solution uses Amazon Bedrock Agents to orchestrate comprehensive workflows that engage students by enabling interactive, personalized AI assistant experiences. The agents use AWS Lambda functions to securely call external APIs, such as job boards or university databases, and retrieve up-to-date opportunities. Additionally, Amazon Bedrock Flows can automate the generation of the individualized documents by collecting and synthesizing information from these interactions into a personalized draft transition plan. Throughout the process, features like Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails are designed to improve the accuracy and reliability of the information provided, and to align it with safety and compliance standards, creating a secure and effective support system for students, parents, and educators.
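
While the production experience is orchestrated inside University Startups' application, the basic interaction pattern with an agent built on Amazon Bedrock Agents looks like the following sketch; the agent ID, alias ID, and session handling are placeholder assumptions for illustration.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def ask_agent(question: str, session_id: str) -> str:
    """Send a student's message to the transition-planning agent and collect the streamed reply."""
    response = agent_runtime.invoke_agent(
        agentId="AGENT_ID_PLACEHOLDER",       # assumed: the agent's ID
        agentAliasId="ALIAS_ID_PLACEHOLDER",  # assumed: the deployed alias
        sessionId=session_id,                 # preserves multi-turn context for the student
        inputText=question,
    )

    # The agent returns an event stream; concatenate the completion chunks.
    reply = ""
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            reply += chunk["bytes"].decode("utf-8")
    return reply

print(ask_agent("I'm interested in criminology. What could I do after high school?", "student-session-1"))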

To begin the process, students engage with an agent designed with Amazon Bedrock Agents to uncover their preferences and interests, while under the guidance of their educators. The agent then proposes tailored suggestions on pathways the students can explore for their transition plans. These suggestions are informed by context provided to the agent about requirements for IEP documents and common transition plan structures. Based on this output, the student’s parents and teachers can efficiently create a transition plan with the student or continue engaging with the agent to dive deeper. In the following section, we take a closer look at the tools this agent uses.

Tools and features used by Amazon Bedrock Agents

The effectiveness of Amazon Bedrock Agents lies in its ability to engage the students in a productive manner and provide actionable results for parents and educators. This is accomplished by using features within Amazon Bedrock, such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Guardrails, alongside other AWS services, including Lambda, to create tools for the agent.

Maintaining student safety

A critical component of this application is designing with student security in mind, especially when handling personally identifiable information (PII) and protected health information (PHI). Without proper protections, there can be risk of data breach and violating compliance regulations.

We use Amazon Bedrock Guardrails to implement safeguards within the generative AI system. If a user enters PII or PHI data, Amazon Bedrock Guardrails will detect this and automatically block the request, and will return a preconfigured message, such as “Sorry, your query violates our usage policy.” We also use Amazon Bedrock Guardrails to mask PII in model responses. If sensitive information is detected in the model response, the guardrail masks it with an identifier, such as {NAME-1}.
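
The following is a minimal sketch of how such a check might be applied directly with the ApplyGuardrail API in boto3; the guardrail ID and version are placeholders, and the blocked message itself is whatever is configured on the guardrail.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def check_user_input(text: str) -> str:
    """Evaluate a student's message against the configured guardrail before it reaches the agent."""
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier="GUARDRAIL_ID_PLACEHOLDER",  # assumed: the deployed guardrail ID
        guardrailVersion="1",
        source="INPUT",
        content=[{"text": {"text": text}}],
    )

    # If the guardrail intervenes (for example, PII or a denied topic is detected),
    # return the preconfigured blocked message instead of forwarding the input.
    if response["action"] == "GUARDRAIL_INTERVENED":
        return response["outputs"][0]["text"]
    return text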

Amazon Bedrock Guardrails can also help make sure this resource is being used only for its intended purposes. Technology can often be a distraction to students, but Amazon Bedrock Guardrails can be configured with a set of denied topics and content filters so students can stay on track.

Using relevant context

Transition plans require specific information about a student by law. However, there is no singular format required for the IEP document across the US. Each state can have varying standardized formats, and in some states, individual school systems design their own forms.

Therefore, the resource needs to be able to gather the required information and present it in the appropriate format, according to the school system it is being used in. We use Amazon Bedrock Knowledge Bases to provide the agent with necessary context so it can deliver accurate, customized responses for each use case.

Use case

To understand the impact of this solution and show Amazon Bedrock Agents in action, let’s view an example of Noah, a student who used University Startups’ resource in a pilot test program.

Noah was preparing to create his own transition plan for his IEP document. He attends a school district that uses University Startups’ resource, so he creates an account to get personalized outputs for his IEP. The system agent began to gather requirements:

  1. The agent began by asking for Noah’s current performance and any special requirements he might need. Then, the agent asked Noah to share his interests and skills, and what he might be interested in pursuing after high school. Noah shared that he is interested in criminology.
  2. The agent then used Amazon Bedrock Guardrails to redact sensitive information that might have been entered, and used its tools to provide recommendations on how Noah can pursue a career in criminology, including postsecondary education and internships. These recommendations were tailored for Noah, based on his needs and location preferences.
  3. Noah then chose to explore another pathway option and was offered steps to discover other methods to pursue a career in criminology.
  4. Finally, Noah was satisfied with the results and recommendations he received, and this information was formatted into a draft transition plan. This is then provided to the team of educators responsible for developing and finalizing his IEP.

This discovery process helped Noah fully explore his interests and options, while feeling confident that these recommendations were personalized to his needs. It also reduced the time required by educators to prepare for IEP meetings and submission.

Demonstrated impact

This solution developed by University Startups provides a safe and efficient way for educators and students to develop their transition planning IEP documents. By automating the planning phase, students can take their time to explore their interests and the different paths available to them after high school. The tool can also support a dialogue between teacher and student in cases where special assistance is required. A teacher who participated in the pilot program said, “I think Trinity is an awesome tool. I would love to see parents and their students use it together, especially before the IEP meeting. I also think it is a great conversation starter to be used between the student and their case worker/counselor.” Even in underserved school districts, students can receive personalized attention tailored to their needs that captures their perspective in the transition planning process.

Transition planning preparation can be time-consuming for educators and requires highly specialized knowledge. However, using the Trinity resource has produced results in approximately 10 minutes and has the potential to reduce the average time spent by educators on transition planning. This makes it possible for them to scale up their efforts and focus their time on actionable results for every student.

Additionally, this tool is able to engage students to thoughtfully share their interests and skills. Pilot program participant Noah said, “Using this tool was easy and helpful; it quickly showed me a path to get to the job I want to do in the future.” Most importantly, although this solution automates the discovery process for developing a transition plan, it guarantees hands-on educator support during each step and whenever a student requests assistance. Throughout the pilot program, educators have preferred to provide one-on-one support by guiding students through the Trinity tool. This makes sure that the students feel acknowledged throughout the process, and that educators can adjust and validate the transition plan recommendations provided.

Conclusion

Through this solution, University Startups has demonstrated how students, including students with disabilities, can receive personalized services despite a lack of resources in underserved school districts. This solution can empower students to pursue their interests and develop actionable plans to build skills and achieve success. Incorporating generative AI into educational resources demonstrates the effective capability of AI in learning environments. Educational tools powered with Amazon Bedrock can handle diverse student needs and provide instant feedback, which streamlines the creation of personalized learning pathways by educators. This makes high-quality, adaptive instruction accessible to every student.

John Jabara, co-founder of University Startups, shares,

“Amazon Bedrock allowed us to create a tool that can capture the student’s voice during this important transition planning process.”

As of the publication date of this post, University Startups is active in 15 states. As University Startups continues to scale, they plan to expand their reach across the country. The potential of tools like Amazon Bedrock Agents and Amazon Bedrock Guardrails in democratizing education and career resources for students is significant, and Amazon Bedrock is at the forefront of this initiative.

To learn more, refer to Amazon Bedrock Agents, Amazon Bedrock Guardrails, and Amazon Bedrock Security and Privacy. To get a hands-on introduction to creating your own agent with Amazon Bedrock Agents, check out the following GitHub repository.

Visit https://www.university-startups.com for more information on programming for special education.


About the authors

Sadia Ahmed is a Solutions Architect at AWS supporting startup companies to transform big ideas into scalable solutions, with a focus on Generative AI. She is also passionate about educating the next generation of tech innovators. Sadia graduated from the University of Illinois Urbana-Champaign with a Master’s in Computer Science, and enjoys painting in her free time.

DeJonte July is a Solutions Architect at Amazon Web Services supporting early startup customers and helping make their entrepreneurial dreams come true. His primary focus lies on ethical use of Generative AI and technology communication. Outside of work he enjoys filmmaking and film photography.

Varad Ram is an Enterprise Solutions Architect supporting customers in Advertisement and Marketing vertical at Amazon Web Services. He collaborates closely with customers to design scalable and operationally efficient solutions. Currently, his primary focus is on analytics and helping customers maximize their return on investment in Generative Artificial Intelligence. In his free time, Varad enjoys biking with his children and playing tennis.

Citations with Amazon Nova understanding models

Large language models (LLMs) have become increasingly prevalent across both consumer and enterprise applications. However, their tendency to “hallucinate” information and deliver incorrect answers with seeming confidence has created a trust problem. Think of LLMs as you would a human expert: we typically trust experts who can back up their claims with references and walk us through their reasoning process. The same principle applies to LLMs – they become more trustworthy when they can demonstrate their thought process and cite reliable sources for their information. Fortunately, with proper prompting, LLMs can be instructed to provide these citations, making their outputs more verifiable and dependable.

In this post, we demonstrate how to prompt Amazon Nova understanding models to cite sources in their responses. We also walk through how to evaluate the responses and citations for accuracy.

What are citations and why are they useful? 

Citations are references to sources that indicate where specific information, ideas, or concepts in a work originated. Citations play a crucial role in addressing the following issues, enhancing the credibility, usability, and ethical grounding of LLM-based applications.

  1. Ensuring factual accuracy: LLMs are prone to “hallucinations,” where they generate plausible but incorrect information. Citations allow users to verify claims by tracing them back to reliable sources, improving factual correctness and reducing misinformation risks.
  2. Building trust and transparency: Citations foster trust in AI-generated content so users can cross-check information and understand its origins. This transparency is vital for applications in research, healthcare, law, and education.
  3. Supporting ethical practices: Citing sources ensures proper attribution to original authors, respecting intellectual property rights and scholarly contributions. It prevents plagiarism and promotes ethical AI use.
  4. Enhancing usability: Citations improve user experience by providing a pathway to explore related materials. Features like inline citations or bibliographies help users find relevant sources easily.
  5. Addressing limitations of LLMs: LLMs often fabricate references due to their inability to access real-time data or remember training sources accurately. Retrieval Augmented Generation (RAG) systems and citation tools mitigate this issue by grounding responses in external data.
  6. Professional and academic standards: In academic contexts, citations are indispensable for replicating research methods and validating findings. AI-generated outputs must adhere to these standards to maintain scholarly integrity.

Citations with Amazon Nova models

Amazon Nova, launched in Dec 2024, is a new generation of foundation models that deliver frontier intelligence and industry leading price performance, available on Amazon Bedrock. Amazon Nova models include four understanding models (Nova Micro, Nova Lite, Nova Pro and Nova Premier), two creative content generation models (Nova Canvas and Nova Reel), and one speech-to-speech model (Nova Sonic). Through seamless integration with Amazon Bedrock, developers can build and scale generative AI applications with Amazon Nova foundation models.

Citations for the Amazon Nova understanding models can be achieved by crafting prompts where we instruct the model to cite its sources and indicate the response format. To illustrate this, we’ve picked an example where we ask questions to Nova Pro about Amazon shareholder letters. We will include the shareholder letter in the prompt as context and ask Nova Pro to answer questions and include citations from the letter(s).

Here’s an example prompt that we constructed for Amazon Nova Pro following best practices for prompt engineering for Amazon Nova.

Note the output format that we included in the prompt to distinguish the actual answers from the citations.

System prompt 

##Instruction
You are a QA agent. You answer questions based on the context provided. 
You will answer the question and also include exact excerpts from the context and 
quote them as quotes. If you are unable to answer the question based on the context, 
your response should be “I’m sorry I cannot answer based on the context”.

##Examples: 
Question: What factors contributed to the growth of Amazon 
Quotes: [1] Our
vision for Kindle is every book ever printed in any language, all available in less 
than 60 seconds.
Publishers—including all the major publishers—have embraced Kindle, and we’re thankful 
for that. From a
publisher’s point of view, there are a lot of advantages to Kindle. Books never go out 
of print, and they never go
out of stock. Nor is there ever waste from over-printing. Most important, Kindle makes 
it more convenient for
readers to buy more books. Anytime you make something simpler and lower friction, you 
get more of it.
Answer: Innovation with Kindle and publisher collaboration contributed to the growth of 
Amazon [1] 

##Output Format 
Quotes: 
[1] ....
[2] ...
Answer:

User Prompt

Question: What contributed to net sales?
Context: <Amazon 2009 shareholder letter... document not shown for brevity> 

Here’s the response from Nova Pro for the above prompt:

Quotes: 
[1] "Net sales increased 28% year-over-year to $24.51 billion in 2009." 
[2] "More customers are taking advantage of Amazon Prime, with worldwide memberships 
up significantly over last year." 
[3] "We added 21 new product categories around the world in 2009..." 
[4] "It was a busy year for our shoes business. In November we acquired Zappos..." 
[5] "Sales of products by third party sellers on our websites represented 30% of unit
 sales in 2009." 
[6] "Amazon Web Services continued its rapid pace of innovation, launching many new 
services and features..." 
[7] "The U.S. Kindle Store now has more than 460,000 books..."

Answer: Several factors contributed to the increase in net sales for Amazon in 2009. 
These include the significant growth in Amazon Prime memberships [2], the addition 
of 21 new product categories globally [3], the acquisition of Zappos which enhanced 
the shoes business [4], and the strong performance of third-party sellers on Amazon's 
platforms [5]. Additionally, the innovation and expansion of Amazon Web Services (AWS) 
attracted more customers [6], and the growth of the Kindle Store with a larger selection 
of books and content [7] also played a role. These combined efforts resulted in a 28% 
year-over-year increase in net sales to $24.51 billion [1].

As you can see, Nova Pro follows our instructions and provides the answer along with the citations. We’ve verified that the quotes are indeed present in the 2009 shareholder letter.
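For reference, here’s a minimal sketch of how this prompt can be sent to Nova Pro through the Amazon Bedrock Converse API with boto3; the model ID, Region, and inference parameters are assumptions that you should adjust for your environment:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

system_prompt = "<the QA system prompt shown above>"
user_prompt = "Question: What contributed to net sales?\nContext: <Amazon 2009 shareholder letter text>"

response = bedrock_runtime.converse(
    modelId="amazon.nova-pro-v1:0",  # assumed Nova Pro model ID; verify for your account and Region
    system=[{"text": system_prompt}],
    messages=[{"role": "user", "content": [{"text": user_prompt}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])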

Here’s another user prompt (with the same system prompt as above) along with the model’s response.

User Prompt:

Question: How does Bezos describe Amazon's approach to failure, and how does he 
tie it to innovation?
Context: <Amazon 2006 shareholder letter> 

Model response

Quotes: 

[1] "We won’t always choose right, and we won’t always succeed. But we will be 
choosy, and we will work hard and patiently."

[2] "In my view, Amazon’s culture is unusually supportive of small businesses 
with big potential, and I believe that’s a source of competitive advantage."

[3] "The culture demands that these new businesses be high potential and that 
they be innovative and differentiated, but it does not demand that they be large 
on the day that they are born."

Answer: 

Bezos describes Amazon's approach to failure as one that is inherently tied to 
innovation and experimentation. He acknowledges that not every venture will be 
successful, stating, "We won’t always choose right, and we won’t always succeed" 
[1]. However, he emphasizes the importance of being selective and persistent, 
working hard and patiently despite potential failures. 

Bezos ties this approach to innovation by highlighting Amazon's supportive culture 
towards small, high-potential businesses. He notes, "In my view, Amazon’s culture 
is unusually supportive of small businesses with big potential, and I believe that’s 
a source of competitive advantage" [2]. This culture allows for the nurturing of 
innovative ideas, even if they start small. He further explains that the company’s 
culture demands innovation and differentiation in new businesses, but does not 
require them to be large from the outset [3]. This patient and experimental approach 
fosters an environment where failure is seen as a part of the innovation process, 
rather than a deterrent.

Evaluating citations 

Although citations are valuable, it’s important to verify that the model is following our instructions, including citations verbatim from the context rather than fabricating them.

To evaluate the citations at scale, we used another LLM to judge the responses from Amazon Nova Pro. We used the LLM-as-a-judge technique in Amazon Bedrock evaluations and evaluated 10 different prompts. LLM-as-a-judge on Amazon Bedrock Model Evaluation provides a comprehensive, end-to-end solution for assessing and optimizing AI model performance. This automated process uses the power of LLMs to evaluate responses across multiple metric categories (such as correctness, completeness, harmfulness, helpfulness, and more), offering insights that can significantly improve your AI applications.

We prepared the input dataset for evaluation. The input dataset is a JSONL file containing the prompts we want to evaluate. Each line in the file must include key-value pairs. Here are the required and optional fields for the input dataset:

  • prompt (required): This key indicates the input for various tasks. It can be used for general text generation where the model needs to provide a response, question-answering tasks where the model must answer a specific question, text summarization tasks where the model needs to summarize a given text, or classification tasks where the model must categorize the provided text.
  • referenceResponse (optional – used for specific metrics with ground truth): This key contains the ground truth or correct response. If provided, it serves as the reference point against which the model’s responses are evaluated.
  • category (optional): This key is used to generate evaluation scores reported by category, helping organize and segment evaluation results for better analysis.

Here’s an example JSONL file for evaluating our prompts (full file not shown for brevity).

{
	"prompt": "##Model Instructions\nYou are a QA agent. You answer questions based on the 
context provided. You will answer the question and also include exact excerpts from the 
context and quote them as quotes.\n ##Examples: \nQuestion: What factors contributed to 
the growth of Amazon\nQuotes: [1] Our vision for Kindle is every book ever printed in any 
language, all available in less than 60 seconds. Publishers—including all the major 
publishers—have embraced Kindle, and we’re thankful for that. From a publisher’s point of 
view, there are a lot of advantages to Kindle. Books never go out of print, and they never 
go out of stock. Nor is there ever waste from over-printing. Most important, Kindle makes 
it more convenient for readers to buy more books. Anytime you make something simpler and 
lower friction, you get more of it.\n Answer: Innovation with Kindle and publisher 
collaboration contributed to the growth of Amazon [1]\n\n ##Output Format\nQuotes: 
[1] ....\n[2] ...\n\n Answer: \n\nQuestion: How does Bezos describe Amazon's approach to 
failure, and how does he tie it to innovation?\n Context: <Amazon shareholder letter… 
not included here for brevity>"
}
{
	"prompt": "..."
}
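To build such a file programmatically and avoid manual escaping, a small helper like the following can be used; the file name, prompt placeholders, and category value are illustrative:

import json

records = [
    {"prompt": "<system instructions + question + context for prompt 1>", "category": "shareholder-letters"},
    {"prompt": "<system instructions + question + context for prompt 2>", "category": "shareholder-letters"},
]

# Write one JSON object per line; json.dumps handles newline and quote escaping
with open("eval_prompts.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")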

We then started a model evaluation job using the Amazon Bedrock API, with Anthropic’s Claude 3.5 Sonnet v1 as the evaluator (judge) model. We have open sourced our code in the AWS Samples GitHub repository.
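The following is a minimal sketch of starting an LLM-as-a-judge evaluation job with boto3; the job name, S3 locations, IAM role, and metric identifiers are illustrative, so check the Amazon Bedrock documentation for the exact values supported in your Region:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="nova-pro-citation-eval",                           # illustrative job name
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",   # illustrative IAM role
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "citation-prompts",
                    "datasetLocation": {"s3Uri": "s3://my-bucket/eval_prompts.jsonl"},
                },
                # Built-in judge metrics; identifiers shown are illustrative
                "metricNames": ["Builtin.Correctness", "Builtin.Faithfulness", "Builtin.Coherence"],
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "amazon.nova-pro-v1:0"}}]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval-results/"},
)
print(response["jobArn"])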

We evaluated our prompts and responses for the following built-in metrics:

  1. Helpfulness
  2. Correctness
  3. Professional style and tone
  4. Faithfulness
  5. Completeness
  6. Coherence
  7. Following instructions
  8. Relevance
  9. Readability
  10. Harmfulness

Here’s the result summary of our evaluation. As you can see, Nova Pro scored 0.78 on both coherence and faithfulness and 0.67 on correctness. These scores indicate that Nova Pro’s responses were holistic, useful, complete, and accurate while remaining coherent, as evaluated by Claude 3.5 Sonnet.

Conclusion

In this post, we walked through how to prompt Amazon Nova understanding models to cite sources from the context through simple instructions. Amazon Nova’s capability to include citations in its responses demonstrates a practical approach to implementing this feature, showcasing how simple instructions can lead to more reliable and trustworthy AI interactions. The evaluation of these citations, using an LLM-as-a-judge technique, further underscores the importance of assessing the quality and faithfulness of AI-generated responses. To learn more about prompting for Amazon Nova models, visit the Amazon Nova prompt library. You can learn more about Amazon Bedrock evaluations on the AWS website.


About the authors

Sunita Koppar is a Senior Specialist Solutions Architect in Generative AI and Machine Learning at AWS, where she partners with customers across diverse industries to design solutions, build proof-of-concepts, and drive measurable business outcomes. Beyond her professional role, she is deeply passionate about learning and teaching Sanskrit, actively engaging with student communities to help them upskill and grow.

Veda Raman is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Veda works with customers to help them architect efficient, secure, and scalable machine learning applications. Veda specializes in generative AI services like Amazon Bedrock and Amazon SageMaker.

Read More

Securely launch and scale your agents and tools on Amazon Bedrock AgentCore Runtime

Securely launch and scale your agents and tools on Amazon Bedrock AgentCore Runtime

Organizations are increasingly excited about the potential of AI agents, but many find themselves stuck in what we call “proof of concept purgatory”—where promising agent prototypes struggle to make the leap to production deployment. In our conversations with customers, we’ve heard consistent challenges that block the path from experimentation to enterprise-grade deployment:

“Our developers want to use different frameworks and models for different use cases—forcing standardization slows innovation.”

“The stochastic nature of agents makes security more complex than traditional applications—we need stronger isolation between user sessions.”

“We struggle with identity and access control for agents that need to act on behalf of users or access sensitive systems.”

“Our agents need to handle various input types—text, images, documents—often with large payloads that exceed typical serverless compute limits.”

“We can’t predict the compute resources each agent will need, and costs can spiral when overprovisioning for peak demand.”

“Managing infrastructure for agents that may be a mix of short and long-running requires specialized expertise that diverts our focus from building actual agent functionality.”

Amazon Bedrock AgentCore Runtime addresses these challenges with a secure, serverless hosting environment specifically designed for AI agents and tools. Whereas traditional application hosting systems weren’t built for the unique characteristics of agent workloads—variable execution times, stateful interactions, and complex security requirements—AgentCore Runtime was purpose-built for these needs.

The service alleviates the infrastructure complexity that has kept promising agent prototypes from reaching production. It handles the undifferentiated heavy lifting of container orchestration, session management, scalability, and security isolation, helping developers focus on creating intelligent experiences rather than managing infrastructure. In this post, we discuss how to accomplish the following:

  • Use different agent frameworks and different models
  • Deploy, scale, and stream agent responses in four lines of code
  • Secure agent execution with session isolation and embedded identity
  • Use state persistence for stateful agents along with Amazon Bedrock AgentCore Memory
  • Process different modalities with large payloads
  • Operate asynchronous multi-hour agents
  • Pay only for used resources

Use different agent frameworks and models

One advantage of AgentCore Runtime is its framework-agnostic and model-agnostic approach to agent deployment. Whether your team has invested in LangGraph for complex reasoning workflows, adopted CrewAI for multi-agent collaboration, or built custom agents using Strands, AgentCore Runtime can use your existing code base without requiring architectural changes or any framework migrations. Refer to these samples on GitHub for examples.

With AgentCore Runtime, you can integrate different large language models (LLMs) from your preferred provider, such as Amazon Bedrock managed models, Anthropic’s Claude, OpenAI’s API, or Google’s Gemini. This makes sure your agent implementations remain portable and adaptable as the LLM landscape continues to evolve while helping you pick the right model for your use case to optimize for performance, cost, or other business requirements. This gives you and your team the flexibility to choose your favorite or most useful framework or model using a unified deployment pattern.

Let’s examine how AgentCore Runtime supports two different frameworks and model providers:

  • LangGraph agent using Anthropic’s Claude Sonnet on Amazon Bedrock
  • Strands agent using GPT-4o mini through the OpenAI API

For the full code examples, refer to langgraph_agent_web_search.py and strands_openai_identity.py on GitHub.

Both examples above show how you can use the AgentCore SDK, regardless of the underlying framework or model choice. After you have modified your code as shown in these examples, you can deploy your agent with or without the AgentCore Runtime starter toolkit, discussed in the next section.

Note that the additions specific to the AgentCore SDK in the example code above are minimal. Let’s dive deeper into this in the next section.

Deploy, scale, and stream agent responses with four lines of code

Let’s examine the two examples above. In both examples, we only add four new lines of code (see the consolidated sketch after this list):

  • Import – from bedrock_agentcore.runtime import BedrockAgentCoreApp
  • Initialize – app = BedrockAgentCoreApp()
  • Decorate – @app.entrypoint
  • Run – app.run()
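The following is a minimal sketch of how these four additions fit together in an agent entrypoint file; the Strands agent construction and the payload key are illustrative and will vary with your framework and use case:

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent  # any framework works here; Strands is used only as an example

app = BedrockAgentCoreApp()   # initialize the AgentCore app
agent = Agent()               # your existing agent, unchanged

@app.entrypoint               # decorate the function AgentCore Runtime should invoke
def invoke(payload, context):
    user_message = payload.get("prompt", "Hello")
    result = agent(user_message)
    return {"result": str(result)}

app.run()                     # start the runtime server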

Once you have made these changes, the most straightforward way to get started with AgentCore is to use the AgentCore starter toolkit. We suggest using uv to create and manage local development environments and package requirements in Python. To get started, install the starter toolkit as follows:

uv pip install bedrock-agentcore-starter-toolkit

Run the configure, launch, and invoke commands to deploy and use your agent. The following video provides a quick walkthrough.
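A typical flow with the starter toolkit CLI looks like the following; the entrypoint file name and prompt are illustrative, and the exact flags may vary by toolkit version:

agentcore configure --entrypoint my_agent.py
agentcore launch
agentcore invoke '{"prompt": "Hello"}'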

For your chat-style applications, AgentCore Runtime supports streaming out of the box. For example, in Strands, locate the following synchronous code:

result = agent(user_message)

Change the preceding code to the following and deploy:

agent_stream = agent.stream_async(user_message)
async for event in agent_stream:
    yield event  # you can process/filter these events before yielding

For more examples on streaming agents, refer to the following GitHub repo. The following is an example Streamlit application streaming back responses from an AgentCore Runtime agent.

Secure agent execution with session isolation and embedded identity

AgentCore Runtime fundamentally changes how we think about serverless compute for agentic applications by introducing persistent execution environments that can maintain an agent’s state across multiple invocations. Rather than the typical serverless model where functions spin up, execute, and immediately terminate, AgentCore Runtime provisions dedicated microVMs that can persist for up to 8 hours. This enables sophisticated multi-step agentic workflows where each subsequent call builds upon the accumulated context and state from previous interactions within the same session. The practical implication of this is that you can now implement complex, stateful logic patterns that would previously require external state management solutions or cumbersome workarounds to maintain context between function executions. This doesn’t obviate the need for external state management (see the following section on using AgentCore Runtime with AgentCore Memory), but it does address a common need: maintaining local state and files temporarily within a session context.

Understanding the session lifecycle

The session lifecycle operates through three distinct states that govern resource allocation and availability (see the following diagram for a high-level view of this session lifecycle). When you first invoke a runtime with a unique session identifier, AgentCore provisions a dedicated execution environment and transitions it to an Active state during request processing or when background tasks are running.

The system automatically tracks synchronous invocation activity, while background processes can signal their status through HealthyBusy responses to health check pings from the service (see the later section on asynchronous workloads). Sessions transition to Idle when not processing requests but remain provisioned and ready for immediate use, reducing cold start penalties for subsequent invocations.

Finally, sessions reach a Terminated state when they exceed a 15-minute inactivity threshold, hit the 8-hour maximum duration limit, or fail health checks. Understanding these state transitions is crucial for designing resilient workflows that gracefully handle session boundaries and resource cleanup. For more details on session lifecycle-related quotas, refer to AgentCore Runtime Service Quotas.

The ephemeral nature of AgentCore sessions means that runtime state exists solely within the boundaries of the active session lifecycle. The data your agent accumulates during execution—such as conversation context, user preference mappings, intermediate computational results, or transient workflow state—remains accessible only while the session persists and is completely purged when the session terminates.

For persistent data requirements that extend beyond individual session boundaries, AgentCore Memory provides the architectural solution for durable state management. This purpose-built service is specifically engineered for agent workloads and offers both short-term and long-term memory abstractions that can maintain user conversation histories, learned behavioral patterns, and critical insights across session boundaries. See documentation here for more information on getting started with AgentCore Memory.

True session isolation

Session isolation in AI agent workloads addresses fundamental security and operational challenges that don’t exist in traditional application architectures. Unlike stateless functions that process individual requests independently, AI agents maintain complex contextual state throughout extended reasoning processes, handle privileged operations with sensitive credentials and files, and exhibit non-deterministic behavior patterns. This creates unique risks where one user’s agent could potentially access another’s data—session-specific information could be used across multiple sessions, credentials could leak between sessions, or unpredictable agent behavior could compromise system boundaries. Traditional containerization or process isolation isn’t sufficient because agents need persistent state management while maintaining absolute separation between users.

Let’s explore a case study: In May 2025, Asana deployed a new MCP server to power agentic AI features (integrations with ChatGPT, Anthropic’s Claude, Microsoft Copilot) across its enterprise software as a service (SaaS) offering. Due to a logic flaw in the MCP server’s tenant isolation, which relied solely on user identity rather than agent identity, requests from one organization’s user could inadvertently retrieve cached results containing another organization’s data. This cross-tenant contamination wasn’t triggered by a targeted exploit but was an intrinsic security fault in handling context and cache separation across agentic AI-driven sessions.

The exposure silently persisted for 34 days, impacting roughly 1,000 organizations, including major enterprises. After it was discovered, Asana halted the service, remediated the bug, notified affected customers, and released a fix.

AgentCore Runtime solves these challenges through complete microVM isolation that goes beyond simple resource separation. Each session receives its own dedicated virtual machine with isolated compute, memory, and file system resources, making sure agent state, tool operations, and credential access remain completely compartmentalized. When a session ends, the entire microVM is terminated and memory sanitized, minimizing the risk of data persistence or cross-contamination. This architecture provides the deterministic security boundaries that enterprise deployments require, even when dealing with the inherently probabilistic and non-deterministic nature of AI agents, while still enabling the stateful, personalized experiences that make agents valuable. Although other offerings might provide sandboxed kernels, with the ability to manage your own session state, persistence, and isolation, this should not be treated as a strict security boundary. AgentCore Runtime provides consistent, deterministic isolation boundaries regardless of agent execution patterns, delivering the predictable security properties required for enterprise deployments. The following diagram shows how two separate sessions run in isolated microVM kernels.

AgentCore Runtime embedded identity

Traditional agent deployments often struggle with identity and access management, particularly when agents need to act on behalf of users or access external services securely. The challenge becomes even more complex in multi-tenant environments—for example, where you need to make sure Agent A accessing Google Drive on behalf of User 1 can never accidentally retrieve data belonging to User 2.

AgentCore Runtime addresses these challenges through its embedded identity system that seamlessly integrates authentication and authorization into the agent execution environment. First, each runtime is associated with a unique workload identity (you can treat this as a unique agent identity). The service supports two primary authentication mechanisms for agents using this unique agent identity: IAM SigV4 authentication for agents operating within AWS security boundaries, and OAuth-based (JWT bearer token) authentication for integration with existing enterprise identity providers like Amazon Cognito, Okta, or Microsoft Entra ID.

When deploying an agent with AWS Identity and Access Management (IAM) authentication, users don’t have to incorporate other Amazon Bedrock AgentCore Identity specific settings or setup—simply configure with IAM authorization, launch, and invoke with the right user credentials.

When using JWT authentication, you configure the authorizer during the CreateAgentRuntime operation, specifying your identity provider (IdP)-specific discovery URL and allowed clients. Your existing agent code requires no modification—you simply add the authorizer configuration to your runtime deployment. When a calling entity or user invokes your agent, they pass their IdP-specific access token as a bearer token in the Authorization header. AgentCore Runtime uses AgentCore Identity to automatically validate this token against your configured authorizer and rejects unauthorized requests. The following diagram shows the flow of information between AgentCore runtime, your IdP, AgentCore Identity, other AgentCore services, other AWS services (in orange), and other external APIs or resources (in purple).

Behind the scenes, AgentCore Runtime automatically exchanges validated user tokens for workload access tokens (through the bedrock-agentcore:GetWorkloadAccessTokenForJWT API). This provides secure outbound access to external services through the AgentCore credential provider system, where tokens are cached using the combination of agent workload identity and user ID as the binding key. This cryptographic binding makes sure, for example, User 1’s Google token can never be accessed when processing requests for User 2, regardless of application logic errors. Note that in the preceding diagram, connecting to AWS resources can be achieved simply by editing the AgentCore Runtime execution role, but connections to Amazon Bedrock AgentCore Gateway or to another runtime will require reauthorization with a new access token.

The most straightforward way to configure your agent with OAuth-based inbound access is to use the AgentCore starter toolkit:

  1. With the AWS Command Line Interface (AWS CLI), follow the prompts to interactively enter your OAuth discovery URL and allowed Client IDs (comma-separated).

  2. With Python, use the following code:

from bedrock_agentcore_starter_toolkit import Runtime
from boto3.session import Session

boto_session = Session()
region = boto_session.region_name

discovery_url = '<your-cognito-user-pool-discovery-url>'
client_id = '<your-cognito-app-client-id>'
agent_name = '<your-agent-name>'

agentcore_runtime = Runtime()
response = agentcore_runtime.configure(
    entrypoint="strands_openai.py",
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=region,
    agent_name=agent_name,
    authorizer_configuration={
        "customJWTAuthorizer": {
            "discoveryUrl": discovery_url,
            "allowedClients": [client_id]
        }
    }
)
  3. For outbound access (for example, if your agent uses OpenAI APIs), first set up your keys using the API or the Amazon Bedrock console, as shown in the following screenshot.

  4. Then access your keys from within your AgentCore Runtime agent code:

import os

from bedrock_agentcore.identity.auth import requires_api_key

@requires_api_key(
    provider_name="openai-apikey-provider"  # replace with your own credential provider name
)
async def need_api_key(*, api_key: str):
    print(f'received api key for async func: {api_key}')
    os.environ["OPENAI_API_KEY"] = api_key

For more information on AgentCore Identity, refer to Authenticate and authorize with Inbound Auth and Outbound Auth and Hosting AI Agents on AgentCore Runtime.

Use AgentCore Runtime state persistence with AgentCore Memory

AgentCore Runtime provides ephemeral, session-specific state management that maintains context during active conversations but doesn’t persist beyond the session lifecycle. Each user session preserves conversational state, objects in memory, and local temporary files within isolated execution environments. For short-lived agents, you can use the state persistence offered by AgentCore Runtime without needing to save this information externally. However, at the end of the session lifecycle, the ephemeral state is permanently destroyed, making this approach suitable only for interactions that don’t require knowledge retention across separate conversations.

AgentCore Memory addresses this challenge by providing persistent storage that survives beyond individual sessions. Short-term memory captures raw interactions as events using create_event, storing the complete conversation history that can be retrieved with get_last_k_turns even if the runtime session restarts. Long-term memory uses configurable strategies to extract and consolidate key insights from these raw interactions, such as user preferences, important facts, or conversation summaries. Through retrieve_memories, agents can access this persistent knowledge across completely different sessions, enabling personalized experiences. The following diagram shows how AgentCore Runtime can use specific APIs to interact with Short-term and Long-term memory in AgentCore Memory.

This basic architecture, in which a runtime hosts your agents alongside a combination of short- and long-term memory, has become commonplace in most agentic AI applications today. Invoking AgentCore Runtime with the same session ID lets you access the agent state (for example, in a conversational flow) as though it were running locally, without the overhead of external storage operations, while AgentCore Memory selectively captures and structures the valuable information worth preserving beyond the session lifecycle. This hybrid approach means agents can maintain fast, contextual responses during active sessions while building cumulative intelligence over time. The automatic asynchronous processing of long-term memories according to each strategy in AgentCore Memory makes sure insights are extracted and consolidated without impacting real-time performance, creating a seamless experience where agents become progressively more helpful while maintaining responsive interactions. This architecture avoids the traditional trade-off between conversation speed and long-term learning, enabling agents that are both immediately useful and continuously improving.
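The following is a minimal sketch of how an agent hosted on AgentCore Runtime might call the memory APIs named above; the MemoryClient import, memory ID, actor ID, and namespace are assumptions based on the bedrock-agentcore SDK and should be checked against the current documentation:

from bedrock_agentcore.memory import MemoryClient  # assumed client class from the bedrock-agentcore SDK

memory = MemoryClient(region_name="us-east-1")
MEMORY_ID = "<your-memory-id>"  # created ahead of time in AgentCore Memory

# Short-term memory: store the latest conversation turn as an event
memory.create_event(
    memory_id=MEMORY_ID,
    actor_id="user-123",
    session_id="session-abc",
    messages=[("What is my order status?", "USER"), ("Your order ships tomorrow.", "ASSISTANT")],
)

# Short-term memory: reload recent turns after a runtime session restart
recent_turns = memory.get_last_k_turns(
    memory_id=MEMORY_ID, actor_id="user-123", session_id="session-abc", k=5
)

# Long-term memory: retrieve consolidated insights across sessions
preferences = memory.retrieve_memories(
    memory_id=MEMORY_ID, namespace="/users/user-123/preferences", query="shipping preferences"
)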

Process different modalities with large payloads

Most AI agent systems struggle with large file processing due to strict payload size limits, typically capping requests at just a few megabytes. This forces developers to implement complex file chunking, multiple API calls, or external storage solutions that add latency and complexity. AgentCore Runtime removes these constraints by supporting payloads up to 100 MB in size, enabling agents to process substantial datasets, high-resolution images, audio, and comprehensive document collections in a single invocation.

Consider a financial audit scenario where you need to verify quarterly sales performance by comparing detailed transaction data against a dashboard screenshot from your analytics system. Traditional approaches would require using external storage such as Amazon Simple Storage Service (Amazon S3) or Google Drive to download the Excel file and image into the container running the agent logic. With AgentCore Runtime, you can send both the comprehensive sales data and the dashboard image in a single payload from the client:

import base64

# excel_sales_data and dashboard_screenshot are the raw bytes of the Excel file
# and the dashboard screenshot, read earlier on the client side
large_payload = {
    "prompt": "Compare the Q4 sales data with the dashboard metrics and identify any discrepancies",
    "sales_data": base64.b64encode(excel_sales_data).decode('utf-8'),
    "dashboard_image": base64.b64encode(dashboard_screenshot).decode('utf-8')
}

The agent’s entrypoint function can be modified to process both data sources simultaneously, enabling this cross-validation analysis:

@app.entrypoint
def audit_analyzer(payload, context):
    # app and agent are initialized at module level, as shown in the earlier examples
    inputs = [
        {"text": payload.get("prompt", "Analyze the sales data and dashboard")},
        {"document": {"format": "xlsx", "name": "sales_data",
                      "source": {"bytes": base64.b64decode(payload["sales_data"])}}},
        {"image": {"format": "png",
                   "source": {"bytes": base64.b64decode(payload["dashboard_image"])}}}
    ]

    # The agent receives the text prompt, the Excel document, and the image in one call
    response = agent(inputs)
    return response.message['content'][0]['text']

To test out an example of using large payloads, refer to the following GitHub repo.

Operate asynchronous multi-hour agents

As AI agents evolve to tackle increasingly complex tasks—from processing large datasets to generating comprehensive reports—they often require multi-step processing that can take significant time to complete. However, most agent implementations are synchronous (with response streaming) and block until completion. Although synchronous, streaming agents are a common way to expose agentic chat applications to users, users cannot interact with the agent while a task or tool is still running, view the status of or cancel background operations, or start additional concurrent tasks while others are still in progress.

Building asynchronous agents forces developers to implement complex distributed task management systems with state persistence, job queues, worker coordination, failure recovery, and cross-invocation state management while also navigating serverless system limitations like execution timeouts (tens of minutes), payload size restrictions, and cold start penalties for long-running compute operations—a significant heavy lift that diverts focus from core functionality.

AgentCore Runtime alleviates this complexity through stateful execution sessions that maintain context across invocations, so developers can build upon previous work incrementally without implementing complex task management logic. The AgentCore SDK provides ready-to-use constructs for tracking asynchronous tasks and seamlessly managing compute lifecycles, and AgentCore Runtime supports execution times up to 8 hours and request/response payload sizes of 100 MB, making it suitable for most asynchronous agent tasks.

Getting started with asynchronous agents

You can get started with just a couple of code changes:

pip install bedrock-agentcore

To build interactive agents that perform asynchronous tasks, simply call add_async_task when starting a task and complete_async_task when finished. The SDK automatically handles task tracking and manages compute lifecycle for you.

# Start tracking a task
task_id = app.add_async_task("data_processing")

# Do your work...
# (your business logic here)

# Mark task as complete
app.complete_async_task(task_id)

These two method calls transform your synchronous agent into a fully asynchronous, interactive system. Refer to this sample for more details.
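As an illustration, here’s a minimal sketch of how these calls might be combined with a background thread inside an AgentCore entrypoint; the threading approach and the task name are assumptions, not the only way to structure asynchronous work:

import threading
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

def process_data(task_id):
    # ... long-running business logic here ...
    app.complete_async_task(task_id)  # mark the task complete so the session can go idle

@app.entrypoint
def handler(payload, context):
    task_id = app.add_async_task("data_processing")  # start tracking the background task
    threading.Thread(target=process_data, args=(task_id,), daemon=True).start()
    return {"status": "started", "task_id": str(task_id)}

app.run()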

The following example shows the difference between a synchronous agent that streams back responses to the user immediately vs. a more complex multi-agent scenario where longer-running, asynchronous background shopping agents use Amazon Bedrock AgentCore Browser to automate a shopping experience on amazon.com on behalf of the user.

Pay only for used resources

Amazon Bedrock AgentCore Runtime introduces a consumption-based pricing model that fundamentally changes how you pay for AI agent infrastructure. Unlike traditional compute models that charge for allocated resources regardless of utilization, AgentCore Runtime bills you only for what you actually use, for however long you use it. Said differently, you don’t have to pre-allocate resources like vCPUs or GB of memory, and you don’t pay for CPU resources during I/O wait periods. This distinction is particularly valuable for AI agents, which typically spend significant time waiting for LLM responses or external API calls to complete. Here is a typical agent event loop, where only the purple boxes are expected to be processed within Runtime:

The LLM call (light blue) and tool call (green) boxes take time, but are run outside the context of AgentCore Runtime; users only pay for processing that happens in Runtime itself (purple boxes). Let’s look at some real-world examples to understand the impact:

Customer support agent example

Consider a customer support agent that handles 10,000 user inquiries per day. Each interaction involves initial query processing, knowledge retrieval from Retrieval Augmented Generation (RAG) systems, LLM reasoning for response formulation, API calls to order systems, and final response generation. In a typical session lasting 60 seconds, the agent could actively use CPU for only 18 seconds (30%) while spending the remaining 42 seconds (70%) waiting for LLM responses or API calls to complete. Memory usage can fluctuate between 1.5 GB and 2.5 GB depending on the complexity of the customer query and the amount of context needed. With traditional compute models, you would pay for the full 60 seconds of CPU time and peak memory allocation. With AgentCore Runtime, you only pay for the 18 seconds of active CPU processing and the actual memory consumed moment-by-moment:

CPU cost: 18 seconds × 1 vCPU × ($0.0895/3600) = $0.0004475
 Memory cost: 60 seconds × 2GB average × ($0.00945/3600) = $0.000315
 Total per session: $0.0007625

For 10,000 daily sessions, this represents a 70% reduction in CPU costs compared to traditional models that would charge for the full 60 seconds.

Data analysis agent example

The savings become even more dramatic for data processing agents that handle complex workflows. A financial analysis agent processing quarterly reports might run for three hours but have highly variable resource needs. During data loading and initial parsing, it might use minimal resources (0.5 vCPU, 2 GB memory). When performing complex calculations or running statistical models, it might spike to 2 vCPU and 8 GB memory for just 15 minutes of the total runtime, while spending the remaining time waiting for batch operations or model inferences at much lower resource utilization. By charging only for actual resource consumption while maintaining your session state during I/O waits, AgentCore Runtime aligns costs directly with value creation, making sophisticated agent deployments economically viable at scale.

Conclusion

In this post, we explored how AgentCore Runtime simplifies the deployment and management of AI agents. The service addresses critical challenges that have traditionally blocked agent adoption at scale, offering framework-agnostic deployment, true session isolation, embedded identity management, and support for large payloads and long-running, asynchronous agents, all with a consumption-based model where you pay only for the resources you use.

With just four lines of code, developers can securely launch and scale their agents while using AgentCore Memory for persistent state management across sessions. For hands-on examples of AgentCore Runtime, covering everything from simple tutorials to complex use cases and demonstrating integrations with various frameworks such as LangGraph, Strands, CrewAI, MCP, ADK, Autogen, LlamaIndex, and OpenAI Agents, refer to the examples on GitHub.


About the authors

Shreyas Subramanian is a Principal Data Scientist and helps customers by using Generative AI and deep learning to solve their business challenges using AWS services like Amazon Bedrock and AgentCore. Dr. Subramanian contributes to cutting-edge research in deep learning, Agentic AI, foundation models and optimization techniques with several books, papers and patents to his name. In his current role at Amazon, Dr. Subramanian works with various science leaders and research teams within and outside Amazon, helping to guide customers to best leverage state-of-the-art algorithms and techniques to solve business-critical problems. Outside AWS, Dr. Subramanian is an expert reviewer for AI papers and funding via organizations like NeurIPS, ICML, ICLR, NASA, and NSF.

Kosti Vasilakakis is a Principal PM at AWS on the Agentic AI team, where he has led the design and development of several Bedrock AgentCore services from the ground up, including Runtime. He previously worked on Amazon SageMaker since its early days, launching AI/ML capabilities now used by thousands of companies worldwide. Earlier in his career, Kosti was a data scientist. Outside of work, he builds personal productivity automations, plays tennis, and explores the wilderness with his family.

Vivek Bhadauria is a Principal Engineer at Amazon Bedrock with almost a decade of experience in building AI/ML services. He now focuses on building generative AI services such as Amazon Bedrock Agents and Amazon Bedrock Guardrails. In his free time, he enjoys biking and hiking.

Read More

PwC and AWS Build Responsible AI with Automated Reasoning on Amazon Bedrock

PwC and AWS Build Responsible AI with Automated Reasoning on Amazon Bedrock

This is a guest post co-written with Scott Likens, Ambuj Gupta, Adam Hood, Chantal Hudson, Priyanka Mukhopadhyay, Deniz Konak Ozturk, and Kevin Paul from PwC

Organizations are deploying generative AI solutions while balancing accuracy, security, and compliance. In this globally competitive environment, scale matters less, speed matters more, and innovation matters most of all, according to recent PwC 2025 business insights on AI agents. To maintain a competitive advantage, organizations must support rapid deployment and verifiable trust in AI outputs. Particularly within regulated industries, mathematical verification of results can transform the speed of innovation from a potential risk into a competitive advantage.

This post presents how AWS and PwC are developing new reasoning checks that combine deep industry expertise with Automated Reasoning checks in Amazon Bedrock Guardrails to support innovation. Automated Reasoning is a branch of AI focused on algorithmic search for mathematical proofs. Automated Reasoning checks in Amazon Bedrock Guardrails, which encode knowledge into formal logic to validate whether large language model (LLM) outputs are possible, are generally available as of August 6, 2025.

This new guardrail policy maintains accuracy within defined parameters, unlike traditional probabilistic reasoning methods. The system evaluates AI-generated content against rules derived from policy documents, including company guidelines and operational standards. Automated Reasoning checks produce findings that provide insights into whether the AI-generated content aligns with the rules extracted from the policy, highlights ambiguity that exists in the content, and provides suggestions on how to remove assumptions.

“In a field where breakthroughs are happening at incredible speed, reasoning is one of the most important technical advances to help our joint customers succeed in generative AI,” says Matt Wood, Global CTIO at PwC, at AWS re:Invent 2024.

Industry-transforming use cases using Amazon Bedrock Automated Reasoning checks

The strategic alliance combining PwC’s proven, deep expertise and the innovative technology from AWS is set to transform how businesses approach AI-driven innovation. The following diagram illustrates PwC’s Automated Reasoning implementation. We initially focus on highly regulated industries such as pharmaceuticals, financial services, and energy.

In the following sections, we present three groundbreaking use cases developed by PwC teams.

EU AI Act compliance for financial services risk management

The European Union (EU) AI Act requires organizations to classify and verify all AI applications according to specific risk levels and governance requirements. PwC has developed a practical approach to address this challenge using Automated Reasoning checks in Amazon Bedrock Guardrails, which transforms EU AI Act compliance from a manual burden into a systematic, verifiable process. Given a description of an AI application’s use case, the solution converts risk classification criteria into defined guardrails, enabling organizations to consistently assess and monitor AI applications while supporting expert human judgment through automated compliance verification with auditable artifacts. The key benefits of using Automated Reasoning checks include:

  • Automated classification of AI use cases into risk categories
  • Verifiable logic trails for AI-generated classifications
  • Enhanced speed in identifying the required governance controls

The following diagram illustrates the workflow for this use case.

Pharmaceutical content review

PwC’s Regulated Content Orchestrator (RCO) is a globally scalable, multi-agent capability—powered by a core rules engine customized to company, region, product, and indication for use—that automates medical, legal, regulatory, and brand compliance. The RCO team was an early incubating collaborator of Amazon Bedrock Automated Reasoning checks, implementing it as a secondary validation layer in the marketing content generation process. This enhanced defense strengthened existing content controls, resulting in accelerated content creation and review processes while enhancing compliance standards. Key benefits of Automated Reasoning checks in Amazon Bedrock Guardrails include:

  • Applies automated, mathematically based safeguards for verifying RCO’s analysis
  • Enables transparent QA with traceable, audit-ready reasoning
  • Safeguards against potentially unsupported or hallucinated outputs

The following diagram illustrates the workflow for this use case.

Utility outage management for real-time decision support

Utility outage management applies Automated Reasoning checks in Amazon Bedrock Guardrails to enhance response times and operational efficiency of utility companies. The solution can generate standardized protocols from regulatory guidelines, creates procedures based on NERC and FERC requirements, and verifies AI-produced outage classifications. Through an integrated cloud-based architecture, this solution applies severity-based verification workflows to dispatch decisions—normal outages (3-hour target) assign tickets to available crews, medium severity (6-hour target) triggers expedited dispatch, and critical incidents (12-hour target) activate emergency procedures with proactive messaging.

The key benefits of using Automated Reasoning checks include:

  • Effective and enhanced responses to customers
  • Real-time operational insights with verified regulatory alignment
  • Accelerated decision-making with mathematical certainty

The following diagram illustrates the workflow for this use case.

Looking ahead

As the adoption of AI continues to evolve, particularly with agentic AI, the AWS and PwC alliance is focused on the following:

  • Expanding Automated Reasoning checks integrated solutions across more industries
  • Developing industry-specific agentic AI solutions with built-in compliance verification
  • Enhancing explainability features to provide greater transparency

Conclusion

The integration of Automated Reasoning checks in Amazon Bedrock Guardrails with PwC’s deep industry expertise offers a powerful avenue to help deploy AI-based solutions. As an important component of responsible AI, Automated Reasoning checks provides safeguards that help improve the trustworthiness of AI applications. With the expectation of mathematical certainty and verifiable trust in AI outputs, organizations can now innovate without compromising on accuracy, security, or compliance. To learn more about how Automated Reasoning checks works, refer to Minimize AI hallucinations and deliver up to 99% verification accuracy with Automated Reasoning checks: Now available and Improve accuracy by adding Automated Reasoning checks in Amazon Bedrock Guardrails.

Explore how Automated Reasoning checks in Amazon Bedrock can improve the trustworthiness of your generative AI applications. To learn more about using this capability or to discuss custom solutions for your specific needs, contact your AWS account team or an AWS Solutions Architect. Contact the PwC team to learn how you can use the combined power of AWS and PwC to drive innovation in your industry.


About the authors

Nafi Diallo is a Senior Automated Reasoning Architect at Amazon Web Services, where she advances innovations in AI safety and Automated Reasoning systems for generative AI applications. Her expertise is in formal verification methods, AI guardrails implementation, and helping global customers build trustworthy and compliant AI solutions at scale. She holds a PhD in Computer Science with research in automated program repair and formal verification, and an MS in Financial Mathematics from WPI.

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.


Bharathi Srinivasan is a Generative AI Data Scientist at the AWS Worldwide Specialist Organization. She works on developing solutions for Responsible AI, focusing on algorithmic fairness, veracity of large language models, and explainability. Bharathi guides internal teams and AWS customers on their responsible AI journey. She has presented her work at various learning conferences.

Dan Spillane, a Principal at Amazon Web Services (AWS), leads global strategic initiatives in the Consulting Center of Excellence (CCOE). He works with customers and partners to solve their critical business challenges using innovative technologies. Dan specializes in generative AI and responsible AI, including automated reasoning. He applies these tools to deliver measurable business value at scale. As a lifelong learner, Dan actively studies global cultures and business mechanisms, which enhances his ability to mentor others and drive cross-cultural initiatives.

Aartika Sardana Chandras is a Senior Product Marketing Manager for AWS Generative AI solutions, with a focus on Amazon Bedrock. She brings over 15 years of experience in product marketing, and is dedicated to empowering customers to navigate the complexities of the AI lifecycle. Aartika is passionate about helping customers leverage powerful AI technologies in an ethical and impactful manner.

Rama Lankalapalli is a Senior Partner Solutions Architect (PSA) at AWS where he leads a global team of PSAs supporting PwC, a major global systems integrator. Working closely with PwC’s global practice he champions enterprise cloud adoption by leveraging the breadth and depth of AWS services across migrations, modernization, security, AI/ML, and analytics. Rama architects scalable solutions that help organizations accelerate their digital transformation while delivering measurable business outcomes. His leadership combines deep technical expertise with strategic insight to drive customer success through innovative, industry-specific cloud solutions.

Scott Likens serves as the Chief AI Engineer over Global and US teams at PwC and leads the AI Engineering and Emerging Technology R&D teams domestically, driving the firm’s strategy around AI, Blockchain, VR, Quantum Computing, and other disruptors. With over 30 years of emerging technology expertise, he has helped clients transform customer experience, digital strategy, and operations for various industries.

Ambuj Gupta is a Director in PwC’s AI and Digital Contacts & Service practice, based in Chicago. With over 15 years of experience, Ambuj brings deep expertise in Artificial Intelligence, Agentic and Generative AI, Digital Contact Solutions, and Cloud Innovation across a broad spectrum of platforms and industries. He is recognized for driving strategic transformation through Cloud Native AI Automation and emerging technologies—including GenAI-powered agents, Intelligent Agent Assists, and Customer Data Platforms—to enhance channel performance and employee effectiveness.

Adam Hood is a Partner and AWS Data and AI Leader at PwC US. As a strategic and results-oriented technology leader, Adam specializes in driving enterprise-wide transformation and unlocking business value through the strategic application of digital systems, data, and GenAI/AI/ML including building agentic workflows. With a track record of success in industry and consulting, he has guided organizations through complex digital, finance, and ERP modernizations, from initial strategy and business case development to seamless execution and global rollout.

Chantal Hudson is a Manager in PwC UK’s AI and Modelling team. She has been with PwC for just over five years, starting her career in the South African firm. Chantal works primarily with large banks on credit risk modelling, and is particularly interested in applying AI to advance modelling practices.

Priyanka Mukhopadhyay is a Manager in PwC’s Cloud and Digital Engineering practice. She is an AWS Certified Solutions Architect – Associate with over 13 years of experience in Data Engineering. Over the past decade, she has honed her expertise in AWS services and has more than 12 years of experience in developing and delivering robust projects following Agile Methodologies.

Deniz Konak Ozturk is a Senior Manager within PwC’s AI & Modelling team. She has around 15 years of experience in AI/Gen AI and traditional model development, implementation, and validation across UK and EU/non-EU territories, compliance assessment with EU regulations, and IFRS 9 audits. Over the past 6 years, her focus has been primarily on AI/Gen AI, highlighted by her involvement in AI validation framework development, implementation of this framework for different clients, product management for an automated ML platform, and leading research and product ownership in an R&D initiative on alternative data usage for ML-based risk models targeting the financially underserved segment.

Kevin Paul is a Director within the AI Engineering group at PwC. He specializes in Applied AI, and has extensive experience across the AI lifecycle, building and maintaining solutions across industries.

Read More

How Amazon scaled Rufus by building multi-node inference using AWS Trainium chips and vLLM

How Amazon scaled Rufus by building multi-node inference using AWS Trainium chips and vLLM

At Amazon, our team builds Rufus, a generative AI-powered shopping assistant that serves millions of customers at immense scale. However, deploying Rufus at scale introduces significant challenges that must be carefully navigated. Rufus is powered by a custom-built large language model (LLM). As the model’s complexity increased, we prioritized developing scalable multi-node inference capabilities that maintain high-quality interactions while delivering low latency and cost-efficiency.

In this post, we share how we developed a multi-node inference solution using Amazon Trainium and vLLM, an open source library designed for efficient and high-throughput serving of LLMs. We also discuss how we built a management layer on top of Amazon Elastic Container Service (Amazon ECS) to host models across multiple nodes, facilitating robust, reliable, and scalable deployments.

Challenges with multi-node inference

As our Rufus model grew bigger in size, we needed multiple accelerator instances because no single chip or instance had enough memory for the entire model. We first needed to engineer our model to be split across multiple accelerators. Techniques such as tensor parallelism can be used to accomplish this, which can also impact various metrics such as time to first token. At larger scale, the accelerators on a node might not be enough and require you to use multiple hosts or nodes. At that point, you must also address managing your nodes as well as how your model is sharded across them (and their respective accelerators). We needed to address two major areas:

  • Model performance – Maximize compute and memory resources utilization across multiple nodes to serve models at high throughput, without sacrificing low latency. This includes designing effective parallelism strategies and model weight-sharding approaches to partition computation and memory footprint both within the same node and across multiple nodes, and an efficient batching mechanism that maximizes hardware resource utilization under dynamic request patterns.
  • Multi-node inference infrastructure – Design a containerized, multi-node inference abstraction that represents a single model running across multiple nodes. This abstraction and underlying infrastructure needs to support fast inter-node communication, maintain consistency across distributed components, and allow for deployment and scaling as a single, deployable unit. In addition, it must support continuous integration to allow rapid iteration and safe, reliable rollouts in production environments.

Solution overview

Taking these requirements into account, we built a multi-node inference solution designed to overcome the scalability, performance, and reliability challenges inherent in serving LLMs at production scale using tens of thousands of Trn1 instances.

To create a multi-node inference infrastructure, we implemented a leader/follower multi-node inference architecture in vLLM. In this configuration, the leader node uses vLLM for request scheduling, batching, and orchestration, and follower nodes execute distributed model computations. Both leader and follower nodes share the same NeuronWorker implementation in vLLM, providing a consistent model execution path through seamless integration with the AWS Neuron SDK.

To address how we split the model across multiple instances and accelerators, we used hybrid parallelism strategies supported in the Neuron SDK. Hybrid parallelism strategies such as tensor parallelism and data parallelism are selectively applied to maximize cross-node compute and memory bandwidth utilization, significantly improving overall throughput.

Being aware of how the nodes are connected is also important to avoid latency penalties. We took advantage of network topology-aware node placement. Optimized placement facilitates low-latency, high-bandwidth cross-node communication using Elastic Fabric Adapter (EFA), minimizing communication overhead and improving collective operation efficiency.

Lastly, to manage models across multiple nodes, we built a multi-node inference unit abstraction layer on Amazon ECS. This abstraction layer supports deploying and scaling multiple nodes as a single, cohesive unit, providing robust and reliable large-scale production deployments.

By combining a leader/follower orchestration model, hybrid parallelism strategies, and a multi-node inference unit abstraction layer built on top of Amazon ECS, this architecture deploys a single model replica to run seamlessly across multiple nodes, supporting large production deployments. In the following sections, we discuss the architecture and key components of the solution in more detail.

Inference engine design

We built an architecture on Amazon ECS using Trn1 instances that supports scaling inference beyond a single node to fully use distributed hardware resources, while maintaining seamless integration with NVIDIA Triton Inference Server, vLLM, and the Neuron SDK.

Although the following diagram illustrates a two-node configuration (leader and follower) for simplicity, the architecture is designed to be extended to support additional follower nodes as needed.

AWS NeuronX distributed inference system architecture detailing Leader node's inference engine and Follower node's worker process integration

In this architecture, the leader node runs the Triton Inference Server and vLLM engine, serving as the primary orchestration unit for inference. By integrating with vLLM, we can use continuous batching—a technique used in LLM inference to improve throughput and accelerator utilization by dynamically scheduling and processing inference requests at the token level. The vLLM scheduler handles batching based on the global batch size. It operates in a single-node context and is not aware of multi-node model execution. After the requests are scheduled, they’re handed off to the NeuronWorker component in vLLM, which handles broadcasting model inputs and executing the model through integration with the Neuron SDK.

The follower node operates as an independent process and acts as a wrapper around the vLLM NeuronWorker component. It continuously listens to model inputs broadcasted from the leader node and executes the model using the Neuron runtime in parallel with the leader node.

For the nodes to exchange the information they need, two communication mechanisms are required:

  • Cross-node model input broadcasting on CPU – Model inputs are broadcast from the leader node to follower nodes using the torch.distributed communication library with the Gloo backend (a minimal sketch follows this list). A distributed process group is initialized during NeuronWorker initialization on both the leader and follower nodes. The broadcast occurs on CPU over standard TCP connections, allowing follower nodes to receive the full set of model inputs required for model execution.
  • Cross-node collectives communication on Trainium chips – During model execution, cross-node collectives (such as all gather or all reduce) are managed by the Neuron Distributed Inference (NxDI) library, which uses EFA to deliver high-bandwidth, low-latency inter-node communication.
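
The following is a minimal sketch of that CPU-side broadcast (not the production Rufus code); it uses the real torch.distributed Gloo APIs, but the rank layout, environment setup, and function names are illustrative:

# Minimal sketch of leader-to-follower model input broadcasting over Gloo.
# Rank 0 is the leader; ranks 1..N-1 are followers. Setup details are illustrative.
import os
import torch.distributed as dist

def init_cpu_broadcast_group(rank: int, world_size: int, leader_addr: str, port: int = 29500):
    # A TCP-based Gloo process group spanning the leader and follower nodes
    os.environ["MASTER_ADDR"] = leader_addr
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

def broadcast_model_inputs(model_inputs, rank: int):
    # The leader supplies the scheduled batch's inputs; followers pass None and
    # receive the same object once the collective completes
    payload = [model_inputs if rank == 0 else None]
    dist.broadcast_object_list(payload, src=0)
    return payload[0]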

Model parallelism strategies

We adopted hybrid model parallelism strategies through integration with the Neuron SDK to maximize cross-node memory bandwidth utilization (MBU) and model FLOPs utilization (MFU), while also reducing memory pressure on each individual node. For example, during the context encoding (prefill) phase, we use context parallelism by splitting inputs along the sequence dimension, facilitating parallel computation of attention layers across nodes. In the decoding phase, we adopt data parallelism by partitioning the input along the batch dimension, so each node can serve a subset of batch requests independently.
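
As a conceptual illustration only (plain PyTorch, not Neuron SDK code, with function names of our own choosing), the two phases partition inputs along different dimensions:

# Conceptual illustration: partitioning inputs for the prefill and decode phases
import torch

def split_for_context_parallelism(input_ids: torch.Tensor, num_nodes: int):
    # Prefill: each node encodes a slice of the prompt sequence in parallel
    return torch.chunk(input_ids, num_nodes, dim=1)  # (batch, seq) -> sequence slices

def split_for_data_parallelism(input_ids: torch.Tensor, num_nodes: int):
    # Decode: each node serves a subset of the batched requests independently
    return torch.chunk(input_ids, num_nodes, dim=0)  # (batch, seq) -> batch slices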

Multi-node inference infrastructure

We also designed a distributed LLM inference abstraction: the multi-node inference unit, as illustrated in the following diagram. This abstraction serves as a unit of deployment for inference service, supporting consistent and reliable rolling deployments on a cell-by-cell basis across the production fleet. This is important so you only have a minimal number of nodes offline during upgrades without impacting your entire service. Both the leader and follower nodes described earlier are fully containerized, so each node can be independently managed and updated while maintaining a consistent execution environment across the entire fleet. This consistency is critical for reliability, because the leader and follower nodes must run with identical software stacks—including Neuron SDKs, Neuron drivers, EFA software, and other runtime dependencies—to achieve correct and reliable multi-node inference execution. The inference containers are deployed on Amazon ECS.

AWS inference architecture showing control plane, service routing, and distributed model execution across leader and follower nodes

A crucial aspect of achieving high-performance distributed LLM inference is minimizing the latency of cross-node collective operations, which rely on Remote Direct Memory Access (RDMA). To enable this, optimized node placement is essential: the deployment management system must compose a cell by pairing nodes based on their physical location and proximity. With this optimized placement, cross-node operations can utilize the high-bandwidth, low-latency EFA network available to instances. The deployment management system gathers this information using the Amazon EC2 DescribeInstanceTopology API to pair nodes based on their underlying network topology.
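
As a hedged illustration of this topology-aware pairing (the grouping logic is ours, not the actual deployment management system), the DescribeInstanceTopology response can be used to group instances that share their closest network node:

# Illustrative sketch: group instances by a shared network node so that leader
# and follower candidates sit close together on the EFA fabric.
from collections import defaultdict
import boto3

def group_by_closest_network_node(instance_ids):
    ec2 = boto3.client("ec2")
    response = ec2.describe_instance_topology(InstanceIds=instance_ids)
    groups = defaultdict(list)
    for instance in response["Instances"]:
        # The last entry in NetworkNodes is the network node closest to the instance;
        # instances sharing it are physically near each other
        groups[instance["NetworkNodes"][-1]].append(instance["InstanceId"])
    return groups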

To maintain high availability for customers (making sure Rufus is always online and ready to answer a question), we developed a proxy layer positioned between the system’s ingress or load-balancing layer and the multi-node inference unit. This proxy layer is responsible for continuously probing and reporting the health of all worker nodes. Rapidly detecting unhealthy nodes in a distributed inference environment is critical for maintaining availability because it makes sure the system can immediately route traffic away from unhealthy nodes and trigger automated recovery processes to restore service stability.

The proxy also monitors real-time load on each multi-node inference unit and reports it to the ingress layer, supporting fine-grained, system-wide load visibility. This helps the load balancer make optimized routing decisions that maximize per-cell performance and overall system efficiency.

Conclusion

As Rufus continues to evolve and become more capable, we must continue to build systems to host our model. Using this multi-node inference solution, we successfully launched a much larger model across tens of thousands of AWS Trainium chips to Rufus customers, supporting Prime Day traffic. This increased model capacity has enabled new shopping experiences and significantly improved user engagement. This achievement marks a major milestone in pushing the boundaries of large-scale AI infrastructure for Amazon, delivering a highly available, high-throughput, multi-node LLM inference solution at industry scale.

AWS Trainium, in combination with solutions such as NVIDIA Triton and vLLM, can help you run large inference workloads at scale with strong price performance. We encourage you to try these solutions to host large models for your workloads.


About the authors

James Park is an ML Specialist Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

Faqin Zhong is a Software Engineer at Amazon Stores Foundational AI, working on LLM inference infrastructure and optimizations. Passionate about generative AI technology, Faqin collaborates with leading teams to drive innovation, making LLMs more accessible and impactful, and ultimately enhancing customer experiences across diverse applications. Outside of work, she enjoys cardio exercise and baking with her son.

Charlie Taylor is a Senior Software Engineer within Amazon Stores Foundational AI, focusing on developing distributed systems for high performance LLM inference. He builds inference systems and infrastructure to help larger, more capable models respond to customers faster. Outside of work, he enjoys reading and surfing.

Yang Zhou is a Software Engineer working on building and optimizing machine learning systems. His recent focus is enhancing the performance and cost-efficiency of generative AI inference. Beyond work, he enjoys traveling and has recently discovered a passion for running long distances.

Nicolas Trown is a Principal Engineer in Amazon Stores Foundational AI. His recent focus is lending his systems expertise to the Rufus Inference team and improving efficient utilization across the Rufus experience. Outside of work, he enjoys spending time with his wife and taking day trips to the nearby coast, Napa, and Sonoma areas.

Michael Frankovich is a Principal Software Engineer at Amazon Core Search, where he supports the ongoing development of their cellular deployment management system used to host Rufus, among other search applications. Outside of work, he enjoys playing board games and raising chickens.

Adam (Hongshen) Zhao is a Software Development Manager at Amazon Stores Foundational AI. In his current role, Adam is leading the Rufus Inference team to build generative AI inference optimization solutions and inference systems at scale for fast inference at low cost. Outside of work, he enjoys traveling with his wife and creating art.

Bing Yin is a Director of Science at Amazon Stores Foundational AI. He leads the effort to build LLMs that are specialized for shopping use cases and optimized for inference at Amazon scale. Outside of work, he enjoys running marathon races.

Parthasarathy Govindarajen is Director of Software Development at Amazon Stores Foundational AI. He leads teams that develop advanced infrastructure for large language models, focusing on both training and inference at scale. Outside of work, he spends his time playing cricket and exploring new places with his family.

Read More

Build an intelligent financial analysis agent with LangGraph and Strands Agents

Agentic AI is revolutionizing the financial services industry through its ability to make autonomous decisions and adapt in real time, moving well beyond traditional automation. Imagine an AI assistant that can analyze quarterly earnings reports, compare them against industry expectations, and generate insights about future performance. This seemingly straightforward task involves multiple complex steps: document processing, data extraction, numerical analysis, context integration, and insight generation.

Financial analysis workflows present unique technical challenges for generative AI that push the boundaries of traditional large language model (LLM) implementations. This domain requires architectural patterns designed to handle the inherent complexities of financial workflows to assist analysts. Although agentic AI systems drive substantial improvements in operational efficiency and customer experience, delivering measurable productivity gains across operations, they also present unique implementation challenges around governance, data privacy, and regulatory compliance. Financial institutions must carefully balance the transformative potential of agentic AI—from dynamic financial coaching to real-time risk assessment—with the need for robust oversight and control frameworks.

This post describes an approach of combining three powerful technologies to illustrate an architecture that you can adapt and build upon for your specific financial analysis needs: LangGraph for workflow orchestration, Strands Agents for structured reasoning, and Model Context Protocol (MCP) for tool integration.

The following screenshot demonstrates how this solution operates in practice:

The reference architecture discussed in this post emerged from experimenting with different patterns for financial domain applications. We hope these insights help you navigate similar challenges in your own projects, whether in finance or other complex analytical domains.

Understanding the challenges in financial analysis workflows

Before diving into the solution implementation details, it’s worth understanding the core challenges that informed our architectural decisions. These challenges aren’t unique to our project; they’re inherent to the nature of financial analysis and appear in many complex analytical applications.

Our first challenge involved dynamic and adaptive analysis flows. Financial analysis workflows are inherently dynamic, with analysts constantly adjusting their approach based on observed patterns and intuition. An analyst might shift focus from revenue analysis to operational metrics, or dive deeper into specific industry segments based on emerging insights. This requires an orchestration strategy that can handle flexible, nonlinear execution paths while maintaining analytical coherence and context throughout the process.

Our second challenge was complex integration across multiple data sources. Financial analysis requires seamless integration with various internal and external systems, from proprietary databases to public industry data APIs. Each integration point introduces potential compatibility issues and architectural complexity. The challenge lies in maintaining manageable system coupling while enabling access to diverse data sources, each with its own interfaces, authentication methods, and data formats. A robust integration strategy is needed to keep system complexity at a sustainable level while maintaining reliable data access across sources.

Solution overview

To address these challenges, we developed an architectural pattern combining three complementary technologies:

  • LangGraph – Provides the foundation for handling dynamic analysis flows through structured workflow orchestration, enabling flexible execution paths while maintaining state and context
  • Strands Agents – Serves as an intermediary layer, coordinating between foundation models (FMs) and specialized tools to execute complex analytical tasks
  • MCP – Standardizes the integration of diverse data sources and tools, simplifying the complexity of connecting to multiple financial systems and services.

This pattern is illustrated in the following diagram.

The combination of these frameworks serves distinct but complementary purposes in the architecture. LangGraph excels at orchestrating high-level workflows and directional processes, and Strands Agents optimizes autonomous agent interactions and decision-making at a more granular level. We used this dual-framework approach to make the most of the strengths of each technology: LangGraph’s structured workflow capabilities for macro-level orchestration, and the specialized handling by Strands Agents of agent-tool interactions. Rather than being redundant, this architecture creates a more robust and flexible system capable of handling both complex workflow management and intelligent agent operations effectively. This modular, maintainable system can handle the complexity of financial analysis while remaining flexible and extensible.

LangGraph: Structuring financial analysis workflows

One effective design principle when implementing multi-agent systems is breaking down complex problems into simpler tasks. Instead of asking an agent to solve an entire financial problem all at once, decompose it into discrete analytical steps. Consider the following example: when a user wants to analyze the performance of Company A compared to Company B, the process works as follows:

  1. The user submits a natural language query: “Compare the quarterly revenue growth of Company A and Company B for the past year and explain what’s driving the differences.”
  2. The router node analyzes this query and determines that it requires financial data retrieval followed by comparative analysis. It routes the request to the agent specialized for financial analysis.
  3. The specialized agent processes the request and generates a comprehensive response that addresses the user’s query. (The specific tools and reasoning process that the agent uses to generate its response will be covered in the next section.)

This workflow is illustrated in the following diagram.

In a multi-agent architecture design, LangGraph provides powerful workflow orchestration capabilities that align well with the dynamic nature of financial analysis. It supports both directional and recursive workflows, so you can model everything from straightforward sequential processes to complex iterative analyses that evolve based on intermediate findings.

A key strength of LangGraph is how it combines flexible workflow patterns with precise programmatic control. Rather than leaving all flow decisions to LLM reasoning, you can implement specific business logic to guide the analysis process. For instance, our architecture’s router node can direct requests to specialized agents based on concrete criteria, determining whether a request requires file processing, numerical analysis, or industry data integration.

Additionally, LangGraph’s GraphState primitive elegantly solves the challenge of maintaining context across distributed agent nodes. This shared state mechanism makes sure that each agent in the workflow can access and build upon previous analysis results, minimizing redundant processing while maintaining analytical coherence throughout the entire process.
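
To make this routing pattern concrete, the following minimal sketch uses LangGraph’s StateGraph; the state fields, node names, and routing criterion are illustrative simplifications rather than the exact implementation:

# Illustrative LangGraph routing: a router node applies concrete business logic
# to send each request to a specialized node, sharing state via GraphState.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    query: str
    route: str
    result: str

def router_node(state: GraphState) -> GraphState:
    # Concrete criteria (not LLM reasoning) decide the execution path
    route = "financial_analysis" if "revenue" in state["query"].lower() else "documentation"
    return {**state, "route": route}

def financial_analysis_node(state: GraphState) -> GraphState:
    # In the full solution, this node invokes a Strands agent with MCP tools
    return {**state, "result": f"analysis for: {state['query']}"}

def documentation_node(state: GraphState) -> GraphState:
    return {**state, "result": f"document generated for: {state['query']}"}

graph = StateGraph(GraphState)
graph.add_node("router", router_node)
graph.add_node("financial_analysis", financial_analysis_node)
graph.add_node("documentation", documentation_node)
graph.set_entry_point("router")
graph.add_conditional_edges(
    "router",
    lambda state: state["route"],
    {"financial_analysis": "financial_analysis", "documentation": "documentation"},
)
graph.add_edge("financial_analysis", END)
graph.add_edge("documentation", END)
app = graph.compile()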

Strands Agents: Orchestrating financial reasoning

While LangGraph manages the overall workflow, Strands Agents handles the specialized reasoning and tool usage within each node. When the agent receives the request to compare companies, it first identifies the specific data needed:

  1. Initial reasoning – The agent determines that it needs quarterly revenue figures for both companies, year-over-year growth percentages, and industry benchmarks for context
  2. Data retrieval – The agent retrieves financial news mentioning both companies and pulls fundamental metrics such as quarterly revenue data and industry averages
  3. Analysis and synthesis – The agent analyzes growth trends, identifies correlations between news events and performance changes, and synthesizes findings into a comprehensive analysis explaining not only the revenue differences but also potential driving factors

This reasoning-tool execution cycle, shown in the following diagram, allows the agent to dynamically gather information and refine its analysis as needed.

For a different use case, consider a request to “Create a Word document summarizing the financial performance of Amazon for investors”:

  1. The documentation agent uses a document generation tool to create a Word document with proper formatting
  2. The agent structures the content with appropriate sections, such as “Executive Summary,” “Revenue Analysis,” and “Industry Position”
  3. Using specialized formatting tools, it creates data tables for quarterly comparisons, bullet points for key insights, and embedded charts visualizing performance trends
  4. Finally, it delivers a professionally formatted document with download links

Strands agents are designed to flexibly iterate through their specialized tasks using the tools provided to them, gathering necessary information for comprehensive responses. It’s crucial to connect these agents with appropriate tools that match the nature of their assigned tasks—whether it’s financial data analysis tools, document generation capabilities, or other specialized functionalities. This alignment between agent capabilities and their toolsets facilitates optimal performance in executing their designated responsibilities.
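
The following is a minimal sketch of this agent-tool pairing; the tool, its stub data, and the system prompt are illustrative, and the default model configured for Strands Agents is assumed:

# Illustrative pairing of a Strands agent with a task-appropriate tool
from strands import Agent, tool

@tool
def quarterly_revenue(ticker: str) -> str:
    """Return quarterly revenue figures for a ticker (stub data for illustration)."""
    return f"{ticker}: Q1 $10.2B, Q2 $11.0B, Q3 $11.8B, Q4 $12.5B"

analysis_agent = Agent(
    tools=[quarterly_revenue],
    system_prompt="You are a financial analyst. Use the available tools before answering.",
)

# The agent iterates through reasoning and tool calls to assemble its answer
result = analysis_agent("Compare the quarterly revenue growth of Company A and Company B.")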

MCP: Interaction with financial tools

The MCP provides a foundation for creating extensible, standardized financial tools. Rather than building monolithic applications where M clients need to connect to N servers (creating M×N integration complexity), MCP standardizes the communication protocol, reducing this to M+N connections. This creates a modular system where financial analysts can focus on creating specialized tools, agent developers can concentrate on reasoning and orchestration, and new capabilities can be added without modifying existing components.
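
For example, four agent clients integrating directly with five financial data servers would require 20 separate integrations; with MCP, each client and each server implements the protocol once, for a total of 9.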

Modular server architecture

Our architecture uses specialized MCP servers that provide focused financial capabilities. Each server exposes self-documenting tools through the standardized MCP protocol:

  • Stock analysis – Real-time quotes and historical market data from Yahoo Finance
  • Financial analysis – Combines fundamental metrics, such as price-to-earnings (P/E) ratios and revenue growth, with technical indicators, such as relative strength index (RSI) and moving average convergence/divergence (MACD) for investment recommendations
  • Web and news search – Aggregates sentiment from news sources and social media with theme extraction

These capabilities are shown in the following diagram.

Here’s how a typical MCP tool is structured:

# The tool is registered on a FastMCP server from the Python MCP SDK; the server
# name is illustrative, and the two helper functions are defined elsewhere in the module.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("financial-analysis")

@mcp.tool()
async def comprehensive_analysis(equity: str) -> str:
    """
    Get complete investment analysis combining both fundamental and technical factors.
    Provides a holistic view of a stock with interpreted signals, valuation assessment,
    growth metrics, financial health indicators, and momentum analysis with clear buy/sell signals.

    Args:
        equity: Stock ticker symbol (e.g., AAPL, MSFT, TSLA)
    """
    try:
        # Fetch raw analysis data and format it for the calling agent
        data = await fetch_comprehensive_analysis(equity)
        return format_analysis_results(data)
    except Exception as e:
        return f"Error retrieving comprehensive analysis: {str(e)}"

Client-side integration

The Strands Agents MCP client connects to available servers and provides unified tool access. However, because individual MCP servers can expose multiple tools, having too many tools in context can make the agent’s reasoning process inefficient. To address this, our implementation provides flexible server selection, so users can connect only to relevant MCP servers based on their analysis needs:

# Server name mapping for easy selection
SERVER_NAME_TO_URL = {
    "word": "http://localhost:8089/mcp",
    "stock": "http://localhost:8083/mcp", 
    "financial": "http://localhost:8084/mcp",
    "news": "http://localhost:8085/mcp"
}

async def execute_financial_analysis(state: GraphState) -> GraphState:
    # Select relevant servers based on analysis type
    selected_servers = [
        SERVER_NAME_TO_URL["stock"],
        SERVER_NAME_TO_URL["financial"],
        SERVER_NAME_TO_URL["news"]  # Skip document generation for basic analysis
    ]
    
    mcp_clients, all_tools = await create_mcp_clients(selected_servers)
    
    # Create agent with focused tool access
    agent = Agent(
        model=model,
        tools=all_tools,  # Only tools from selected servers
        system_prompt=FINANCIAL_SYSTEM_PROMPT
    )

This selective approach significantly improves analysis quality by reducing tool context noise and helping the agent focus on relevant capabilities. The implementation also supports connecting to individual servers when specific functionality is needed:

# Connect to a single server for specialized tasks
financial_client = get_mcp_client("financial")
available_tools = get_all_available_tools()  # Discover all tools across servers

Although our examples use localhost endpoints for development, MCP’s streamable HTTP protocol enables seamless connection to remote servers by simply updating the URL mappings:

# Production setup with remote MCP servers
PRODUCTION_SERVER_URLS = {
    "stock": "https://stock-analysis-api.company.com/mcp",
    "financial": "https://financial-analysis-engine.company.com/mcp",
    "news": "https://news-sentiment-service.company.com/mcp"
}

The following is an example request that asks, “Analyze Amazon financial performance and market position for Q3 2025.”

The workflow is as follows:

  1. LangGraph receives the request and routes to the financial_analysis node
  2. MCP servers provide the necessary tools:
    1. Stock server – yahoo_stock_quote() for current prices
    2. Financial server – comprehensive_analysis() for detailed metrics
    3. News server – get_market_sentiment() for recent sentiment data
  3. Strands Agents orchestrates the analysis and returns the final answer about the financial performance and market position of Amazon in Q3 2025:
    1. Stock server – Current price: $220, P/E: 47.8, market cap: $1.6T
    2. Financial server – Revenue growth: +11.2% YoY, AWS: +19% YoY, operating margin: 7.8%
    3. News server – Sentiment: Positive (0.74), key themes: AI investments, logistics efficiency, holiday retail prep

The agent returns the following output:

Amazon demonstrates strong performance across key segments in Q3 2025.
AWS continues robust growth at +19% YoY, driving improved operating margins to 7.8%.
The company’s AI investments are positioning it well for future growth, while logistics improvements support retail competitiveness heading into holiday season.
Overall sentiment remains positive despite some margin pressure from continued infrastructure investments.

Deploy the financial analysis agent

The deployment of our financial analysis agent follows two main paths: local development for testing and iteration, followed by production deployment options for real-world use. This section focuses on getting the application running in your development environment, with production deployment covered separately.

Local deployment enables rapid iteration and testing of your financial analysis workflows before moving to production. Our application uses Amazon Bedrock for its foundational AI capabilities, powering the Strands Agents components. Make sure your AWS account has Amazon Bedrock access enabled with appropriate model permissions for the models specified in your configuration. The solution can be found on GitHub in the sample-agentic-frameworks-on-aws repository. Follow these steps:

Development environment deployment

  1. To set up your development environment, clone the repository and install dependencies for both frontend and backend components:
# Clone the repository
git clone https://github.com/your-organization/financial-agent.git
cd financial-agent

# Install frontend dependencies
npm install
  2. To set up the Python backend, install the Python dependencies required for the backend services:
# Navigate to the backend directory
cd py-backend

# Install dependencies
pip install -r requirements.txt

# Return to the main project directory
cd ..
  3. To launch the application, use the convenient script configured in the project, which launches both the frontend and backend services simultaneously:
# Launch both frontend and backend
npm run dev

This command starts the Next.js frontend on port 3000 and launches the FastAPI backend on port 8000.

You can now access the financial analysis agent in your browser at http://localhost:3000.

  4. To make sure your local deployment is functioning correctly, open the application in your browser and submit a sample query.

  5. To configure the MCP tool integration, on the MCP console under Configuration, choose Tool. You can connect or disconnect necessary tools when you want to change MCP settings. This flexibility means you can:
    1. Tailor the agent’s capabilities to specific financial analysis tasks
    2. Add specialized data sources for industry segments
    3. Control which external APIs the agent can access
    4. Create custom tool combinations for different analytical scenarios

This configuration-driven approach embodies the flexible architecture we’ve discussed throughout this post, allowing the financial analysis agent to be adapted to various use cases without code changes. Whether you need stock market analysis, portfolio evaluation, industry news integration, or financial documentation generation, the appropriate combination of MCP tools can be selected through this interface.

Production deployment

Local development is sufficient for testing and iteration, but deploying the financial analysis agent to production requires a more robust architecture. This section outlines our production deployment approach, which uses AWS services to create a scalable, resilient, and secure solution.

The production deployment follows a distributed architecture pattern that separates concerns while maintaining efficient communication paths. As shown in the following diagram, the production architecture consists of:

  1. Web frontend – A Next.js application deployed on AWS Amplify that provides the user interface for interacting with the financial agent
  2. Backend application – A containerized FastAPI service running on Amazon Elastic Container Service (Amazon ECS) that orchestrates the LangGraph workflow and Strands Agents for financial analysis and documentation
  3. MCP server farm – A collection of specialized microservices that provide financial tools and data connectors
  4. Amazon Bedrock – The FM service that powers the agents’ reasoning capabilities
  5. Amazon DynamoDB – A persistent storage layer for conversation history and state management

This distributed approach enables scalable performance while preserving the fundamental architectural patterns we’ve discussed throughout this post. The deployment strategy can be adapted to your specific organizational requirements and infrastructure preferences.

Conclusion: Beyond implementation to architectural thinking

In this post, we’ve demonstrated how combining LangGraph, Strands Agents, and MCP creates a powerful architectural pattern for financial analysis applications. This approach directly addresses the key challenges in financial workflows:

  1. Decomposition of complex problems – Breaking financial analysis into manageable components with LangGraph allows for focused, accurate reasoning while maintaining overall context
  2. Reasoning-tool execution cycles – The separation between reasoning (using Strands Agents) and execution (using MCP) creates a more modular system that mirrors how financial analysts actually work
  3. Flexible tool integration – The Strands Agents built-in MCP support enables seamless connection to financial tools and document generation capabilities, allowing continuous expansion of functionality without disrupting the core architecture

This architecture serves as a foundation that you can extend and adapt to your specific financial analysis needs. Whether you’re building portfolio analysis tools, equity research assistants, investment advisory solutions, or financial documentation generators, these patterns provide a robust starting point. We hope these architectural insights help you navigate similar challenges in your own projects, whether in finance or other complex analytical domains.

Next steps

We invite you to build upon our LangGraph, Strands, and MCP architecture to transform your financial workflows. Whether you’re building for a portfolio manager, equity analyst, risk professional, or financial advisor, you can create specialized tools that connect to your unique data sources. Start with a single MCP server addressing a specific pain point, then gradually expand as you experience the benefits.

Join our community of generative AI practitioners who are pushing the boundaries of what’s possible with these technologies. Share your tools by contributing to our Agentic Frameworks open source GitHub repository.

About the Authors


Evan Grenda is a Sr. GenAI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise agentic AI challenges. Evan holds a BA in Business Administration from the University of South Carolina, an MBA from Auburn University, and an MS in Data Science from St. Joseph’s University.

Karan Singh is an Agentic AI leader at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise agentic AI challenges. Karan holds a BS in Electrical Engineering from Manipal University, an MS in Electrical Engineering from Northwestern University, and an MBA from the Haas School of Business at University of California, Berkeley.

Sayan Chakraborty is a Sr. Solutions Architect at AWS. He helps large enterprises build secure, scalable, and performant solutions in the AWS Cloud. With a background in enterprise and technology architecture, he has experience delivering large-scale digital transformation programs across a wide range of industry verticals. He holds a B. Tech. degree in Computer Engineering from Manipal University, Sikkim, India.

Kihyeon Myung is a Sr. Applied AI Architect at AWS, where he helps enterprise customers build and deploy agent applications and RAG pipelines. With over 3 years of experience in AI and GenAI, Kihyeon specializes in designing and implementing agentic AI systems, combining his development background with expertise in Machine Learning and Generative AI.

Read More

Amazon Bedrock AgentCore Memory: Building context-aware agents

AI assistants that forget what you told them 5 minutes ago aren’t very helpful. While large language models (LLMs) excel at generating human-like responses, they are fundamentally stateless—they don’t retain information between interactions. This forces developers to build custom memory systems to track conversation history, remember user preferences, and maintain context across sessions, often solving the same problems repeatedly across different applications.

At the AWS Summit New York City 2025, we introduced Amazon Bedrock AgentCore Memory, a service for agent memory management. AgentCore Memory makes it easy for developers to build context-aware agents by eliminating complex memory infrastructure management while providing full control over what the AI agent remembers. It provides powerful capabilities for maintaining both short-term working memory (capturing immediate conversation context within a session) and long-term intelligent memory (storing persistent insights and preferences across sessions), so AI agents can retain context, learn from interactions, and deliver truly personalized experiences.

AgentCore Memory transforms one-off conversations into continuous, evolving relationships between users and AI agents. Instead of repeatedly asking for the same information (“What’s your account number?”) or forgetting critical preferences (“I’m allergic to shellfish”), agents can maintain context and build upon previous interactions naturally. AgentCore Memory seamlessly integrates with other agent-building tools, so that developers can enhance existing agents with persistent memory capabilities without managing complex infrastructure. Unlike do-it-yourself memory solutions that require developers to manually orchestrate multiple components—raw conversation storage, vector databases, session caching systems, and custom retrieval logic—AgentCore Memory offers a fully managed service with built-in storage, intelligent extraction and efficient retrieval.

In this blog post, we explore the specific challenges that AgentCore Memory solves, introduce its core concepts, and share best practices.

The memory problem in AI agents

The ability to remember is the foundation of meaningful human relationships. We remember past conversations, learn preferences over time, and build shared context that deepens our connections. Developers building AI agents have traditionally faced significant technical challenges implementing these same fundamental capabilities, creating a substantial gap between human-like understanding and machine interactions.

When implementing memory for AI agents, developers encounter several fundamental challenges:

  • Context window constraints: Modern LLMs have limited capacity to process conversation history. Developers must implement context window management strategies (often manually pruning or summarizing earlier exchanges) to keep ongoing customer conversations within token limits.
  • State management complexity: Without dedicated memory systems, developers often build custom solutions for tracking conversation history, user preferences, and agent state—reinventing similar solutions across projects.
  • Memory recall challenges: Storing raw conversation data isn’t enough. Without intelligent extraction and structured memory organization, developers must implement complex systems to identify and surface relevant information at the right time.
  • Persistence without intelligence: Most existing solutions focus on data storage rather than intelligent memory formation, providing no built-in mechanisms to extract relevant insights or identify patterns that matter to users.

These limitations don’t just create technical hurdles—they fundamentally impact user experience, for example:

  • A financial advisor agent losing context about a user’s retirement goals and strategies discussed earlier in the same session
  • A coding assistant lacking access to previously established user programming style preferences or setup details

Without effective memory implementation, conversations become disjointed and repetitive rather than continuous and evolving. This creates unnecessary back-and-forth interactions that increase costs and latency while frustrating users.

Introducing Amazon Bedrock AgentCore Memory

Amazon Bedrock AgentCore Memory is a fully managed service that lets your AI agents deliver intelligent, context-aware, and personalized interactions by maintaining both immediate and long-term knowledge. The service is built on five key design principles:

  1. Abstracted storage: AgentCore Memory handles the storage complexity for short- and long-term information, without requiring developers to manage underlying infrastructures.
  2. Security: Data is encrypted both at rest and in transit.
  3. Continuity: Events within sessions are stored in chronological order to maintain accurate narrative flow and context.
  4. Data organization and access control: Hierarchical namespaces provide structured memory organization and fine-grained access control for shared memory contexts.
  5. Scalability and performance: Efficiently handle large volumes of memory data with low latency, facilitating fast and reliable retrieval as usage grows.

The service seamlessly integrates with other services in Bedrock AgentCore, such as AgentCore Runtime and AgentCore Observability. It combines API-first design with pre-verified defaults, so developers can quickly implement basic memory capabilities while retaining extensibility for advanced scenarios.

Core components of AgentCore Memory

AgentCore Memory consists of several key components that work together to provide both short-term context and long-term intelligence for your agents.

Let’s explore each component (shown in the preceding figure) with examples of how they function in practice.

1. AgentCore Memory resource

A memory resource is a logical container within AgentCore Memory that encapsulates both raw events and processed long-term memories. Think of a memory resource as the foundation of your agent’s memory system—it defines how long data is retained, how it’s secured, and how raw interactions are transformed into meaningful insights.

When creating a memory resource, you can specify an event expiry duration (up to 365 days) to control how long raw conversation data is retained in short-term memory. Data within AgentCore Memory is encrypted both at rest and in transit. By default, AWS managed keys are used for this encryption, but you can choose to enable encryption with your own customer managed KMS keys for greater control.

2. Short-term memory

Short-term memory captures raw interaction data as immutable events, organized by actor and session. This organization supports structured storage of conversations between users and agents, system events, state changes, and other interaction data. It takes in events and stores them synchronously in the AgentCore Memory resource. These events can be either “Conversational” (USER/ASSISTANT/TOOL or other message types) or “blob” (contains binary content that can be used to store checkpoints or agent state). Out of the two event types, only the Conversational events are used for long-term memory extraction.

To create an event, you typically need three identifiers:

  1. memoryId: Automatically created and returned in the response when you create a new memory resource.
  2. actorId: Typically identifies entities in your system (users, agents, projects, or combinations).
  3. sessionId: Groups related events together.

This hierarchical structure enables precise retrieval of relevant conversation context without loading unrelated data. Let’s explore how to create a memory resource for a customer support agent using Boto3 client:

# Creating a new memory resource
# (assumes agentcore_client is a Boto3 client configured for AgentCore Memory)
import time  # used below for the event timestamp
response = agentcore_client.create_memory(
    name="CustomerSupportMemory",
    description="Memory store for our customer support agent",
    eventExpiryDuration=30,  # Store raw events for 30 days
    encryptionKeyArn="arn:aws:kms:us-east-1:123456789012:key/abcd1234-...",  # Optional customer-managed KMS key
)

# Storing a user message as an event
response = agentcore_client.create_event(
    memoryId="mem-123abcd",
    actorId="customer-456",
    sessionId="session-789",
    eventTimestamp=int(time.time() * 1000),
    payload=[
        {
            "conversational": {
                "content": {"text": "I'm looking for a waterproof camera under $300"},
                "role": "USER"
            }
        }
    ]
)

# Retrieving recent conversation history
events = agentcore_client.list_events(
    memoryId="mem-123abcd",
    actorId="customer-456",
    sessionId="session-789",
    maxResults=10,
)

3. Long-term memory

Long-term memory contains extracted insights, preferences, and knowledge derived from raw events. Unlike short-term memory, which stores verbatim data, long-term memory captures meaningful information that persists across sessions—such as user preferences, conversation summaries, and key insights.

The extraction process happens asynchronously after events are created, using the memory strategies defined within your memory resource. This managed asynchronous process extracts and consolidates long term memory records for efficient retrieval.

Let’s explore how to create the long-term memory resource for the customer support agent we saw before:

# Creating a new memory resource with long term
response = agentcore_client.create_memory(
    name="CustomerSupportMemory",
    description="Memory store for our customer support agent",
    eventExpiryDuration=30,  # Store raw events for 30 days
    encryptionKeyArn="arn:aws:kms:us-east-1:123456789012:key/abcd1234-...",  # Optional customer-managed KMS key
    memoryStrategies=[{
        "userPreferenceMemoryStrategy": {
            "name": "UserPreferences",
            "namespaces": ["customer-support/{actorId}/preferences"]
        }
    }]
)

3.a Namespaces

Namespaces are a critical organizational concept within long-term memory that provide hierarchical structure within your memory resource. They function like file system paths, and you can use them to logically group and categorize memories. These are especially powerful in multi-tenant systems, be it multi-agent, multi-users, or both. Namespaces serve several important purposes:

  • Organizational structure: Separate different types of memories (preferences, summaries, entities) into distinct logical containers
  • Access control: Control which memories are accessible to different agents or in different contexts
  • Multi-tenant isolation: Segregate memories for different users or organizations with patterns like /org_id/user_id/preferences
  • Focused retrieval: Query specific types of memories without searching through unrelated information

For example, you might structure namespaces like:

  • /retail-agent/customer-123/preferences: For a specific customer’s preferences
  • /retail-agent/product-knowledge: For shared product information accessible to users
  • /support-agent/customer-123/case-summaries/session-001: For summaries of past support cases

The dynamic namespace creation above uses special placeholder variables in your namespace definitions:

  • {actorId}: Uses the actor identifier from the events being processed
  • {sessionId}: Uses the session identifier from the events
  • {strategyId}: Uses the strategy identifier for organization

This allows for elegant namespace structuring without hardcoding identifiers. When retrieving memories, you specify the exact namespace to search within, or a prefix match:

# Retrieving relevant memory records using semantic search
memories = agentcore_client.retrieve_memory_records(
    memoryId="mem-12345abcdef",
    namespace="customer-support/user-1/preferences",
    searchCriteria={
        "searchQuery": "Which camera should I buy?",
        "topK": 5
    }
)

3.b Memory strategies

Memory strategies define the intelligence layer that transforms raw events into meaningful long-term memories. They determine what information should be extracted, how it should be processed, and where the resulting memories should be stored. Each strategy is configured with a specific namespace where the extracted memories will be stored and consolidated, creating a clear organizational structure for different types of memories. By default, all strategies exclude personally identifiable information (PII) from long-term memory records. AgentCore Memory provides three built-in strategies:

  • Semantic Strategy: Stores facts and knowledge mentioned in the conversation for future reference. For example, “The customer’s company has 500 employees across 3 office locations in Seattle, Austin, and Boston.”
  • Summary Strategy: Stores a running summary of a conversation, capturing main points and decisions, scoped to a session. For example, “Customer inquired about enterprise pricing, discussed implementation timeline requirements, and requested a follow-up demo.”
  • User Preferences Strategy: Stores user preferences, choices, or styles. For example, “User prefers detailed technical explanations over high-level summaries”, “User prefers Python for development work”.

Here are some examples of built-in memory strategies that are defined at the time of creating an AgentCore Memory resource:

# Defining memory strategies (one entry per strategy, matching the pattern above)
strategies = [
    {
        "semanticMemoryStrategy": {
            "name": "semantic-facts",
            "namespaces": ["/customer/{actorId}/facts"],
        }
    },
    {
        "summaryMemoryStrategy": {
            "name": "conversation-summary",
            "namespaces": ["/customer/{actorId}/{sessionId}/summary"],
        }
    },
    {
        "userPreferenceMemoryStrategy": {
            "name": "user-preferences",
            "namespaces": ["/customer/{actorId}/preferences"],
        }
    },
]

To allow flexibility, Bedrock AgentCore also offers custom memory strategies that let you choose a specific LLM and override the extraction and consolidation prompts for your specific domain or use case. For example, you might want to append to the semantic memory prompt so that it only extracts specific types of facts or memories.

Now that we understand the key components, here’s what the overall AgentCore Memory architecture looks like:

These components work together seamlessly to provide a comprehensive memory system for your AI agents, enabling them to maintain immediate context while building meaningful long-term understanding of user interactions and preferences.

Advanced features

Beyond the core memory capabilities, Amazon Bedrock AgentCore Memory offers advanced features that enable complex agent workflows and enhanced user experiences.

Branching

Branching allows agents to create alternative conversation paths from a specific point in the event history. Think of branching like creating a fork in the road—the conversation can continue down multiple different paths from the same starting point. This powerful feature supports several advanced use cases:

  • Message editing: When users edit their previous messages, branching preserves both the original conversation flow and the new direction. The agent maintains coherent context even when earlier inputs change.
  • What-if scenarios: Agents can explore hypothetical paths in decision-making processes without losing the main conversation thread. For example, a financial advisor agent could explore different investment strategies while keeping the original consultation intact.
  • Alternative approaches: For complex problem-solving, agents can maintain multiple solution approaches simultaneously, allowing users to compare different options side-by-side.

Branching works by creating a new named branch within the existing memory resource, using the same actor_id and session_id. When creating a branched event, you specify a branch name and the rootEventId from which the branch originates. This allows for alternative conversation paths without the need for new actor or session identifiers. For example:

{
  "memoryId": "mem-12345abcdef",
  "actorId": "/agent-support-123/customer-456",
  "sessionId": "session-789",
  "eventTimestamp": 1718806000000,
  "payload": [
    {
      "conversational": {
        "content": {"text": "I'm looking for a waterproof action camera for extreme sports."},
        "role": "USER"
      }
    }
  ],
  "branch": {
    "name": "edited-conversation",
    "rootEventId": "evt-67890"
  }
}

This approach allows the agent to manage multiple conversation paths within the same memory resource, providing powerful conversation management capabilities.

Checkpointing

With checkpointing, agents can save and mark specific states in the conversation, creating reference points that can be returned to later. This is like saving your progress in a complex game or application. This feature is particularly valuable for:

  • Multi-session tasks: Break complex tasks across multiple sessions while preserving context. Users can return days or weeks later and the agent can resume exactly where they left off.
  • Workflow resumption: With workflow resumption, users can pause complex processes (like mortgage applications or travel planning) and resume them seamlessly without starting over or repeating information.
  • Conversation bookmarks: Mark important decision points that might need to be referenced later, such as when a user selects specific preferences or makes key decisions.

Checkpoints can be implemented as raw events stored under a separate isolation boundary (a dedicated actor and session), which can later be retrieved through the GetEvent API. The blob payload type can be used to ingest data in various formats that doesn’t have to be conversational. Note that these events are ignored for long-term memory extraction.

These advanced features extend the capabilities of AgentCore Memory beyond simple context retention, enabling sophisticated agent experiences that better approximate human-like memory and conversation management. By incorporating branching and checkpointing into your agent design, you can create more natural, flexible, and personalized user interactions.
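
To make the checkpointing pattern concrete, the following is a minimal, hedged sketch; the blob payload shape, the create_event response fields, and the get_event call are assumptions based on the description above rather than verified API signatures, and the identifiers are illustrative:

# Hypothetical checkpoint sketch: payload shape, response fields, and get_event
# usage are assumptions based on the description above, not verified APIs.
import json
import time

checkpoint = {"workflow": "mortgage-application", "step": "income-verification", "status": "paused"}

# Store the checkpoint as a blob event under a dedicated actor/session so it is
# isolated from conversational history and ignored by long-term extraction
response = agentcore_client.create_event(
    memoryId="mem-123abcd",
    actorId="checkpoints/customer-456",
    sessionId="mortgage-application-1",
    eventTimestamp=int(time.time() * 1000),
    payload=[{"blob": json.dumps(checkpoint)}],
)

# Later, retrieve the checkpoint by its event ID to resume the workflow
saved = agentcore_client.get_event(
    memoryId="mem-123abcd",
    actorId="checkpoints/customer-456",
    sessionId="mortgage-application-1",
    eventId=response["event"]["eventId"],
)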

Best practices

Optimizing memory across your agentic system should work backwards from each agent’s core objectives; how and when an agent remembers something should be guided by those objectives.

1. Structured memory architecture

Design your memory architecture intentionally by implementing distinct memory types for different needs. Use short-term memory for immediate conversational context and long-term memory for persistent knowledge and user preferences. Organize memories using hierarchical namespaces (for example, /org_id/user_id/preferences) for precise memory isolation and retrieval. Consider appropriate time-to-live (TTL) settings based on your application’s requirements and data privacy policies. For instance, support chat histories might be retained for 30 days, while persistent customer preferences might be kept for much longer.

2. Memory strategies

The effectiveness of long-term memory depends greatly on your memory strategies. Use built-in strategies for common needs like user preference extraction and conversation summarization, but don’t hesitate to build on top of these by using custom strategies for your specific use cases. For defining custom memory strategies, focus on extracting only the relevant information that directly supports your agent’s objectives. For example, a travel booking agent should prioritize extracting travel preferences, important dates, and budget constraints.

3. Efficient memory operations

Implement a rhythm of memory operations that balances performance with contextual awareness:

  • Retrieve relevant memories from within each user interaction for context hydration
  • Use targeted retrieval methods (list events for recent raw context, summaries for session context, semantic search for related long term memory records)
  • Store new interactions promptly using the CreateEvent API to maintain accurate history

Be aware that long-term memory extraction and consolidation is an asynchronous process, which leads to refresh delays when ingesting new information into long-term memory. For time-sensitive applications, plan accordingly by implementing appropriate caching, session handling, and combined context hydration techniques.
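
The following is a minimal sketch of combined context hydration under these constraints; the prompt assembly and parameter values are illustrative, and the calls mirror the list_events and retrieve_memory_records examples shown earlier:

# Illustrative context hydration: combine recent raw events (available immediately)
# with long-term records (extracted asynchronously, so they may lag new information)
recent_events = agentcore_client.list_events(
    memoryId="mem-123abcd",
    actorId="customer-456",
    sessionId="session-789",
    maxResults=10,
)

long_term = agentcore_client.retrieve_memory_records(
    memoryId="mem-123abcd",
    namespace="customer-support/customer-456/preferences",
    searchCriteria={"searchQuery": "camera preferences", "topK": 3},
)

# Merge both sources into the agent's prompt context before invoking the model
context = {"recent_turns": recent_events, "preferences": long_term}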

4. Security and privacy considerations

Memory often contains sensitive information, so implement proper security measures:

  • Use actors and namespaces properly to organize data
  • Use AWS Identity and Access Management (IAM)-based access controls to implement least-privileged controls for memory resource access
  • Consider privacy implications when storing personal information, and follow relevant compliance requirements for data retention
  • Use customer managed KMS keys for encryption of highly sensitive data
  • Implement guardrails to prevent prompt injection and memory poisoning

5. Observability

Maintain visibility into your memory systems by using built-in event tracking and logging for memory operations. Monitor memory extraction patterns to verify they’re capturing the right information and track the effectiveness of your memory strategies. Periodically review and adjust your memory architecture based on agent performance metrics and evolving use case requirements. The key is to balance comprehensive memory retention with efficient resource utilization while maintaining focus on your agent’s core objectives.

Conclusion

Amazon Bedrock AgentCore Memory provides a comprehensive solution to one of the most challenging aspects of building effective AI agents—maintaining context and learning from interactions. By combining flexible short-term event storage with intelligent long-term memory extraction using AgentCore Memory, you can create more personalized, contextual, and helpful AI experiences without managing complex memory infrastructure. The service’s hierarchical namespaces, customizable memory strategies, and advanced features provide the foundations for sophisticated agent behaviors that feel more natural and human-like.

To get started on AgentCore Memory, visit the following resources:


About the authors

Akarsha Sehwag is a WW Generative AI Data Scientist for Amazon Bedrock – Agents GTM team. With over six years of expertise in AI/ML product development, she has built enterprise solutions across diverse customer segments.

Dani Mitchell is a Generative AI Specialist Solutions Architect at Amazon Web Services (AWS). He is focused on helping accelerate enterprises across the world on their generative AI journeys with Amazon Bedrock.

Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such as AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.

Gopikrishnan Anilkumar is a Principal Technical Product Manager in Amazon. He has over 10 years of product management experience across a variety of domains and is passionate about AI/ML.

Noor Randhawa is a Software Engineering Lead on AgentCore Memory at Amazon Web Services (AWS).

Read More