Amazon AWS – Page 5

Use Amazon SageMaker Unified Studio to build complex AI workflows using Amazon Bedrock Flows

July 1, 2025

by Sumeet Tripathi Amazon AWS

Organizations face the challenge to manage data, multiple artificial intelligence and machine learning (AI/ML) tools, and workflows across different environments, impacting productivity and governance. A unified development environment consolidates data processing, model development, and AI application deployment into a single system. This integration streamlines workflows, enhances collaboration, and accelerates AI solution development from concept to production.

The next generation of Amazon SageMaker is the center for your data, analytics, and AI. SageMaker brings together AWS AI/ML and analytics capabilities and delivers an integrated experience for analytics and AI with unified access to data. Amazon SageMaker Unified Studio is a single data and AI development environment where you can find and access your data and act on it using AWS analytics and AI/ML services, for SQL analytics, data processing, model development, and generative AI application development.

With SageMaker Unified Studio, you can efficiently build generative AI applications in a trusted and secure environment using Amazon Bedrock. You can choose from a selection of high-performing foundation models (FMs) and advanced customization and tooling such as Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, Amazon Bedrock Agents, and Amazon Bedrock Flows. You can rapidly tailor and deploy generative AI applications, and share with the built-in catalog for discovery.

In this post, we demonstrate how you can use SageMaker Unified Studio to create complex AI workflows using Amazon Bedrock Flows.

Solution overview

Consider FinAssist Corp, a leading financial institution developing a generative AI-powered agent support application. The solution offers the following key features:

Complaint reference system – An AI-powered system providing quick access to historical complaint data, enabling customer service representatives to efficiently handle customer follow-ups, support internal audits, and aid in training new staff.
Intelligent knowledge base – A comprehensive data source of resolved complaints that quickly retrieves relevant complaint details, resolution actions, and outcome summaries.
Streamlined workflow management – Enhanced consistency in customer communications through standardized access to past case information, supporting compliance checks and process improvement initiatives.
Flexible query capability – A straightforward interface supporting various query scenarios, from customer inquiries about past resolutions to internal reviews of complaint handling procedures.

Let’s explore how SageMaker Unified Studio and Amazon Bedrock Flows, integrated with Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents, address these challenges by creating an AI-powered complaint reference system. The following diagram illustrates the solution architecture.

The solution uses the following key components:

SageMaker Unified Studio – Provides the development environment
Flow app – Orchestrates the workflow, including:
- Knowledge base queries
- Prompt-based classification
- Conditional routing
- Agent-based response generation

The workflow processes user queries through the following steps:

A user submits a complaint-related question.
The knowledge base provides relevant complaint information.
The prompt classifies if the query is about resolution timing.
Based on the classification using the condition, the application takes the following action:
1. Routes the query to an AI agent for specific resolution responses.
2. Returns general complaint information.
The application generates an appropriate response for the user.

Prerequisites

For this example, you need the following:

Access to SageMaker Unified Studio. (You will need the SageMaker Unified Studio portal URL from your administrator). You can authenticate using either:
- AWS Identity and Access Management (IAM) user credentials.
- Single sign-on (SSO) credentials with AWS IAM Identity Center.
The IAM user or IAM Identity Center user must have appropriate permissions for:
- SageMaker Unified Studio.
- Amazon Bedrock (including Amazon Bedrock Flows, Amazon Bedrock Agents, Amazon Bedrock Prompt Management, and Amazon Bedrock Knowledge Bases).
- For more information, refer to Identity-based policy examples.
Access to Amazon Bedrock FMs (make sure these are enabled for your account), for example:Anthropic’s Claude 3 Haiku (for the agent).
Configure access to your Amazon Bedrock serverless models for Amazon Bedrock in SageMaker Unified Studio projects.
Amazon Titan Embedding (for the knowledge base).
Sample complaint data prepared in CSV format for creating the knowledge base.

Prepare your data

We have created a sample dataset to use for Amazon Bedrock Knowledge Bases. This dataset has information of complaints received by customer service representatives and resolution information.The following is an example from the sample dataset:

complaint_id,product,sub_product,issue,sub_issue,complaint_summary,action_taken,next_steps,financial_institution,state,submitted_via,resolution_type,timely_response
FIN-2024-001,04/26/24,"Mortgage","Conventional mortgage","Payment issue","Escrow dispute","Customer disputes mortgage payment increase after recent escrow analysis","Reviewed escrow analysis, explained property tax increase impact, provided detailed payment breakdown","1. Send written explanation of escrow analysis 2. Schedule annual escrow review 3. Provide payment assistance options","Financial Institution-1","TX","Web","Closed with explanation","Yes"
FIN-2024-002,04/26/24,"Money transfer","Wire transfer","Processing delay","International transfer","Wire transfer of $10,000 delayed, customer concerned about international payment deadline","Located wire transfer in system, expedited processing, waived wire fee","1. Confirm receipt with receiving bank 2. Update customer on delivery 3. Document process improvement needs","Financial Institution-2","FL","Phone","Closed with monetary relief","No"

Create a project

In SageMaker Unified Studio, users can use projects to collaborate on various business use cases. Within projects, you can manage data assets in the SageMaker Unified Studio catalog, perform data analysis, organize workflows, develop ML models, build generative AI applications, and more.

To create a project, complete the following steps:

Open the SageMaker Unified Studio landing page using the URL from your admin.
Choose Create project.
Enter a project name and optional description.
For Project profile, choose Generative AI application development.
Choose Continue.

Complete your project configuration, then choose Create project.

Create a prompt

Let’s create a reusable prompt to capture the instructions for FMs, which we will use later while creating the flow application. For more information, see Reuse and share Amazon Bedrock prompts.

In SageMaker Unified Studio, on the Build menu, choose Prompt under Machine Learning & Generative AI.

Provide a name for the prompt.
Choose the appropriate FM (for this example, we choose Claude 3 Haiku).
For Prompt message, we enter the following:

You are a complaint analysis classifier. You will receive complaint data from a knowledge base. Analyze the {{input}} and respond with a single letter:
T: If the input contains information about complaint resolution timing, response time, or processing timeline (whether timely or delayed)
F: For all other types of complaint information
Return only 'T' or 'F' based on whether the knowledge base response is about resolution timing. Do not add any additional text or explanation - respond with just the single letter 'T' or 'F'.

Choose Save.

Choose Create version.

Create a chat agent

Let’s create a chat agent to handle specific resolution responses. Complete the following steps:

In SageMaker Unified Studio, on the Build menu, choose Chat agent under Machine Learning & Generative AI.
Provide a name for the prompt.
Choose the appropriate FM (for this example, we choose Claude 3 Haiku).
For Enter a system prompt, we enter the following:

You are a Financial Complaints Assistant AI. You will receive complaint information from a knowledge base and questions about resolution timing.
When responding to resolution timing queries:
1. Use the provided complaint information to confirm if it was resolved within timeline
2. For timely resolutions, provide:
   - Confirmation of timely completion
   - Specific actions taken (from the provided complaint data)
   - Next steps that were completed
2. For delayed resolutions, provide:
   - Acknowledgment of delay
   - Standard compensation package:
     • $75 service credit
     • Priority Status upgrade for 6 months
     • Service fees waived for current billing cycle
   - Actions taken (from the provided complaint data)
   - Contact information for follow-up: Priority Line: ************** 
Always reference the specific complaint details provided in your input when discussing actions taken and resolution process.

Choose Save.

After the agent is saved, choose Deploy.
For Alias name, enter demoAlias.
Choose Deploy.

Create a flow

Now that we have our prompt and agent ready, let’s create a flow that will orchestrate the complaint handling process:

In SageMaker Unified Studio, on the Build menu, choose Flow under Machine Learning & Generative AI.

Create a new flow called demo-flow.

Add a knowledge base to your flow application

Complete the following steps to add a knowledge base node to the flow:

In the navigation pane, on the Nodes tab, choose Knowledge Base.
On the Configure tab, provide the following information:
1. For Node name, enter a name (for example, complaints_kb).
2. Choose Create new Knowledge Base.
In the Create Knowledge Base pane, enter the following information:
1. For Name, enter a name (for example, complaints).
2. For Description, enter a description (for example, user complaints information).
3. For Add data sources, select Local file and upload the complaints.txt file.
4. For Embeddings model, choose Titan Text Embeddings V2.
5. For Vector store, choose OpenSearch Serverless.
6. Choose Create.

After you create the knowledge base, choose it in the flow.
In the details name, provide the following information:
For Response generation model, choose Claude 3 Haiku.
Connect the output of the flow input node with the input of the knowledge base node.
Connect the output of the knowledge base node with the input of the flow output node.

Choose Save.

Add a prompt to your flow application

Now let’s add the prompt you created earlier to the flow:

On the Nodes tab in the Flow app builder pane, add a prompt node.
On the Configure tab for the prompt node, provide the following information:
For Node name, enter a name (for example, demo_prompt).
For Prompt, choose financeAssistantPrompt.
For Version, choose 1.
Connect the output of the knowledge base node with the input of the prompt node.
Choose Save.

Add a condition to your flow application

The condition node determines how the flow handles different types of queries. It evaluates whether a query is about resolution timing or general complaint information, enabling the flow to route the query appropriately. When a query is about resolution timing, it will be directed to the chat agent for specialized handling; otherwise, it will receive a direct response from the knowledge base. Complete the following steps to add a condition:

On the Nodes tab in the Flow app builder pane, add a condition node.
On the Configure tab for the condition node, provide the following information:
1. For Node name, enter a name (for example, demo_condition).
2. Under Conditions, for Condition, enter conditionInput == "T".
3. Connect the output of the prompt node with the input of the condition node.
Choose Save.

Add a chat agent to your flow application

Now let’s add the chat agent you created earlier to the flow:

On the Nodes tab in the Flow app builder pane, add the agent node.
On the Configure tab for the agent node, provide the following information:
1. For Node name, enter a name (for example, demo_agent).
2. For Chat agent, choose DemoAgent.
3. For Alias, choose demoAlias.
Create the following node connections:
1. Connect the input of the condition node (demo_condition) to the output of the prompt node (demo_prompt).
2. Connect the output of the condition node:
  1. Set If condition is true to the agent node (demo_agent).
  2. Set If condition is false to the existing flow output node (FlowOutputNode).
3. Connect the output of the knowledge base node (complaints_kb) to the input of the following:
  1. The agent node (demo_agent).
  2. The flow output node (FlowOutputNode).
4. Connect the output of the agent node (demo_agent) to a new flow output node named FlowOutputNode_2.
Choose Save.

Test the flow application

Now that the flow application is ready, let’s test it. On the right side of the page, choose the expand icon to open the Test pane.

In the Enter prompt text box, we can ask a few questions related to the dataset created earlier. The following screenshots show some examples.

Clean up

To clean up your resources, delete the flow, agent, prompt, knowledge base, and associated OpenSearch Serverless resources.

Conclusion

In this post, we demonstrated how to build an AI-powered complaint reference system using a flow application in SageMaker Unified Studio. By using the integrated capabilities of SageMaker Unified Studio with Amazon Bedrock features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Agents, and Amazon Bedrock Flows, you can rapidly develop and deploy sophisticated AI applications without extensive coding.

As you build AI workflows using SageMaker Unified Studio, remember to adhere to the AWS Shared Responsibility Model for security. Implement SageMaker Unified Studio security best practices, including proper IAM configurations and data encryption. You can also refer to Secure a generative AI assistant with OWASP Top 10 mitigation for details on how to assess the security posture of a generative AI assistant using OWASP TOP 10 mitigations for common threats. Following these guidelines helps establish robust AI applications that maintain data integrity and system protection.

To learn more, refer to Amazon Bedrock in SageMaker Unified Studio and join discussions and share your experiences in AWS Generative AI Community.

We look forward to seeing the innovative solutions you will create with these powerful new features.

About the authors

Sumeet Tripathi is an Enterprise Support Lead (TAM) at AWS in North Carolina. He has over 17 years of experience in technology across various roles. He is passionate about helping customers to reduce operational challenges and friction. His focus area is AI/ML and Energy & Utilities Segment. Outside work, He enjoys traveling with family, watching cricket and movies.

Vishal Naik is a Sr. Solutions Architect at Amazon Web Services (AWS). He is a builder who enjoys helping customers accomplish their business needs and solve complex challenges with AWS solutions and best practices. His core area of focus includes Generative AI and Machine Learning. In his spare time, Vishal loves making short films on time travel and alternate universe themes.

Accelerating AI innovation: Scale MCP servers for enterprise workloads with Amazon Bedrock

July 1, 2025

by Xan Huang Amazon AWS

Generative AI has been moving at a rapid pace, with new tools, offerings, and models released frequently. According to Gartner, agentic AI is one of the top technology trends of 2025, and organizations are performing prototypes on how to use agents in their enterprise environment. Agents depend on tools, and each tool might have its own mechanism to send and receive information. Model Context Protocol (MCP) by Anthropic is an open source protocol that attempts to solve this challenge. It provides a protocol and communication standard that is cross-compatible with different tools, and can be used by an agentic application’s large language model (LLM) to connect to enterprise APIs or external tools using a standard mechanism. However, large enterprise organizations like financial services tend to have complex data governance and operating models, which makes it challenging to implement agents working with MCP.

One major challenge is the siloed approach in which individual teams build their own tools, leading to duplication of efforts and wasted resources. This approach slows down innovation and creates inconsistencies in integrations and enterprise design. Furthermore, managing multiple disconnected MCP tools across teams makes it difficult to scale AI initiatives effectively. These inefficiencies hinder enterprises from fully taking advantage of generative AI for tasks like post-trade processing, customer service automation, and regulatory compliance.

In this post, we present a centralized MCP server implementation using Amazon Bedrock that offers an innovative approach by providing shared access to tools and resources. With this approach, teams can focus on building AI capabilities rather than spending time developing or maintaining tools. By standardizing access to resources and tools through MCP, organizations can accelerate the development of AI agents, so teams can reach production faster. Additionally, a centralized approach provides consistency and standardization and reduces operational overhead, because the tools are managed by a dedicated team rather than across individual teams. It also enables centralized governance that enforces controlled access to MCP servers, which reduces the risk of data exfiltration and prevents unauthorized or insecure tool use across the organization.

Solution overview

The following figure illustrates a proposed solution based on a financial services use case that uses MCP servers across multiple lines of business (LoBs), such as compliance, trading, operations, and risk management. Each LoB performs distinct functions tailored to their specific business. For instance, the trading LoB focuses on trade execution, whereas the risk LoB performs risk limit checks. For performing these functions, each division provides a set of MCP servers that facilitate actions and access to relevant data within their LoBs. These servers are accessible to agents developed within the respective LoBs and can also be exposed to agents outside LoBs.

The development of MCP servers is decentralized. Each LoB is responsible for developing the servers that support their specific functions. When the development of a server is complete, it’s hosted centrally and accessible across LoBs. It takes the form of a registry or marketplace that facilitates integration of AI-driven solutions across divisions while maintaining control and governance over shared resources.

In the following sections, we explore what the solution looks like on a conceptual level.

Agentic application interaction with a central MCP server hub

The following flow diagram showcases how an agentic application built using Amazon Bedrock interacts with one of the MCP servers located in the MCP server hub.

The flow consists of the following steps:

The application connects to the central MCP hub through the load balancer and requests a list of available tools from the specific MCP server. This can be fine-grained based on what servers the agentic application has access to.
The trade server responds with list of tools available, including details such as tool name, description, and required input parameters.
The agentic application invokes an Amazon Bedrock agent and provides the list of tools available.
Using this information, the agent determines what to do next based on the given task and the list of tools available to it.
The agent chooses the most suitable tool and responds with the tool name and input parameters. The control comes back to the agentic application.
The agentic application calls for the execution of the tool through the MCP server using the tool name and input parameters.
The trade MCP server executes the tool and returns the results of the execution back to the application.
The application returns the results of the tool execution back to the Amazon Bedrock agent.
The agent observes the tool execution results and determines the next step.

Let’s dive into the technical architecture of the solution.

Architecture overview

The following diagram illustrates the architecture to host the centralized cluster of MCP servers for an LoB.

The architecture can be split in five sections:

MCP server discovery API
Agentic applications
Central MCP server hub
Tools and resources

Let’s explore each section in detail:

MCP server discovery API – This API is a dedicated endpoint for discovering various MCP servers. Different teams can call this API to find what MCP servers are available in the registry; read their description, tool, and resource details; and decide which MCP server would be the right one for their agentic application. When a new MCP server is published, it’s added to an Amazon DynamoDB database. MCP server owners are responsible for keeping the registry information up-to-date.
Agentic application – The agentic applications are hosted on AWS Fargate for Amazon Elastic Container Service (Amazon ECS) and built using Amazon Bedrock Agents. Teams can also use the newly released open source AWS Strands Agents SDK, or other agentic frameworks of choice, to build the agentic application and their own containerized solution to host the agentic application. The agentic applications access Amazon Bedrock through a secure private virtual private cloud (VPC) endpoint. It uses private VPC endpoints to access MCP servers.
Central MCP server hub – This is where the MCP servers are hosted. Access to servers is enabled through an AWS Network Load Balancer. Technically, each server is a Docker container that can is hosted on Amazon ECS, but you can choose your own container deployment solution. These servers can scale individually without impacting the other server. These servers in turn connect to one or more tools using private VPC endpoints.
Tools and resources – This component holds the tools, such as databases, another application, Amazon Simple Storage Service (Amazon S3), or other tools. For enterprises, access to the tools and resources is provided only through private VPC endpoints.

Benefits of the solution

The solution offers the following key benefits:

Scalability and resilience – Because you’re using Amazon ECS on Fargate, you get scalability out of the box without managing infrastructure and handling scaling concerns. Amazon ECS automatically detects and recovers from failures by restarting failed MCP server tasks locally or reprovisioning containers, minimizing downtime. It can also redirect traffic away from unhealthy Availability Zones and rebalance tasks across healthy Availability Zones to provide uninterrupted access to the server.
Security – Access to MCP servers is secured at the network level through network controls such as PrivateLink. This makes sure the agentic application only connects to trusted MCP servers hosted by the organization, and vice versa. Each Fargate workload runs in an isolated environment. This prevents resource sharing between tasks. For application authentication and authorization, we propose using an MCP Auth Server (refer to the following GitHub repo) to hand off those tasks to a dedicated component that can scale independently.

At the time of writing, the MCP protocol doesn’t provide built-in mechanisms for user-level access control or authorization. Organizations requiring user-specific access restrictions must implement additional security layers on top of the MCP protocol. For a reference implementation, refer to the following GitHub repo.

Let’s dive deeper in the implementation of this solution.

Use case

The implementation is based on a financial services use case featuring post-trade execution. Post-trade execution refers to the processes and steps that take place after an equity buy/sell order has been placed by a customer. It involves many steps, including verifying trade details, actual transfer of assets, providing a detailed report of the execution, running fraudulent checks, and more. For simplification of the demo, we focus on the order execution step.

Although this use case is tailored to the financial industry, you can apply the architecture and the approach to other enterprise workloads as well. The entire code of this implementation is available on GitHub. We use the AWS Cloud Development Kit (AWS CDK) for Python to deploy this solution, which creates an agentic application connected to tools through the MCP server. It also creates a Streamlit UI to interact with the agentic application.

The following code snippet provides access to the MCP discovery API:

def get_server_registry():
    # Initialize DynamoDB client
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(DDBTBL_MCP_SERVER_REGISTRY)
    
    try:
        # Scan the table to get all items
        response = table.scan()
        items = response.get('Items', [])
        
        # Format the items to include only id, description, server
        formatted_items = []
        for item in items:
            formatted_item = {
                'id': item.get('id', ''),
                'description': item.get('description', ''),
                'server': item.get('server', ''),
            }
            formatted_items.append(formatted_item)
        
        # Return the formatted items as JSON
        return {
            'statusCode': 200,
            'headers': cors_headers,
            'body': json.dumps(formatted_items)
        }
    except Exception as e:
        # Handle any errors
        return {
            'statusCode': 500,
            'headers': cors_headers,
            'body': json.dumps({'error': str(e)})
        }

The preceding code is invoked through an AWS Lambda function. The complete code is available in the GitHub repository. The following graphic shows the response of the discovery API.

Let’s explore a scenario where the user submits a question: “Buy 100 shares of AMZN at USD 186, to be distributed equally between accounts A31 and B12.”To execute this task, the agentic application invokes the trade-execution MCP server. The following code is the sample implementation of the MCP server for trade execution:

from fastmcp import FastMCP
from starlette.requests import Request
from starlette.responses import PlainTextResponse
mcp = FastMCP("server")

@mcp.custom_route("/", methods=["GET"])
async def health_check(request: Request) -> PlainTextResponse:
    return PlainTextResponse("OK")

@mcp.tool()
async def executeTrade(ticker, quantity, price):
    """
    Execute a trade for the given ticker, quantity, and price.
    
    Sample input:
    {
        "ticker": "AMZN",
        "quantity": 1000,
        "price": 150.25
    }
    """
    # Simulate trade execution
    return {
        "tradeId": "T12345",
        "status": "Executed",
        "timestamp": "2025-04-09T22:58:00"
    }
    
@mcp.tool()
async def sendTradeDetails(tradeId):
    """
    Send trade details for the given tradeId.
    Sample input:
    {
        "tradeId": "T12345"
    }
    """
    return {
        "status": "Details Sent",
        "recipientSystem": "MiddleOffice",
        "timestamp": "2025-04-09T22:59:00"
    }
if __name__ == "__main__":
    mcp.run(host="0.0.0.0", transport="streamable-http")

The complete code is available in the following GitHub repo.

The following graphic shows the MCP server execution in action.

This is a sample implementation of the use case focusing on the deployment step. For a production scenario, we strongly recommend adding a human oversight workflow to monitor the execution and provide input at various steps of the trade execution.

Now you’re ready to deploy this solution.

Prerequisites

Prerequisites for the solution are available in the README.md of the GitHub repository.

Deploy the application

Complete the following steps to run this solution:

Navigate to the README.md file of the GitHub repository to find the instructions to deploy the solution. Follow these steps to complete deployment.

The successful deployment will exit with a message similar to the one shown in the following screenshot.

When the deployment is complete, access the Streamlit application.

You can find the Streamlit URL in the terminal output, similar to the following screenshot.

Enter the URL of the Streamlit application in a browser to open the application console.

On the application console, different sets of MCP servers are listed in the left pane under MCP Server Registry. Each set corresponds to an MCP server and includes the definition of the tools, such as the name, description, and input parameters.

In the right pane, Agentic App, a request is pre-populated: “Buy 100 shares of AMZN at USD 186, to be distributed equally between accounts A31 and B12.” This request is ready to be submitted to the agent for execution.

Choose Submit to invoke an Amazon Bedrock agent to process the request.

The agentic application will evaluate the request together with the list of tools it has access to, and iterate through a series of tools execution and evaluation to fulfil the request.You can view the trace output to see the tools that the agent used. For each tool used, you can see the values of the input parameters, followed by the corresponding results. In this case, the agent operated as follows:

The agent first used the function executeTrade with input parameters of ticker=AMZN, quantity=100, and price=186
After the trade was executed, used the allocateTrade tool to allocate the trade position between two portfolio accounts

Clean up

You will incur charges when you consume the services used in this solution. Instructions to clean up the resources are available in the README.md of the GitHub repository.

Summary

This solution offers a straightforward and enterprise-ready approach to implement MCP servers on AWS. With this centralized operating model, teams can focus on building their applications rather than maintaining the MCP servers. As enterprises continue to embrace agentic workflows, centralized MCP servers offer a practical solution for overcoming operational silos and inefficiencies. With the AWS scalable infrastructure and advanced tools like Amazon Bedrock Agents and Amazon ECS, enterprises can accelerate their journey toward smarter workflows and better customer outcomes.

Check out the GitHub repository to replicate the solution in your own AWS environment.

To learn more about how to run MCP servers on AWS, refer to the following resources:

About the authors

Xan Huang is a Senior Solutions Architect with AWS and is based in Singapore. He works with major financial institutions to design and build secure, scalable, and highly available solutions in the cloud. Outside of work, Xan dedicates most of his free time to his family, where he lovingly takes direction from his two young daughters, aged one and four. You can find Xan on LinkedIn: https://www.linkedin.com/in/xanhuang/

Vikesh Pandey is a Principal GenAI/ML Specialist Solutions Architect at AWS helping large financial institutions adopt and scale generative AI and ML workloads. He is the author of book “Generative AI for financial services.” He carries more than decade of experience building enterprise-grade applications on generative AI/ML and related technologies. In his spare time, he plays an unnamed sport with his son that lies somewhere between football and rugby.

Choosing the right approach for generative AI-powered structured data retrieval

July 1, 2025

by Akshara Shah Amazon AWS

Organizations want direct answers to their business questions without the complexity of writing SQL queries or navigating through business intelligence (BI) dashboards to extract data from structured data stores. Examples of structured data include tables, databases, and data warehouses that conform to a predefined schema. Large language model (LLM)-powered natural language query systems transform how we interact with data, so you can ask questions like “Which region has the highest revenue?” and receive immediate, insightful responses. Implementing these capabilities requires careful consideration of your specific needs—whether you need to integrate knowledge from other systems (for example, unstructured sources like documents), serve internal or external users, handle the analytical complexity of questions, or customize responses for business appropriateness, among other factors.

In this post, we discuss LLM-powered structured data query patterns in AWS. We provide a decision framework to help you select the best pattern for your specific use case.

Business challenge: Making structured data accessible

Organizations have vast amounts of structured data but struggle to make it effectively accessible to non-technical users for several reasons:

Business users lack the technical knowledge (like SQL) needed to query data
Employees rely on BI teams or data scientists for analysis, limiting self-service capabilities
Gaining insights often involves time delays that impact decision-making
Predefined dashboards constrain spontaneous exploration of data
Users might not know what questions are possible or where relevant data resides

Solution overview

An effective solution should provide the following:

A conversational interface that allows employees to query structured data sources without technical expertise
The ability to ask questions in everyday language and receive accurate, trustworthy answers
Automatic generation of visualizations and explanations to clearly communicate insights.
Integration of information from different data sources (both structured and unstructured) presented in a unified manner
Ease of integration with existing investments and rapid deployment capabilities
Access restriction based on identities, roles, and permissions

In the following sections, we explore five patterns that can address these needs, highlighting the architecture, ideal use cases, benefits, considerations, and implementation resources for each approach.

Pattern 1: Direct conversational interface using an enterprise assistant

This pattern uses Amazon Q Business, a generative AI-powered assistant, to provide a chat interface on data sources with native connectors. When users ask questions in natural language, Amazon Q Business connects to the data source, interprets the question, and retrieves relevant information without requiring intermediate services. The following diagram illustrates this workflow.

This approach is ideal for internal enterprise assistants that need to answer business user-facing questions from both structured and unstructured data sources in a unified experience. For example, HR personnel can ask “What’s our parental leave policy and how many employees used it last quarter?” and receive answers drawn from both leave policy documentation and employee databases together in one interaction. With this pattern, you can benefit from the following:

Simplified connectivity through the extensive Amazon Q Business library of built-in connectors
Streamlined implementation with a single service to configure and manage
Unified search experience for accessing both structured and unstructured information
Built-in understanding and respect existing identities, roles, and permissions

You can define the scope of data to be pulled in the form of a SQL query. Amazon Q Business pre-indexes database content based on defined SQL queries and uses this index when responding to user questions. Similarly, you can define the sync mode and schedule to determine how often you want to update your index. Amazon Q Business does the heavy lifting of indexing the data using a Retrieval Augmented Generation (RAG) approach and using an LLM to generate well-written answers. For more details on how to set up Amazon Q Business with an Amazon Aurora PostgreSQL-Compatible Edition connector, see Discover insights from your Amazon Aurora PostgreSQL database using the Amazon Q Business connector. You can also refer to the complete list of supported data source connectors.

Pattern 2: Enhancing BI tool with natural language querying capabilities

This pattern uses Amazon Q in QuickSight to process natural language queries against datasets that have been previously configured in Amazon QuickSight. Users can ask questions in everyday language within the QuickSight interface and get visualized answers without writing SQL. This approach works with QuickSight (Enterprise or Q edition) and supports various data sources, including Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon Athena, and others. The architecture is depicted in the following diagram.

This pattern is well-suited for internal BI and analytics use cases. Business analysts, executives, and other employees can ask ad-hoc questions to get immediate visualized insights in the form of dashboards. For example, executives can ask questions like “What were our top 5 regions by revenue last quarter?” and immediately see responsive charts, reducing dependency on analytics teams. The benefits of this pattern are as follows:

It enables natural language queries that produce rich visualizations and charts
No coding or machine learning (ML) experience is needed—the heavy lifting like natural language interpretation and SQL generation is managed by Amazon Q in QuickSight
It integrates seamlessly within the familiar QuickSight dashboard environment

Existing QuickSight users might find this the most straightforward way to take advantage of generative AI benefits. You can optimize this pattern for higher-quality results by configuring topics like curated fields, synonyms, and expected question phrasing. This pattern will pull data only from a specific configured data source in QuickSight to produce a dashboard as an output. For more details, check out QuickSight DemoCentral to view a demo in QuickSight, see the generative BI learning dashboard, and view guided instructions to create dashboards with Amazon Q. Also refer to the list of supported data sources.

Pattern 3: Combining BI visualization with conversational AI for a seamless experience

This pattern merges BI visualization capabilities with conversational AI to create a seamless knowledge experience. By integrating Amazon Q in QuickSight with Amazon Q Business (with the QuickSight plugin enabled), organizations can provide users with a unified conversational interface that draws on both unstructured and structured data. The following diagram illustrates the architecture.

This is ideal for enterprises that want an internal AI assistant to answer a variety of questions—whether it’s a metric from a database or knowledge from a document. For example, executives can ask “What was our Q4 revenue growth?” and see visualized results from data warehouses through Amazon Redshift through QuickSight, then immediately follow up with “What is our company vacation policy?” to access HR documentation—all within the same conversation flow. This pattern offers the following benefits:

It unifies answers from structured data (databases and warehouses) and unstructured data (documents, wikis, emails) in a single application
It delivers rich visualizations alongside conversational responses in a seamless experience with real-time analysis in chat
There is no duplication of work—if your BI team has already built datasets and topics in QuickSight for analytics, you use that in Amazon Q Business
It maintains conversational context when switching between data and document-based inquiries

For more details, see Query structured data from Amazon Q Business using Amazon QuickSight integration and Amazon Q Business now provides insights from your databases and data warehouses (preview).

Another variation of this pattern is recommended for BI users who want to expose unified data through rich visuals in QuickSight, as illustrated in the following diagram.

For more details, see Integrate unstructured data into Amazon QuickSight using Amazon Q Business.

Pattern 4: Building knowledge bases from structured data using managed text-to-SQL

This pattern uses Amazon Bedrock Knowledge Bases to enable structured data retrieval. The service provides a fully managed text-to-SQL module that alleviates common challenges in developing natural language query applications for structured data. This implementation uses Amazon Bedrock (Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases) along with your choice of data warehouse such as Amazon Redshift or Amazon SageMaker Lakehouse. The following diagram illustrates the workflow.

For example, a seller can use this capability embedded into an ecommerce application to ask a complex query like “Give me top 5 products whose sales increased by 50% last year as compared to previous year? Also group the results by product category.” The system automatically generates the appropriate SQL, executes it against the data sources, and delivers results or a summarized narrative. This pattern features the following benefits:

It provides fully managed text-to-SQL capabilities without requiring model training
It enables direct querying of data from the source without data movement
It supports complex analytical queries on warehouse data
It offers flexibility in foundation model (FM) selection through Amazon Bedrock
API connectivity, personalization options, and context-aware chat features make it better suited for customer facing applications

Choose this pattern when you need a flexible, developer-oriented solution. This approach works well for applications (internal or external) where you control the UI design. Default outputs are primarily text or structured data. However, executing arbitrary SQL queries can be a security risk for text-to-SQL applications. It is recommended that you take precautions as needed, such as using restricted roles, read-only databases, and sandboxing. For more information on how to build this pattern, see Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift. For a list of supported structured data stores, refer to Create a knowledge base by connecting to a structured data store.

Pattern 5: Custom text-to-SQL implementation with flexible model selection

This pattern represents a build-your-own solution using FMs to convert natural language to SQL, execute queries on data warehouses, and return results. Choose Amazon Bedrock when you want to quickly integrate this capability without deep ML expertise—it offers a fully managed service with ready-to-use FMs through a unified API, handling infrastructure needs with pay-as-you-go pricing. Alternatively, select Amazon SageMaker AI when you require extensive model customization to build specialized needs—it provides complete ML lifecycle tools for data scientists and ML engineers to build, train, and deploy custom models with greater control. For more information, refer to our Amazon Bedrock or Amazon SageMaker AI decision guide. The following diagram illustrates the architecture.

Use this pattern if your use case requires specific open-weight models, or you want to fine-tune models on your domain-specific data. For example, if you need highly accurate results for your query, then you can use this pattern to fine-tune models on specific schema structures, while maintaining the flexibility to integrate with existing workflows and multi-cloud environments. This pattern offers the following benefits:

It provides maximum customization in model selection, fine-tuning, and system design
It supports complex logic across multiple data sources
It offers complete control over security and deployment in your virtual private cloud (VPC)
It enables flexible interface implementation (Slack bots, custom web UIs, notebook plugins)
You can implement it for external user-facing solutions

For more information on steps to build this pattern, see Build a robust text-to-SQL solution generating complex queries, self-correcting, and querying diverse data sources.

Pattern comparison: Making the right choice

To make effective decisions, let’s compare these patterns across key criteria.

Data workload suitability

Different out-of-the-box patterns handle transactional (operational) and analytical (historical or aggregated) data with varying degrees of effectiveness. Patterns 1 and 3, which use Amazon Q Business, work with indexed data and are optimized for lookup-style queries against previously indexed content rather than real-time transactional database queries. Pattern 2, which uses Amazon Q in QuickSight, gets visual output for transactional information for ad-hoc analysis. Pattern 4, which uses Amazon Bedrock structured data retrieval, is specifically designed for analytical systems and data warehouses, excelling at complex queries on large datasets. Pattern 5 is a self-managed text-to-SQL option that can be built to support both transactional or analytical needs of users.

Target audience

Architectures highlighted in Patterns 1, 2, and 3 (using Amazon Q Business, Amazon Q in QuickSight, or a combination) are best suited for internal enterprise use. However, you can use Amazon QuickSight Embedded to embed data visuals, dashboards, and natural language queries into both internal or customer-facing applications. Amazon Q Business serves as an enterprise AI assistant for organizational knowledge that uses subscription-based pricing tiers that is designed for internal employees. Pattern 4 (using Amazon Bedrock) can be used to build both internal as well as customer-facing applications. This is because, unlike the subscription-based model of Amazon Q Business, Amazon Bedrock provides API-driven services that alleviate per-user costs and identity management overhead for external customer scenarios. This makes it well-suited for customer-facing experiences where you need to serve potentially thousands of external users. The custom LLM solutions in Pattern 5 can similarly be tailored to external application requirements.

Interface and output format

Different patterns deliver answers through different interaction models:

Conversational experiences – Patterns 1 and 3 (using Amazon Q Business) provide chat-based interfaces. Pattern 4 (using Amazon Bedrock Knowledge Bases for structured data retrieval) naturally supports AI assistant integration, and Pattern 5 (a custom text-to-SQL solution) can be designed for a variety of interaction models.
Visualization-focused output – Pattern 2 (using Amazon Q in QuickSight) specializes in generating on-the-fly visualizations such as charts and tables in response to user questions.
API integration – For embedding capabilities into existing applications, Patterns 4 and 5 offer the most flexible API-based integration options.

The following figure is a comparison matrix of AWS structured data query patterns.

Conclusion

Between these patterns, your optimal choice depends on the following key factors:

Data location and characteristics – Is your data in operational databases, already in a data warehouse, or distributed across various sources?
User profile and interaction model – Are you supporting internal or external users? Do they prefer conversational or visualization-focused interfaces?
Available resources and expertise – Do you have ML specialists available, or do you need a fully managed solution?
Accuracy and governance requirements – Do you need strictly controlled semantics and curation, or is broader query flexibility acceptable with monitoring?

By understanding these patterns and their trade-offs, you can architect solutions that align with your business objectives.

About the authors

Akshara Shah is a Senior Solutions Architect at Amazon Web Services. She helps commercial customers build cloud-based generative AI services to meet their business needs. She has been designing, developing, and implementing solutions that leverage AI and ML technologies for more than 10 years. Outside of work, she loves painting, exercising and spending time with family.

Sanghwa Na is a Generative AI Specialist Solutions Architect at Amazon Web Services. Based in San Francisco, he works with customers to design and build generative AI solutions using large language models and foundation models on AWS. He focuses on helping organizations adopt AI technologies that drive real business value

Revolutionizing drug data analysis using Amazon Bedrock multimodal RAG capabilities

July 1, 2025

by Vivek Mittal Amazon AWS

In the pharmaceutical industry, biotechnology and healthcare companies face an unprecedented challenge for efficiently managing and analyzing vast amounts of drug-related data from diverse sources. Traditional data analysis methods prove inadequate for processing complex medical documentation that includes a mix of text, images, graphs, and tables. Amazon Bedrock offers features like multimodal retrieval, advanced chunking capabilities, and citations to help organizations get high-accuracy responses.

Pharmaceutical and healthcare organizations process a vast number of complex document formats and unstructured data that pose analytical challenges. Clinical study documents and research papers related to them typically present an intricate blend of technical text, detailed tables, and sophisticated statistical graphs, making automated data extraction particularly challenging. Clinical study documents present additional challenges through non-standardized formatting and varied data presentation styles across multiple research institutions. This post showcases a solution to extract data-driven insights from complex research documents through a sample application with high-accuracy responses. It analyzes clinical trial data, patient outcomes, molecular diagrams, and safety reports from the research documents. It can help pharmaceutical companies accelerate their research process. The solution provides citations from the source documents, reducing hallucinations and enhancing the accuracy of the responses.

Solution overview

The sample application uses Amazon Bedrock to create an intelligent AI assistant that analyzes and summarizes research documents containing text, graphs, and unstructured data. Amazon Bedrock is a fully managed service that offers a choice of industry-leading foundation models (FMs) along with a broad set of capabilities to build generative AI applications, simplifying development with security, privacy, and responsible AI.

To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide relevant and accurate responses.

Amazon Bedrock Knowledge Bases is a fully managed RAG capability within Amazon Bedrock with in-built session context management and source attribution that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation, without having to build custom integrations to data sources and manage data flows.

Amazon Bedrock Knowledge Bases introduces powerful document parsing capabilities, including Amazon Bedrock Data Automation powered parsing and FM parsing, revolutionizing how we handle complex documents. Amazon Bedrock Data Automation is a fully managed service that processes multimodal data effectively, without the need to provide additional prompting. The FM option parses multimodal data using an FM. This parser provides the option to customize the default prompt used for data extraction. This advanced feature goes beyond basic text extraction by intelligently breaking down documents into distinct components, including text, tables, images, and metadata, while preserving document structure and context. When working with supported formats like PDF, specialized FMs interpret and extract tabular data, charts, and complex document layouts. Additionally, the service provides advanced chunking strategies like semantic chunking, which intelligently divides text into meaningful segments based on semantic similarity calculated by the embedding model. Unlike traditional syntactic chunking methods, this approach preserves the context and meaning of the content, improving the quality and relevance of information retrieval.

The solution architecture implements these capabilities through a seamless workflow that begins with administrators securely uploading knowledge base documents to an Amazon Simple Storage Service (Amazon S3) bucket. These documents are then ingested into Amazon Bedrock Knowledge Bases, where a large language model (LLM) processes and parses the ingested data. The solution employs semantic chunking to store document embeddings efficiently in Amazon OpenSearch Service for optimized retrieval. The solution features a user-friendly interface built with Streamlit, providing an intuitive chat experience for end-users. When users interact with the Streamlit application, it triggers AWS Lambda functions that handle the requests, retrieving relevant context from the knowledge base and generating appropriate responses. The architecture is secured through AWS Identity and Access Management (IAM), maintaining proper access control throughout the workflow. Amazon Bedrock uses AWS Key Management Service (AWS KMS) to encrypt resources related to your knowledge bases. By default, Amazon Bedrock encrypts this data using an AWS managed key. Optionally, you can encrypt the model artifacts using a customer managed key. This end-to-end solution provides efficient document processing, context-aware information retrieval, and secure user interactions, delivering accurate and comprehensive responses through a seamless chat interface.

The following diagram illustrates the solution architecture.

This solution uses the following additional services and features:

The Anthropic Claude 3 family offers Opus, Sonnet, and Haiku models that accept text, image, and video inputs and generate text output. They provide a broad selection of capability, accuracy, speed, and cost operation points. These models understand complex research documents that include charts, graphs, tables, diagrams, and reports.
AWS Lambda is a serverless computing service that empowers you to run code without provisioning or managing servers cost effectively.
Amazon S3 is a highly scalable, durable, and secure object storage service.
Amazon OpenSearch Service is a fully managed search and analytics engine for efficient document retrieval. The OpenSearch Service vector database capabilities enable semantic search, RAG with LLMs, recommendation engines, and search rich media.
Streamlit is a faster way to build and share data applications using interactive web-based data applications in pure Python.

Prerequisites

The following prerequisites are needed to proceed with this solution. For this post, we use the us-east-1 AWS Region. For details on available Regions, see Amazon Bedrock endpoints and quotas.

An AWS account with an IAM user who has permissions to Lambda, Amazon Bedrock, Amazon S3, and IAM.
Access to Anthropic’s Claude model in Amazon Bedrock. For instructions, see Access Amazon Bedrock foundation models.

Deploy the solution

Refer to the GitHub repository for the deployment steps listed under the deployment guide section. We use an AWS CloudFormation template to deploy solution resources, including S3 buckets to store the source data and knowledge base data.

Test the sample application

Imagine you are a member of an R&D department for a biotechnology firm, and your job requires you to derive insights from drug- and vaccine-related information from diverse sources like research studies, drug specifications, and industry papers. You are performing research on cancer vaccines and want to gain insights based on cancer research publications. You can upload the documents given in the reference section to the S3 bucket and sync the knowledge base. Let’s explore example interactions that demonstrate the application’s capabilities. The responses generated by the AI assistant are based on the documents uploaded to the S3 bucket connected with the knowledge base. Due to non-deterministic nature of machine learning (ML), your responses might be slightly different from the ones presented in this post.

Understanding historical context

We use the following query: “Create a timeline of major developments in mRNA vaccine technology for cancer treatment based on the information provided in the historical background sections.”The assistant analyzes multiple documents and presents a chronological progression of mRNA vaccine development, including key milestones based on the chunks of information retrieved from the OpenSearch Service vector database.

The following screenshot shows the AI assistant’s response.

Complex data analysis

We use the following query: “Synthesize the information from the text, figures, and tables to provide a comprehensive overview of the current state and future prospects of therapeutic cancer vaccines.”

The AI assistant is able to provide insights from complex data types, which is enabled by FM parsing, while ingesting the data to OpenSearch Service. It is also able to provide images in the source attribution using the multimodal data capabilities of Amazon Bedrock Knowledge Bases.

The following screenshot shows the AI assistant’s response.

The following screenshot shows the visuals provided in the citations when the mouse hovers over the question mark icon.

Comparative analysis

We use the following query: “Compare the efficacy and safety profiles of MAGE-A3 and NY-ESO-1 based vaccines as described in the text and any relevant tables or figures.”

The AI assistant used the semantically similar chunks returned from the OpenSearch Service vector database and added this context to the user’s question, which enabled the FM to provide a relevant answer.

The following screenshot shows the AI assistant’s response.

Technical deep dive

We use the following query: “Summarize the potential advantages of mRNA vaccines over DNA vaccines for targeting tumor angiogenesis, as described in the review.”

With the semantic chunking feature of the knowledge base, the AI assistant was able to get the relevant context from the OpenSearch Service database with higher accuracy.

The following screenshot shows the AI assistant’s response.

The following screenshot shows the diagram that was used for the answer as one of the citations.

The sample application demonstrates the following:

Accurate interpretation of complex scientific diagrams
Precise extraction of data from tables and graphs
Context-aware responses that maintain scientific accuracy
Source attribution for provided information
Ability to synthesize information across multiple documents

This application can help you quickly analyze vast amounts of complex scientific literature, extracting meaningful insights from diverse data types while maintaining accuracy and providing proper attribution to source materials. This is enabled by the advanced features of the knowledge bases, including FM parsing, which aides in interpreting complex scientific diagrams and extraction of data from tables and graphs, semantic chunking, which aides with high-accuracy context-aware responses, and multimodal data capabilities, which aides in providing relevant images as source attribution.

These are some of the many new features added to Amazon Bedrock, empowering you to generate high-accuracy results depending on your use case. To learn more, see New Amazon Bedrock capabilities enhance data processing and retrieval.

Production readiness

The proposed solution accelerates the time to value of the project development process. Solutions built on the AWS Cloud benefit from inherent scalability while maintaining robust security and privacy controls.

The security and privacy framework includes fine-grained user access controls using IAM for both OpenSearch Service and Amazon Bedrock services. In addition, Amazon Bedrock enhances security by providing encryption at rest and in transit, and private networking options using virtual private cloud (VPC) endpoints. Data protection is achieved using KMS keys, and API calls and usage are tracked through Amazon CloudWatch logs and metrics. For specific compliance validation for Amazon Bedrock, see Compliance validation for Amazon Bedrock.

For additional details on moving RAG applications to production, refer to From concept to reality: Navigating the Journey of RAG from proof of concept to production.

Clean up

Complete the following steps to clean up your resources.

Empty the SourceS3Bucket and KnowledgeBaseS3BucketName buckets.
Delete the main CloudFormation stack.

Conclusion

This post demonstrated the powerful multimodal document analysis (text, graphs, images) using advanced parsing and chunking features of Amazon Bedrock Knowledge Bases. By combining the powerful capabilities of Amazon Bedrock FMs, OpenSearch Service, and intelligent chunking strategies through Amazon Bedrock Knowledge Bases, organizations can transform their complex research documents into searchable, actionable insights. The integration of semantic chunking makes sure that document context and relationships are preserved, and the user-friendly Streamlit interface makes the system accessible to end-users through an intuitive chat experience. This solution not only streamlines the process of analyzing research documents, but also demonstrates the practical application of AI/ML technologies in enhancing knowledge discovery and information retrieval. As organizations continue to grapple with increasing volumes of complex documents, this scalable and intelligent system provides a robust framework for extracting maximum value from their document repositories.

Although our demonstration focused on the healthcare industry, the versatility of this technology extends beyond a single industry. RAG on Amazon Bedrock has proven its value across diverse sectors. Notable adopters include global brands like Adidas in retail, Empolis in information management, Fractal Analytics in AI solutions, Georgia Pacific in manufacturing, and Nasdaq in financial services. These examples illustrate the broad applicability and transformative potential of RAG technology across various business domains, highlighting its ability to drive innovation and efficiency in multiple industries.

Refer to the GitHub repo for the agentic RAG application, including samples and components for building agentic RAG solutions. Be on the lookout for additional features and samples in the repository in the coming months.

To learn more about Amazon Bedrock Knowledge Bases, check out the RAG workshop using Amazon Bedrock. Get started with Amazon Bedrock Knowledge Bases, and let us know your thoughts in the comments section.

References

The following are sample research documents available with an open access distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license https://creativecommons.org/licenses/by/4.0/:

About the authors

Vivek Mittal is a Solution Architect at Amazon Web Services, where he helps organizations architect and implement cutting-edge cloud solutions. With a deep passion for Generative AI, Machine Learning, and Serverless technologies, he specializes in helping customers harness these innovations to drive business transformation. He finds particular satisfaction in collaborating with customers to turn their ambitious technological visions into reality.

Shamika Ariyawansa, serving as a Senior AI/ML Solutions Architect in the Global Healthcare and Life Sciences division at Amazon Web Services (AWS), has a keen focus on Generative AI. He assists customers in integrating Generative AI into their projects, emphasizing the importance of explainability within their AI-driven initiatives. Beyond his professional commitments, Shamika passionately pursues skiing and off-roading adventures.

Shaik Abdulla is a Sr. Solutions Architect, specializes in architecting enterprise-scale cloud solutions with focus on Analytics, Generative AI and emerging technologies. His technical expertise is validated by his achievement of all 12 AWS certifications and the prestigious Golden jacket recognition. He has a passion to architect and implement innovative cloud solutions that drive business transformation. He speaks at major industry events like AWS re:Invent and regional AWS Summits, where he shares insights on cloud architecture and emerging technologies.

Build and deploy AI inference workflows with new enhancements to the Amazon SageMaker Python SDK

June 30, 2025

by Melanie Li Amazon AWS

Amazon SageMaker Inference has been a popular tool for deploying advanced machine learning (ML) and generative AI models at scale. As AI applications become increasingly complex, customers want to deploy multiple models in a coordinated group that collectively process inference requests for an application. In addition, with the evolution of generative AI applications, many use cases now require inference workflows—sequences of interconnected models operating in predefined logical flows. This trend drives a growing need for more sophisticated inference offerings.

To address this need, we are introducing a new capability in the SageMaker Python SDK that revolutionizes how you build and deploy inference workflows on SageMaker. We will take Amazon Search as an example to show case how this feature is used in helping customers building inference workflows. This new Python SDK capability provides a streamlined and simplified experience that abstracts away the underlying complexities of packaging and deploying groups of models and their collective inference logic, allowing you to focus on what matter most—your business logic and model integrations.

In this post, we provide an overview of the user experience, detailing how to set up and deploy these workflows with multiple models using the SageMaker Python SDK. We walk through examples of building complex inference workflows, deploying them to SageMaker endpoints, and invoking them for real-time inference. We also show how customers like Amazon Search plan to use SageMaker Inference workflows to provide more relevant search results to Amazon shoppers.

Whether you are building a simple two-step process or a complex, multimodal AI application, this new feature provides the tools you need to bring your vision to life. This tool aims to make it easy for developers and businesses to create and manage complex AI systems, helping them build more powerful and efficient AI applications.

In the following sections, we dive deeper into details of the SageMaker Python SDK, walk through practical examples, and showcase how this new capability can transform your AI development and deployment process.

Key improvements and user experience

The SageMaker Python SDK now includes new features for creating and managing inference workflows. These additions aim to address common challenges in developing and deploying inference workflows:

Deployment of multiple models – The core of this new experience is the deployment of multiple models as inference components within a single SageMaker endpoint. With this approach, you can create a more unified inference workflow. By consolidating multiple models into one endpoint, you can reduce the number of endpoints that need to be managed. This consolidation can also improve operational tasks, resource utilization, and potentially costs.
Workflow definition with workflow mode – The new workflow mode extends the existing Model Builder capabilities. It allows for the definition of inference workflows using Python code. Users familiar with the ModelBuilder class might find this feature to be an extension of their existing knowledge. This mode enables creating multi-step workflows, connecting models, and specifying the data flow between different models in the workflows. The goal is to reduce the complexity of managing these workflows and enable you to focus more on the logic of the resulting compound AI system.
Development and deployment options – A new deployment option has been introduced for the development phase. This feature is designed to allow for quicker deployment of workflows to development environments. The intention is to enable faster testing and refinement of workflows. This could be particularly relevant when experimenting with different configurations or adjusting models.
Invocation flexibility – The SDK now provides options for invoking individual models or entire workflows. You can choose to call a specific inference component used in a workflow or the entire workflow. This flexibility can be useful in scenarios where access to a specific model is needed, or when only a portion of the workflow needs to be executed.
Dependency management – You can use SageMaker Deep Learning Containers (DLCs) or the SageMaker distribution that comes preconfigured with various model serving libraries and tools. These are intended to serve as a starting point for common use cases.

To get started, use the SageMaker Python SDK to deploy your models as inference components. Then, use the workflow mode to create an inference workflow, represented as Python code using the container of your choice. Deploy the workflow container as another inference component on the same endpoints as the models or a dedicated endpoint. You can run the workflow by invoking the inference component that represents the workflow. The user experience is entirely code-based, using the SageMaker Python SDK. This approach allows you to define, deploy, and manage inference workflows using SDK abstractions offered by this feature and Python programming. The workflow mode provides flexibility to specify complex sequences of model invocations and data transformations, and the option to deploy as components or endpoints caters to various scaling and integration needs.

Solution overview

The following diagram illustrates a reference architecture using the SageMaker Python SDK.

The improved SageMaker Python SDK introduces a more intuitive and flexible approach to building and deploying AI inference workflows. Let’s explore the key components and classes that make up the experience:

ModelBuilder simplifies the process of packaging individual models as inference components. It handles model loading, dependency management, and container configuration automatically.
The CustomOrchestrator class provides a standardized way to define custom inference logic that orchestrates multiple models in the workflow. Users implement the handle() method to specify this logic and can use an orchestration library or none at all (plain Python).
A single deploy() call handles the deployment of the components and workflow orchestrator.
The Python SDK supports invocation against the custom inference workflow or individual inference components.
The Python SDK supports both synchronous and streaming inference.

CustomOrchestrator is an abstract base class that serves as a template for defining custom inference orchestration logic. It standardizes the structure of entry point-based inference scripts, making it straightforward for users to create consistent and reusable code. The handle method in the class is an abstract method that users implement to define their custom orchestration logic.

class CustomOrchestrator (ABC):
"""
Templated class used to standardize the structure of an entry point based inference script.
"""

    @abstractmethod
    def handle(self, data, context=None):
        """abstract class for defining an entrypoint for the model server"""
        return NotImplemented

With this templated class, users can integrate into their custom workflow code, and then point to this code in the model builder using a file path or directly using a class or method name. Using this class and the ModelBuilder class, it enables a more streamlined workflow for AI inference:

Users define their custom workflow by implementing the CustomOrchestrator class.
The custom CustomOrchestrator is passed to ModelBuilder using the ModelBuilder inference_spec parameter.
ModelBuilder packages the CustomOrchestrator along with the model artifacts.
The packaged model is deployed to a SageMaker endpoint (for example, using a TorchServe container).
When invoked, the SageMaker endpoint uses the custom handle() function defined in the CustomOrchestrator to handle the input payload.

In the follow sections, we provide two examples of custom workflow orchestrators implemented with plain Python code. For simplicity, the examples use two inference components.

We explore how to create a simple workflow that deploys two large language models (LLMs) on SageMaker Inference endpoints along with a simple Python orchestrator that calls the two models. We create an IT customer service workflow where one model processes the initial request and another suggests solutions. You can find the example notebook in the GitHub repo.

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, we host multiple models on the same SageMaker endpoint, so we use two ml.g5.24xlarge SageMaker hosting instances.

Python inference orchestration

First, let’s define our custom orchestration class that inherits from CustomOrchestrator. The workflow is structured around a custom inference entry point that handles the request data, processes it, and retrieves predictions from the configured model endpoints. See the following code:

class PythonCustomInferenceEntryPoint(CustomOrchestrator):
    def __init__(self, region_name, endpoint_name, component_names):
        self.region_name = region_name
        self.endpoint_name = endpoint_name
        self.component_names = component_names
    
    def preprocess(self, data):
        payload = {
            "inputs": data.decode("utf-8")
        }
        return json.dumps(payload)

    def _invoke_workflow(self, data):
        # First model (Llama) inference
        payload = self.preprocess(data)
        
        llama_response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=payload,
            ContentType="application/json",
            InferenceComponentName=self.component_names[0]
        )
        llama_generated_text = json.loads(llama_response.get('Body').read())['generated_text']
        
        # Second model (Mistral) inference
        parameters = {
            "max_new_tokens": 50
        }
        payload = {
            "inputs": llama_generated_text,
            "parameters": parameters
        }
        mistral_response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=json.dumps(payload),
            ContentType="application/json",
            InferenceComponentName=self.component_names[1]
        )
        return {"generated_text": json.loads(mistral_response.get('Body').read())['generated_text']}
    
    def handle(self, data, context=None):
        return self._invoke_workflow(data)

This code performs the following functions:

Defines the orchestration that sequentially calls two models using their inference component names
Processes the response from the first model before passing it to the second model
Returns the final generated response

This plain Python approach provides flexibility and control over the request-response flow, enabling seamless cascading of outputs across multiple model components.

Build and deploy the workflow

To deploy the workflow, we first create our inference components and then build the custom workflow. One inference component will host a Meta Llama 3.1 8B model, and the other will host a Mistral 7B model.

from sagemaker.serve import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder

# Create a ModelBuilder instance for Llama 3.1 8B
# Pre-benchmarked ResourceRequirements will be taken from JumpStart, as Llama-3.1-8b is a supported model.
llama_model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-8b",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    inference_component_name=llama_ic_name,
    instance_type="ml.g5.24xlarge"
)

# Create a ModelBuilder instance for Mistral 7B model.
mistral_mb = ModelBuilder(
    model="huggingface-llm-mistral-7b",
    instance_type="ml.g5.24xlarge",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    inference_component_name=mistral_ic_name,
    resource_requirements=ResourceRequirements(
        requests={
           "memory": 49152,
           "num_accelerators": 2,
           "copies": 1
        }
    ),
    instance_type="ml.g5.24xlarge"
)

Now we can tie it all together to create one more ModelBuilder to which we pass the modelbuilder_list, which contains the ModelBuilder objects we just created for each inference component and the custom workflow. Then we call the build() function to prepare the workflow for deployment.

# Create workflow ModelBuilder
orchestrator= ModelBuilder(
    inference_spec=PythonCustomInferenceEntryPoint(
        region_name=region,
        endpoint_name=llama_mistral_endpoint_name,
        component_names=[llama_ic_name, mistral_ic_name],
    ),
    dependencies={
        "auto": False,
        "custom": [
            "cloudpickle",
            "graphene",
            # Define other dependencies here.
        ],
    },
    sagemaker_session=Session(),
    role_arn=role,
    resource_requirements=ResourceRequirements(
        requests={
           "memory": 4096,
           "num_accelerators": 1,
           "copies": 1,
           "num_cpus": 2
        }
    ),
    name=custom_workflow_name, # Endpoint name for your custom workflow
    schema_builder=SchemaBuilder(sample_input={"inputs": "test"}, sample_output="Test"),
    modelbuilder_list=[llama_model_builder, mistral_mb] # Inference Component ModelBuilders created in Step 2
)
# call the build function to prepare the workflow for deployment
orchestrator.build()

In the preceding code snippet, you can comment out the section that defines the resource_requirements to have the custom workflow deployed on a separate endpoint instance, which can be a dedicated CPU instance to handle the custom workflow payload.

By calling the deploy() function, we deploy the custom workflow and the inference components to your desired instance type, in this example ml.g5.24.xlarge. If you choose to deploy the custom workflow to a separate instance, by default, it will use the ml.c5.xlarge instance type. You can set inference_workflow_instance_type and inference_workflow_initial_instance_count to configure the instances required to host the custom workflow.

predictors = orchestrator.deploy(
    instance_type="ml.g5.24xlarge",
    initial_instance_count=1,
    accept_eula=True, # Required for Llama3
    endpoint_name=llama_mistral_endpoint_name
    # inference_workflow_instance_type="ml.t2.medium", # default
    # inference_workflow_initial_instance_count=1 # default
)

Invoke the endpoint

After you deploy the workflow, you can invoke the endpoint using the predictor object:

from sagemaker.serializers import JSONSerializer
predictors[-1].serializer = JSONSerializer()
predictors[-1].predict("Tell me a story about ducks.")

You can also invoke each inference component in the deployed endpoint. For example, we can test the Llama inference component with a synchronous invocation, and Mistral with streaming:

from sagemaker.predictor import Predictor
# create predictor for the inference component of Llama model
llama_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=llama_ic_name)
llama_predictor.content_type = "application/json"

llama_predictor.predict(json.dumps(payload))

When handling the streaming response, we need to read each line of the output separately. The following example code demonstrates this streaming handling by checking for newline characters to separate and print each token in real time:

mistral_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=mistral_ic_name)
mistral_predictor.content_type = "application/json"

body = json.dumps({
    "inputs": prompt,
    # specify the parameters as needed
    "parameters": parameters
})

for line in mistral_predictor.predict_stream(body):
    decoded_line = line.decode('utf-8')
    if 'n' in decoded_line:
        # Split by newline to handle multiple tokens in the same line
        tokens = decoded_line.split('n')
        for token in tokens[:-1]:  # Print all tokens except the last one with a newline
            print(token)
        # Print the last token without a newline, as it might be followed by more tokens
        print(tokens[-1], end='')
    else:
        # Print the token without a newline if it doesn't contain 'n'
        print(decoded_line, end='')

So far, we have walked through the example code to demonstrate how to build complex inference logic using Python orchestration, deploy them to SageMaker endpoints, and invoke them for real-time inference. The Python SDK automatically handles the following:

Model packaging and container configuration
Dependency management and environment setup
Endpoint creation and component coordination

Whether you’re building a simple workflow of two models or a complex multimodal application, the new SDK provides the building blocks needed to bring your inference workflows to life with minimal boilerplate code.

Customer story: Amazon Search

Amazon Search is a critical component of the Amazon shopping experience, processing an enormous volume of queries across billions of products across diverse categories. At the core of this system are sophisticated matching and ranking workflows, which determine the order and relevance of search results presented to customers. These workflows execute large deep learning models in predefined sequences, often sharing models across different workflows to improve price-performance and accuracy. This approach makes sure that whether a customer is searching for electronics, fashion items, books, or other products, they receive the most pertinent results tailored to their query.

The SageMaker Python SDK enhancement offers valuable capabilities that align well with Amazon Search’s requirements for these ranking workflows. It provides a standard interface for developing and deploying complex inference workflows crucial for effective search result ranking. The enhanced Python SDK enables efficient reuse of shared models across multiple ranking workflows while maintaining the flexibility to customize logic for specific product categories. Importantly, it allows individual models within these workflows to scale independently, providing optimal resource allocation and performance based on varying demand across different parts of the search system.

Amazon Search is exploring the broad adoption of these Python SDK enhancements across their search ranking infrastructure. This initiative aims to further refine and improve search capabilities, enabling the team to build, version, and catalog workflows that power search ranking more effectively across different product categories. The ability to share models across workflows and scale them independently offers new levels of efficiency and adaptability in managing the complex search ecosystem.

Vaclav Petricek, Sr. Manager of Applied Science at Amazon Search, highlighted the potential impact of these SageMaker Python SDK enhancements: “These capabilities represent a significant advancement in our ability to develop and deploy sophisticated inference workflows that power search matching and ranking. The flexibility to build workflows using Python, share models across workflows, and scale them independently is particularly exciting, as it opens up new possibilities for optimizing our search infrastructure and rapidly iterating on our matching and ranking algorithms as well as new AI features. Ultimately, these SageMaker Inference enhancements will allow us to more efficiently create and manage the complex algorithms powering Amazon’s search experience, enabling us to deliver even more relevant results to our customers.”

The following diagram illustrates a sample solution architecture used by Amazon Search.

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if the endpoint is no longer required. You can follow the cleanup section the demo notebook or use following code to delete the model and endpoint created by the demo:

mistral_predictor.delete_predictor()
llama_predictor.delete_predictor()
llama_predictor.delete_endpoint()
workflow_predictor.delete_predictor()

Conclusion

The new SageMaker Python SDK enhancements for inference workflows mark a significant advancement in the development and deployment of complex AI inference workflows. By abstracting the underlying complexities, these enhancements empower inference customers to focus on innovation rather than infrastructure management. This feature bridges sophisticated AI applications with the robust SageMaker infrastructure, enabling developers to use familiar Python-based tools while harnessing the powerful inference capabilities of SageMaker.

Early adopters, including Amazon Search, are already exploring how these capabilities can drive major improvements in AI-powered customer experiences across diverse industries. We invite all SageMaker users to explore this new functionality, whether you’re developing classic ML models, building generative AI applications or multi-model workflows, or tackling multi-step inference scenarios. The enhanced SDK provides the flexibility, ease of use, and scalability needed to bring your ideas to life. As AI continues to evolve, SageMaker Inference evolves with it, providing you with the tools to stay at the forefront of innovation. Start building your next-generation AI inference workflows today with the enhanced SageMaker Python SDK.

About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Osho Gupta is a Senior Software Developer at AWS SageMaker. He is passionate about ML infrastructure space, and is motivated to learn & advance underlying technologies that optimize Gen AI training & inference performance. In his spare time, Osho enjoys paddle boarding, hiking, traveling, and spending time with his friends & family.

Joseph Zhang is a software engineer at AWS. He started his AWS career at EC2 before eventually transitioning to SageMaker, and now works on developing GenAI-related features. Outside of work he enjoys both playing and watching sports (go Warriors!), spending time with family, and making coffee.

Gary Wang is a Software Developer at AWS SageMaker. He is passionate about AI/ML operations and building new things. In his spare time, Gary enjoys running, hiking, trying new food, and spending time with his friends and family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In h is spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Vaclav Petricek is a Senior Applied Science Manager at Amazon Search, where he led teams that built Amazon Rufus and now leads science and engineering teams that work on the next generation of Natural Language Shopping. He is passionate about shipping AI experiences that make people’s lives better. Vaclav loves off-piste skiing, playing tennis, and backpacking with his wife and three children.

Wei Li is a Senior Software Dev Engineer in Amazon Search. She is passionate about Large Language Model training and inference technologies, and loves integrating these solutions into Search Infrastructure to enhance natural language shopping experiences. During her leisure time, she enjoys gardening, painting, and reading.

Brian Granger is a Senior Principal Technologist at Amazon Web Services and a professor of physics and data science at Cal Poly State University in San Luis Obispo, CA. He works at the intersection of UX design and engineering on tools for scientific computing, data science, machine learning, and data visualization. Brian is a co-founder and leader of Project Jupyter, co-founder of the Altair project for statistical visualization, and creator of the PyZMQ project for ZMQ-based message passing in Python. At AWS he is a technical and open source leader in the AI/ML organization. Brian also represents AWS as a board member of the PyTorch Foundation. He is a winner of the 2017 ACM Software System Award and the 2023 NASA Exceptional Public Achievement Medal for his work on Project Jupyter. He has a Ph.D. in theoretical physics from the University of Colorado.

Context extraction from image files in Amazon Q Business using LLMs

June 30, 2025

by Nikhil Jha Amazon AWS

To effectively convey complex information, organizations increasingly rely on visual documentation through diagrams, charts, and technical illustrations. Although text documents are well-integrated into modern knowledge management systems, rich information contained in diagrams, charts, technical schematics, and visual documentation often remains inaccessible to search and AI assistants. This creates significant gaps in organizational knowledge bases, leading to interpreting visual data manually and preventing automation systems from using critical visual information for comprehensive insights and decision-making. While Amazon Q Business already handles embedded images within documents, the custom document enrichment (CDE) feature extends these capabilities significantly by processing standalone image files (for example, JPGs and PNGs).

In this post, we look at a step-by-step implementation for using the CDE feature within an Amazon Q Business application. We walk you through an AWS Lambda function configured within CDE to process various image file types, and we showcase an example scenario of how this integration enhances the Amazon Q Business ability to provide comprehensive insights. By following this practical guide, you can significantly expand your organization’s searchable knowledge base, enabling more complete answers and insights that incorporate both textual and visual information sources.

Example scenario: Analyzing regional educational demographics

Consider a scenario where you’re working for a national educational consultancy that has charts, graphs, and demographic data across different AWS Regions stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following image shows student distribution by age range across various cities using a bar chart. The insights in visualizations like this are valuable for decision-making but traditionally locked within image formats in your S3 buckets and other storage.

With Amazon Q Business and CDE, we show you how to enable natural language queries against such visualizations. For example, your team could ask questions such as “Which city has the highest number of students in the 13–15 age range?” or “Compare the student demographics between City 1 and City 4” directly through the Amazon Q Business application interface.

You can bridge this gap using the Amazon Q Business CDE feature to:

Detect and process image files during the document ingestion process
Use Amazon Bedrock with AWS Lambda to interpret the visual information
Extract structured data and insights from charts and graphs
Make this information searchable using natural language queries

Solution overview

In this solution, we walk you through how to implement a CDE-based solution for your educational demographic data visualizations. The solution empowers organizations to extract meaningful information from image files using the CDE capability of Amazon Q Business. When Amazon Q Business encounters the S3 path during ingestion, CDE rules automatically trigger a Lambda function. The Lambda function identifies the image files and calls the Amazon Bedrock API, which uses multimodal large language models (LLMs) to analyze and extract contextual information from each image. The extracted text is then seamlessly integrated into the knowledge base in Amazon Q Business. End users can then quickly search for valuable data and insights from images based on their actual context. By bridging the gap between visual content and searchable text, this solution helps organizations unlock valuable insights previously hidden within their image repositories.

The following figure shows the high-level architecture diagram used for this solution.

For this use case, we use Amazon S3 as our data source. However, this same solution is adaptable to other data source types supported by Amazon Q Business, or it can be implemented with custom data sources as needed.To complete the solution, follow these high-level implementation steps:

Create an Amazon Q Business application and sync with an S3 bucket.
Configure the Amazon Q Business application CDE for the Amazon S3 data source.
Extract context from the images.

Prerequisites

The following prerequisites are needed for implementation:

An AWS account.
At least one Amazon Q Business Pro user that has admin permissions to set up and configure Amazon Q Business. For pricing information, refer to Amazon Q Business pricing.
AWS Identity and Access Management (IAM) permissions to create and manage IAM roles and policies.
A supported data source to connect, such as an S3 bucket containing your public documents.
Access to an Amazon Bedrock LLM in the required AWS Region.

Create an Amazon Q Business application and sync with an S3 bucket

To create an Amazon Q Business application and connect it to your S3 bucket, complete the following steps. These steps provide a general overview of how to create an Amazon Q Business application and synchronize it with an S3 bucket. For more comprehensive, step-by-step guidance, follow the detailed instructions in the blog post Discover insights from Amazon S3 with Amazon Q S3 connector.

Initiate your application setup through either the AWS Management Console or AWS Command Line Interface (AWS CLI).
Create an index for your Amazon Q Business application.
Use the built-in Amazon S3 connector to link your application with documents stored in your organization’s S3 buckets.

Configure the Amazon Q Business application CDE for the Amazon S3 data source

With the CDE feature of Amazon Q Business, you can make the most of your Amazon S3 data sources by using the sophisticated capabilities to modify, enhance, and filter documents during the ingestion process, ultimately making enterprise content more discoverable and valuable. When connecting Amazon Q Business to S3 repositories, you can use CDE to seamlessly transform your raw data, applying modifications that significantly improve search quality and information accessibility. This powerful functionality extends to extracting context from binary files such as images through integration with Amazon Bedrock services, enabling organizations to unlock insights from previously inaccessible content formats. By implementing CDE for Amazon S3 data sources, businesses can maximize the utility of their enterprise data within Amazon Q, creating a more comprehensive and intelligent knowledge base that responds effectively to user queries.To configure the Amazon Q Business application CDE for the Amazon S3 data source, complete the following steps:

Select your application and navigate to Data sources.
Choose your existing Amazon S3 data source or create a new one. Verify that Audio/Video under Multi-media content configuration is not enabled.
In the data source configuration, locate the Custom Document Enrichment section.
Configure the pre-extraction rules to trigger a Lambda function when specific S3 bucket conditions are satisfied. Check the following screenshot for an example configuration.

Pre-extraction rules are executed before Amazon Q Business processes files from your S3 bucket.

Extract context from the images

To extract insights from an image file, the Lambda function makes an Amazon Bedrock API call using Anthropic’s Claude 3.7 Sonnet model. You can modify the code to use other Amazon Bedrock models based on your use case.

Constructing the prompt is a critical piece of the code. We recommend trying various prompts to get the desired output for your use case. Amazon Bedrock offers the capability to optimize a prompt that you can use to enhance your use case specific input.

Examine the following Lambda function code snippets, written in Python, to understand the Amazon Bedrock model setup along with a sample prompt to extract insights from an image.

In the following code snippet, we start by importing relevant Python libraries, define constants, and initialize AWS SDK for Python (Boto3) clients for Amazon S3 and Amazon Bedrock runtime. For more information, refer to the Boto3 documentation.

import boto3
import logging
import json
from typing import List, Dict, Any
from botocore.config import Config

MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
MAX_TOKENS = 2000
MAX_RETRIES = 2
FILE_FORMATS = ("jpg", "jpeg", "png")

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')
bedrock = boto3.client('bedrock-runtime', config=Config(read_timeout=3600, region_name='us-east-1'))

The prompt passed to the Amazon Bedrock model, Anthropic’s Claude 3.7 Sonnet in this case, is broken into two parts: prompt_prefix and prompt_suffix. The prompt breakdown makes it more readable and manageable. Additionally, the Amazon Bedrock prompt caching feature can be used to reduce response latency as well as input token cost. You can modify the prompt to extract information based on your specific use case as needed.

prompt_prefix = """You are an expert image reader tasked with generating detailed descriptions for various """
"""types of images. These images may include technical diagrams,"""
""" graphs and charts, categorization diagrams, data flow and process flow diagrams,"""
""" hierarchical and timeline diagrams, infographics, """
"""screenshots and product diagrams/images from user manuals. """
""" The description of these images needs to be very detailed so that user can ask """
""" questions based on the image, which can be answered by only looking at the descriptions """
""" that you generate.
Here is the image you need to analyze:

<image>
"""

prompt_suffix = """
</image>

Please follow these steps to analyze the image and generate a comprehensive description:

1. Image type: Classify the image as one of technical diagrams, graphs and charts, categorization diagrams, data flow and process flow diagrams, hierarchical and timeline diagrams, infographics, screenshots and product diagrams/images from user manuals. The description of these images needs to be very detailed so that user can ask questions based on the image, which can be answered by only looking at the descriptions that you generate or other.

2. Items:
   Carefully examine the image and extract all entities, texts, and numbers present. List these elements in <image_items> tags.

3. Detailed Description:
   Using the information from the previous steps, provide a detailed description of the image. This should include the type of diagram or chart, its main purpose, and how the various elements interact or relate to each other.  Capture all the crucial details that can be used to answer any followup questions. Write this description in <image_description> tags.

4. Data Estimation (for charts and graphs only):
   If the image is a chart or graph, capture the data in the image in CSV format to be able to recreate the image from the data. Ensure your response captures all relevant details from the chart that might be necessary to answer any follow up questions from the chart.
   If exact values cannot be inferred, provide an estimated range for each value in <estimation> tags.
   If no data is present, respond with "No data found".

Present your analysis in the following format:

<analysis>
<image_type>
[Classify the image type here]
</image_type>

<image_items>
[List all extracted entities, texts, and numbers here]
</image_items>

<image_description>
[Provide a detailed description of the image here]
</image_description>

<data>
[If applicable, provide estimated number ranges for chart elements here]
</data>
</analysis>

Remember to be thorough and precise in your analysis. If you're unsure about any aspect of the image, state your uncertainty clearly in the relevant section.
"""

The lambda_handler is the main entry point for the Lambda function. While invoking this Lambda function, the CDE passes the data source’s information within event object input. In this case, the S3 bucket and the S3 object key are retrieved from the event object along with the file format. Further processing of the input happens only if the file_format matches the expected file types. For production ready code, implement proper error handling for unexpected errors.

def lambda_handler(event, context):
    logger.info("Received event: %s" % json.dumps(event))
    s3Bucket = event.get("s3Bucket")
    s3ObjectKey = event.get("s3ObjectKey")
    metadata = event.get("metadata")
    file_format = s3ObjectKey.lower().split('.')[-1]
    new_key = 'cde_output/' + s3ObjectKey + '.txt'
    if (file_format in FILE_FORMATS):
        afterCDE = generate_image_description(s3Bucket, s3ObjectKey, file_format)
        s3.put_object(Bucket = s3Bucket, Key = new_key, Body=afterCDE)
    return {
        "version" : "v0",
        "s3ObjectKey": new_key,
        "metadataUpdates": []
    }

The generate_image_description function calls two other functions: first to construct the message that is passed to the Amazon Bedrock model and second to invoke the model. It returns the final text output extracted from the image file by the model invocation.

def generate_image_description(s3Bucket: str, s3ObjectKey: str, file_format: str) -> str:
    """
    Generate a description for an image.
    Inputs:
        image_file: str - Path to the image file
    Output:
        str - Generated image description
    """
    messages = _llm_input(s3Bucket, s3ObjectKey, file_format)
    response = _invoke_model(messages)
    return response['output']['message']['content'][0]['text']

The _llm_input function takes in the S3 object’s details passed as input along with the file type (png, jpg) and builds the message in the format expected by the model invoked by Amazon Bedrock.

def _llm_input(s3Bucket: str, s3ObjectKey: str, file_format: str) -> List[Dict[str, Any]]:
    s3_response = s3.get_object(Bucket = s3Bucket, Key = s3ObjectKey)
    image_content = s3_response['Body'].read()
    message = {
        "role": "user",
        "content": [
            {"text": prompt_prefix},
            {
                "image": {
                    "format": file_format,
                    "source": {
                        "bytes": image_content
                    }
                }
            },
            {"text": prompt_suffix}
        ]
    }
    return [message]

The _invoke_model function calls the converse API using the Amazon Bedrock runtime client. This API returns the response generated by the model. The values within inferenceConfig settings for maxTokens and temperature are used to limit the length of the response and make the responses more deterministic (less random) respectively.

def _invoke_model(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Call the Bedrock model with retry logic.
    Input:
        messages: List[Dict[str, Any]] - Prepared messages for the model
    Output:
        Dict[str, Any] - Model response
    """
    for attempt in range(MAX_RETRIES):
        try:
            response = bedrock.converse(
                modelId=MODEL_ID,
                messages=messages,
                inferenceConfig={
                    "maxTokens": MAX_TOKENS,
                    "temperature": 0,
                }
            )
            return response
        except Exception as e:
            print(e)
    
    raise Exception(f"Failed to call model after {MAX_RETRIES} attempts")

Putting all the preceding code pieces together, the full Lambda function code is shown in the following block:

# Example Lambda function for image processing
import boto3
import logging
import json
from typing import List, Dict, Any
from botocore.config import Config

MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
MAX_TOKENS = 2000
MAX_RETRIES = 2
FILE_FORMATS = ("jpg", "jpeg", "png")

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')
bedrock = boto3.client('bedrock-runtime', config=Config(read_timeout=3600, region_name='us-east-1'))

prompt_prefix = """You are an expert image reader tasked with generating detailed descriptions for various """
"""types of images. These images may include technical diagrams,"""
""" graphs and charts, categorization diagrams, data flow and process flow diagrams,"""
""" hierarchical and timeline diagrams, infographics, """
"""screenshots and product diagrams/images from user manuals. """
""" The description of these images needs to be very detailed so that user can ask """
""" questions based on the image, which can be answered by only looking at the descriptions """
""" that you generate.
Here is the image you need to analyze:

<image>
"""

prompt_suffix = """
</image>

Please follow these steps to analyze the image and generate a comprehensive description:

1. Image type: Classify the image as one of technical diagrams, graphs and charts, categorization diagrams, data flow and process flow diagrams, hierarchical and timeline diagrams, infographics, screenshots and product diagrams/images from user manuals. The description of these images needs to be very detailed so that user can ask questions based on the image, which can be answered by only looking at the descriptions that you generate or other.

2. Items:
   Carefully examine the image and extract all entities, texts, and numbers present. List these elements in <image_items> tags.

3. Detailed Description:
   Using the information from the previous steps, provide a detailed description of the image. This should include the type of diagram or chart, its main purpose, and how the various elements interact or relate to each other.  Capture all the crucial details that can be used to answer any followup questions. Write this description in <image_description> tags.

4. Data Estimation (for charts and graphs only):
   If the image is a chart or graph, capture the data in the image in CSV format to be able to recreate the image from the data. Ensure your response captures all relevant details from the chart that might be necessary to answer any follow up questions from the chart.
   If exact values cannot be inferred, provide an estimated range for each value in <estimation> tags.
   If no data is present, respond with "No data found".

Present your analysis in the following format:

<analysis>
<image_type>
[Classify the image type here]
</image_type>

<image_items>
[List all extracted entities, texts, and numbers here]
</image_items>

<image_description>
[Provide a detailed description of the image here]
</image_description>

<data>
[If applicable, provide estimated number ranges for chart elements here]
</data>
</analysis>

Remember to be thorough and precise in your analysis. If you're unsure about any aspect of the image, state your uncertainty clearly in the relevant section.
"""

def _llm_input(s3Bucket: str, s3ObjectKey: str, file_format: str) -> List[Dict[str, Any]]:
    s3_response = s3.get_object(Bucket = s3Bucket, Key = s3ObjectKey)
    image_content = s3_response['Body'].read()
    message = {
        "role": "user",
        "content": [
            {"text": prompt_prefix},
            {
                "image": {
                    "format": file_format,
                    "source": {
                        "bytes": image_content
                    }
                }
            },
            {"text": prompt_suffix}
        ]
    }
    return [message]

def _invoke_model(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Call the Bedrock model with retry logic.
    Input:
        messages: List[Dict[str, Any]] - Prepared messages for the model
    Output:
        Dict[str, Any] - Model response
    """
    for attempt in range(MAX_RETRIES):
        try:
            response = bedrock.converse(
                modelId=MODEL_ID,
                messages=messages,
                inferenceConfig={
                    "maxTokens": MAX_TOKENS,
                    "temperature": 0,
                }
            )
            return response
        except Exception as e:
            print(e)
    
    raise Exception(f"Failed to call model after {MAX_RETRIES} attempts")

def generate_image_description(s3Bucket: str, s3ObjectKey: str, file_format: str) -> str:
    """
    Generate a description for an image.
    Inputs:
        image_file: str - Path to the image file
    Output:
        str - Generated image description
    """
    messages = _llm_input(s3Bucket, s3ObjectKey, file_format)
    response = _invoke_model(messages)
    return response['output']['message']['content'][0]['text']

def lambda_handler(event, context):
    logger.info("Received event: %s" % json.dumps(event))
    s3Bucket = event.get("s3Bucket")
    s3ObjectKey = event.get("s3ObjectKey")
    metadata = event.get("metadata")
    file_format = s3ObjectKey.lower().split('.')[-1]
    new_key = 'cde_output/' + s3ObjectKey + '.txt'
    if (file_format in FILE_FORMATS):
        afterCDE = generate_image_description(s3Bucket, s3ObjectKey, file_format)
        s3.put_object(Bucket = s3Bucket, Key = new_key, Body=afterCDE)
    return {
        "version" : "v0",
        "s3ObjectKey": new_key,
        "metadataUpdates": []
    }

We strongly recommend testing and validating code in a nonproduction environment before deploying it to production. In addition to Amazon Q pricing, this solution will incur charges for AWS Lambda and Amazon Bedrock. For more information, refer to AWS Lambda pricing and Amazon Bedrock pricing.

After the Amazon S3 data is synced with the Amazon Q index, you can prompt the Amazon Q Business application to get the extracted insights as shown in the following section.

Example prompts and results

The following question and answer pairs refer the Student Age Distribution graph at the beginning of this post.

Q: Which City has the highest number of students in the 13-15 age range?

Q: Compare the student demographics between City 1 and City 4?

In the original graph, the bars representing student counts lacked explicit numerical labels, which could make data interpretation challenging on a scale. However, with Amazon Q Business and its integration capabilities, this limitation can be overcome. By using Amazon Q Business to process these visualizations with Amazon Bedrock LLMs using the CDE feature, we’ve enabled a more interactive and insightful analysis experience. The service effectively extracts the contextual information embedded in the graph, even when explicit labels are absent. This powerful combination means that end users can ask questions about the visualization and receive responses based on the underlying data. Rather than being limited by what’s explicitly labeled in the graph, users can now explore deeper insights through natural language queries. This capability demonstrates how Amazon Q Business transforms static visualizations into queryable knowledge assets, enhancing the value of your existing data visualizations without requiring additional formatting or preparation work.

Best practices for Amazon S3 CDE configuration

When setting up CDE for your Amazon S3 data source, consider these best practices:

Use conditional rules to only process specific file types that need transformation.
Monitor Lambda execution with Amazon CloudWatch to track processing errors and performance.
Set appropriate timeout values for your Lambda functions, especially when processing large files.
Consider incremental syncing to process only new or modified documents in your S3 bucket.
Use document attributes to track which documents have been processed by CDE.

Cleanup

Complete the following steps to clean up your resources:

Go to the Amazon Q Business application and select Remove and unsubscribe for users and groups.
Delete the Amazon Q Business application.
Delete the Lambda function.
Empty and delete the S3 bucket. For instructions, refer to Deleting a general purpose bucket.

Conclusion

This solution demonstrates how combining Amazon Q Business, custom document enrichment, and Amazon Bedrock can transform static visualizations into queryable knowledge assets, significantly enhancing the value of existing data visualizations without additional formatting work. By using these powerful AWS services together, organizations can bridge the gap between visual information and actionable insights, enabling users to interact with different file types in more intuitive ways.

Explore What is Amazon Q Business? and Getting started with Amazon Bedrock in the documentation to implement this solution for your specific use cases and unlock the potential of your visual data.

About the Authors

About the authors

Amit Chaudhary Amit Chaudhary is a Senior Solutions Architect at Amazon Web Services. His focus area is AI/ML, and he helps customers with generative AI, large language models, and prompt engineering. Outside of work, Amit enjoys spending time with his family.

Nikhil Jha Nikhil Jha is a Senior Technical Account Manager at Amazon Web Services. His focus areas include AI/ML, building Generative AI resources, and analytics. In his spare time, he enjoys exploring the outdoors with his family.

Build AWS architecture diagrams using Amazon Q CLI and MCP

June 30, 2025

by Joel Asante Amazon AWS

Creating professional AWS architecture diagrams is a fundamental task for solutions architects, developers, and technical teams. These diagrams serve as essential communication tools for stakeholders, documentation of compliance requirements, and blueprints for implementation teams. However, traditional diagramming approaches present several challenges:

Time-consuming process – Creating detailed architecture diagrams manually can take hours or even days
Steep learning curve – Learning specialized diagramming tools requires significant investment
Inconsistent styling – Maintaining visual consistency across multiple diagrams is difficult
Outdated AWS icons – Keeping up with the latest AWS service icons and best practices challenging.
Difficult maintenance – Updating diagrams as architectures evolve can become increasingly burdensome

Amazon Q Developer CLI with the Model Context Protocol (MCP) offers a streamlined approach to creating AWS architecture diagrams. By using generative AI through natural language prompts, architects can now generate professional diagrams in minutes rather than hours, while adhering to AWS best practices.

In this post, we explore how to use Amazon Q Developer CLI with the AWS Diagram MCP and the AWS Documentation MCP servers to create sophisticated architecture diagrams that follow AWS best practices. We discuss techniques for basic diagrams and real-world diagrams, with detailed examples and step-by-step instructions.

Solution overview

Developed by Anthropic as an open protocol, the Model Context Protocol (MCP) provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture (as illustrated in the following diagram), the MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), AWS MCP CLI, or other AI applications
Client – Protocol clients that maintain one-to-one connections with server
Server – Lightweight programs that expose capabilities through standardized MCP or act as tools
Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect with

As announced in April 2025, MCP enables Amazon Q Developer to connect with specialized servers that extend its capabilities beyond what’s possible with the base model alone. MCP servers act as plugins for Amazon Q, providing domain-specific knowledge and functionality. The AWS Diagram MCP server specifically enables Amazon Q to generate architecture diagrams using the Python diagrams package, with access to the complete AWS icon set and architectural best practices.

Prerequisites

To implement this solution, you must have an AWS account with appropriate permissions and follow the steps below.

Set up your environment

Before you can start creating diagrams, you need to set up your environment with Amazon Q CLI, the AWS Diagram MCP server, and AWS Documentation MCP server. This section provides detailed instructions for installation and configuration.

Install Amazon Q Developer CLI

Amazon Q Developer CLI is available as a standalone installation. Complete the following steps to install it:

Download and install Amazon Q Developer CLI. For instructions, see Using Amazon Q Developer on the command line.
Verify the installation by running the following command: q --version
You should see output similar to the following: Amazon Q Developer CLI version 1.x.x
Configure Amazon Q CLI with your AWS credentials: q login
Choose the login method suitable for you:
- Use for free with AWS Builder ID
- Use with Pro license

Set up MCP servers

Complete the following steps to set up your MCP servers:

Install uv using the following command: pip install uv
Install Python 3.10 or newer: uv python install 3.10
Install GraphViz for your operating system.
Add the servers to your ~/.aws/amazonq/mcp.json file:

{
  "mcpServers": {
    "awslabs.aws-diagram-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-diagram-mcp-server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    },
    "awslabs.aws-documentation-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    }
  }
}

Now, Amazon Q CLI automatically discovers MCP servers in the ~/.aws/amazonq/mcp.json file.

Understanding MCP server tools

The AWS Diagram MCP server provides several powerful tools:

list_icons – Lists available icons from the diagrams package, organized by provider and service category
get_diagram_examples – Provides example code for different types of diagrams (AWS, sequence, flow, class, and others)
generate_diagram – Creates a diagram from Python code using the diagrams package

The AWS Documentation MCP server provides the following useful tools:

search_documentation – Searches AWS documentation using the official AWS Documentation Search API
read_documentation – Fetches and converts AWS documentation pages to markdown format
recommend – Gets content recommendations for AWS documentation pages

These tools work together to help you create accurate architecture diagrams that follow AWS best practices.

Test your setup

Let’s verify that everything is working correctly by generating a simple diagram:

Start the Amazon Q CLI chat interface and verify the output shows the MCP servers being loaded and initialized: q chat
In the chat interface, enter the following prompt:
Please create a diagram showing an EC2 instance in a VPC connecting to an external S3 bucket. Include essential networking components (VPC, subnets, Internet Gateway, Route Table), security elements (Security Groups, NACLs), and clearly mark the connection between EC2 and S3. Label everything appropriately concisely and indicate that all resources are in the us-east-1 region. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.
Amazon Q CLI will ask you to trust the tool that is being used; enter t to trust it.Amazon Q CLI will generate and display a simple diagram showing the requested architecture. Your diagram should look similar to the following screenshot, though there might be variations in layout, styling, or specific details because it’s created using generative AI. The core architectural components and relationships will be represented, but the exact visual presentation might differ slightly with each generation.

If you see the diagram, your environment is set up correctly. If you encounter issues, verify that Amazon Q CLI can access the MCP servers by making sure you installed the necessary tools and the servers are in the ~/.aws/amazonq/mcp.json file.

Configuration options

The AWS Diagram MCP server supports several configuration options to customize your diagramming experience:

Output directory – By default, diagrams are saved in a generated-diagrams directory in your current working directory. You can specify a different location in your prompts.
Diagram format – The default output format is PNG, but you can request other formats like SVG in your prompts.
Styling options – You can specify colors, shapes, and other styling elements in your prompts.

Now that our environment is set up, let’s create more diagrams.

Create AWS architecture diagrams

In this section, we walk through the process of multiple AWS architecture diagrams using Amazon Q CLI with the AWS Diagram MCP server and AWS Documentation MCP server to make sure our requirements follow best practices.

When you provide a prompt to Amazon Q CLI, the AWS Diagram and Documentation MCP servers complete the following steps:

Interpret your requirements.
Check for best practices on the AWS documentation.
Generate Python code using the diagrams package.
Execute the code to create the diagram.
Return the diagram as an image.

This process happens seamlessly, so you can focus on describing what you want rather than how to create it.

AWS architecture diagrams typically include the following components:

Nodes – AWS services and resources
Edges – Connections between nodes showing relationships or data flow
Clusters – Logical groupings of nodes, such as virtual private clouds (VPCs), subnets, and Availability Zones
Labels – Text descriptions for nodes and connections

Example 1: Create a web application architecture

Let’s create a diagram for a simple web application hosted on AWS. Enter the following prompt:

Create a diagram for a simple web application with an Application Load Balancer, two EC2 instances, and an RDS database. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram

After you enter your prompt, Amazon Q CLI will search AWS documentation for best practices using the search_documentation tool from awslabsaws_documentation_mcp_server.

Following the search of the relevant AWS documentation, it will read the documentation using the read_documentation tool from the MCP server awslabsaws_documentation_mcp_server.
read_document-img

Amazon Q CLI will then list the needed AWS service icons using the list_icons tool, and will use generate_diagram with awslabsaws_diagram_mcp_server.
list and generate on cli

You should receive an output with a description of the diagram created based on the prompt along with the location of where the diagram was saved.
final-output-1stexample

Amazon Q CLI will generate and display the diagram.

The generated diagram shows the following key components:

An Application Load Balancer as the entry point
Two Amazon Elastic Compute Cloud (Amazon EC2) instances for the application tier
An Amazon Relational Database Service (Amazon RDS) instance for the database tier
Connections showing the flow of traffic

Example 2: Create a multi-tier architecture

Multi-tier architectures separate applications into functional layers (presentation, application, and data) to improve scalability and security. We use the following prompt to create our diagram:

Create a diagram for a three-tier web application with a presentation tier (ALB and CloudFront), application tier (ECS with Fargate), and data tier (Aurora PostgreSQL). Include VPC with public and private subnets across multiple AZs. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram shows the following key components:

A presentation tier in public subnets
An application tier in private subnets
A data tier in isolated private subnets
Proper security group configurations
Traffic flow between tiers

Example 3: Create a serverless architecture

We use the following prompt to create a diagram for a serverless architecture:

Create a diagram for a serverless web application using API Gateway, Lambda, DynamoDB, and S3 for static website hosting. Include Cognito for user authentication and CloudFront for content delivery. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

Amazon Simple Storage Service (Amazon S3) hosting static website content
Amazon CloudFront distributing content globally
Amazon API Gateway handling API requests
AWS Lambda functions implementing business logic
Amazon DynamoDB storing application data
Amazon Cognito managing user authentication

Example 4: Create a data processing diagram

We use the following prompt to create a diagram for a data processing pipeline:

Create a diagram for a data processing pipeline with components organized in clusters for data ingestion, processing, storage, and analytics. Include Kinesis, Lambda, S3, Glue, and QuickSight. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram organizes components into distinct clusters:

Data ingestion – Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon Simple Queue Service
Data processing – Lambda functions, AWS Glue jobs
Data storage – S3 buckets, DynamoDB tables
Data analytics – AWS Glue, Amazon Athena, Amazon QuickSight

Real-world examples

Let’s explore some real-world architecture patterns and how to create diagrams for them using Amazon Q CLI with the AWS Diagram MCP server.

Ecommerce platform

Ecommerce platforms require scalable, resilient architectures to handle variable traffic and maintain high availability. We use the following prompt to create an example diagram:

Create a diagram for an e-commerce platform with microservices architecture. Include components for product catalog, shopping cart, checkout, payment processing, order management, and user authentication. Ensure the architecture follows AWS best practices for scalability and security. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

API Gateway as the entry point for client applications
Microservices implemented as containers in Amazon Elastic Container Service (Amazon ECS) with AWS Fargate
RDS databases for product catalog, shopping cart, and order data
Amazon ElastiCache for product data caching and session management
Amazon Cognito for authentication
Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS) for asynchronous communication between services
CloudFront for content delivery and static assets from Amazon S3
Amazon Route 53 for DNS management
AWS WAF for web application security
AWS Lambda functions for serverless microservice implementation
AWS Secrets Manager for secure credential storage
Amazon CloudWatch for monitoring and observability

Intelligent document processing solution

We use the following prompt to create a diagram for an intelligent document processing (IDP) architecture:

Create a diagram for an intelligent document processing (IDP) application on AWS. Include components for document ingestion, OCR and text extraction, intelligent data extraction (using NLP and/or computer vision), human review and validation, and data output/integration. Ensure the architecture follows AWS best practices for scalability and security, leveraging services like S3, Lambda, Textract, Comprehend, SageMaker (for custom models, if applicable), and potentially Augmented AI (A2I). Check for AWS documentation related to intelligent document processing best practices to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

Amazon API Gateway as the entry point for client applications, providing a secure and scalable interface
Microservices implemented as containers in ECS with Fargate, enabling flexible and scalable processing
Amazon RDS databases for product catalog, shopping cart, and order data, providing reliable structured data storage
Amazon ElastiCache for product data caching and session management, improving performance and user experience
Amazon Cognito for authentication, ensuring secure access control
Amazon Simple Queue Service and Amazon Simple Notification Service for asynchronous communication between services, enabling decoupled and resilient architecture
Amazon CloudFront for content delivery and static assets from S3, optimizing global performance
Amazon Route53 for DNS management, providing reliable routing
AWS WAF for web application security, protecting against common web exploits
AWS Lambda functions for serverless microservice implementation, offering cost-effective scaling
AWS Secrets Manager for secure credential storage, enhancing security posture
Amazon CloudWatch for monitoring and observability, providing insights into system performance and health.

Clean up

If you no longer need to use the AWS Cost Analysis MCP server with Amazon Q CLI, you can remove it from your configuration:

Open your ~/.aws/amazonq/mcp.json file.
Remove or comment out the MCP server entries.
Save the file.

This will prevent the server from being loaded when you start Amazon Q CLI in the future.

Conclusion

In this post, we explored how to use Amazon Q CLI with the AWS Documentation MCP and AWS Diagram MCP servers to create professional AWS architecture diagrams that adhere to AWS best practices referenced from official AWS documentation. This approach offers significant advantages over traditional diagramming methods:

Time savings – Generate complex diagrams in minutes instead of hours
Consistency – Make sure diagrams follow the same style and conventions
Best practices – Automatically incorporate AWS architectural guidelines
Iterative refinement – Quickly modify diagrams through simple prompts
Validation – Check architectures against official AWS documentation and recommendations

As you continue your journey with AWS architecture diagrams, we encourage you to deepen your knowledge by learning more about the Model Context Protocol (MCP) to understand how it enhances the capabilities of Amazon Q. When seeking inspiration for your own designs, the AWS Architecture Center offers a wealth of reference architectures that follow best practices. For creating visually consistent diagrams, be sure to visit the AWS Icons page, where you can find the complete official icon set. And to stay at the cutting edge of these tools, keep an eye on updates to the official AWS MCP Servers—they’re constantly evolving with new features to make your diagramming experience even better.

About the Authors

Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech (Government Technology) customers. With a strong background in data science and application development, he brings deep technical expertise to creating secure and scalable cloud architectures for his customers. Joel is passionate about data analytics, machine learning, and robotics, leveraging his development experience to design innovative solutions that meet complex government requirements. He holds 13 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.

Dunieski Otano is a Solutions Architect at Amazon Web Services based out of Miami, Florida. He works with World Wide Public Sector MNO (Multi-International Organizations) customers. His passion is Security, Machine Learning and Artificial Intelligence, and Serverless. He works with his customers to help them build and deploy high available, scalable, and secure solutions. Dunieski holds 14 AWS certifications and is an AWS Golden Jacket recipient. In his free time, you will find him spending time with his family and dog, watching a great movie, coding, or flying his drone.

Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in Computer Science, his work covers broad range of ML use cases primarily focusing on LLM training/inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

AWS costs estimation using Amazon Q CLI and AWS Cost Analysis MCP

June 27, 2025

by Joel Asante Amazon AWS

Managing and optimizing AWS infrastructure costs is a critical challenge for organizations of all sizes. Traditional cost analysis approaches often involve the following:

Complex spreadsheets – Creating and maintaining detailed cost models, which requires significant effort
Multiple tools – Switching between the AWS Pricing Calculator, AWS Cost Explorer, and third-party tools
Specialized knowledge – Understanding the nuances of AWS pricing across services and AWS Regions
Time-consuming analysis – Manually comparing different deployment options and scenarios
Delayed optimization – Cost insights often come too late to inform architectural decisions

Amazon Q Developer CLI with the Model Context Protocol (MCP) offers a revolutionary approach to AWS cost analysis. By using generative AI through natural language prompts, teams can now generate detailed cost estimates, comparisons, and optimization recommendations in minutes rather than hours, while providing accuracy through integration with official AWS pricing data.

In this post, we explore how to use Amazon Q CLI with the AWS Cost Analysis MCP server to perform sophisticated cost analysis that follows AWS best practices. We discuss basic setup and advanced techniques, with detailed examples and step-by-step instructions.

Solution overview

Amazon Q Developer CLI is a command line interface that brings the generative AI capabilities of Amazon Q directly to your terminal. Developers can interact with Amazon Q through natural language prompts, making it an invaluable tool for various development tasks.
Developed by Anthropic as an open protocol, the Model Context Protocol (MCP) provides a standardized way to connect AI models to different data sources or tools. Using a client-server architecture (as illustrated in the following diagram), the MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), or other AI applications
Client – Protocol clients that maintain one-to-one connections with servers
Server – Lightweight programs that expose capabilities through standardized MCP or act as tools
Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect with

As announced in April 2025, the MCP enables Amazon Q Developer to connect with specialized servers that extend its capabilities beyond what’s possible with the base model alone. MCP servers act as plugins for Amazon Q, providing domain-specific knowledge and functionality. The AWS Cost Analysis MCP server specifically enables Amazon Q to generate detailed cost estimates, reports, and optimization recommendations using real-time AWS pricing data.

Prerequisites

To implement this solution, you must have an AWS account with appropriate permissions and follow the steps below.

Set up your environment

Before you can start analyzing costs, you need to set up your environment with Amazon Q CLI and the AWS Cost Analysis MCP server. This section provides detailed instructions for installation and configuration.

Install Amazon Q Developer CLI

Amazon Q Developer CLI is available as a standalone installation. Complete the following steps to install it:

Download and install Amazon Q Developer CLI. For instructions, see Using Amazon Q Developer on the command line.
Verify the installation by running the following command: q --version
You should see output similar to the following: Amazon Q Developer CLI version 1.x.x
Configure Amazon Q CLI with your AWS credentials: q login
Choose the login method suitable for you:
- Use for free with AWS Builder ID
- Use with Pro license

Set up MCP servers

Before using the AWS Cost Analysis MCP server with Amazon Q CLI, you must install several tools and configure your environment. The following steps guide you through installing the necessary tools and setting up the MCP server configuration:

Install Panoc using the following command (you can install with brew as well), converting the output to PDF: pip install pandoc
Install uv with the following command: pip install uv
Install Python 3.10 or newer: uv python install 3.10

Add the servers to your ~/.aws/amazonq/mcp.json file:

{
  "mcpServers": {
    "awslabs.cost-analysis-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cost-analysis-mcp-server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    }
  }
}

Now, Amazon Q CLI automatically discovers MCP servers in the ~/.aws/amazonq/mcp.json file.

Understanding MCP server tools

The AWS Cost Analysis MCP server provides several powerful tools:

get_pricing_from_web – Retrieves pricing information from AWS pricing webpages
get_pricing_from_api – Fetches pricing data from the AWS Price List API
generate_cost_report – Creates detailed cost analysis reports with breakdowns and visualizations
analyze_cdk_project – Analyzes AWS Cloud Development Kit (AWS CDK) projects to identify services used and estimate costs
analyze_terraform_project – Analyzes Terraform projects to identify services used and estimate costs
get_bedrock_patterns – Retrieves architecture patterns for Amazon Bedrock with cost considerations

These tools work together to help you create accurate cost estimates that follow AWS best practices.

Test your setup

Let’s verify that everything is working correctly by generating a simple cost analysis:

Start the Amazon Q CLI chat interface and verify the output shows the MCP server being loaded and initialized: q chat
In the chat interface, enter the following prompt:Please create a cost analysis for a simple web application with an Application Load Balancer, two t3.medium EC2 instances, and an RDS db.t3.medium MySQL database. Assume 730 hours of usage per month and moderate traffic of about 100 GB data transfer. Convert estimation to a PDF format.
Amazon Q CLI will ask for permission to trust the tool that is being used; enter t to trust it. Amazon Q should generate and display a detailed cost analysis. Your output should look like the following screenshot.

If you see the cost analysis report, your environment is set up correctly. If you encounter issues, verify that Amazon Q CLI can access the MCP servers by making sure you installed install the necessary tools and the servers are in the ~/.aws/amazonq/mcp.json file.

Configuration options

The AWS Cost Analysis MCP server supports several configuration options to customize your cost analysis experience:

Output format – Choose between markdown, CSV formats, or PDF (which we installed the package for) for cost reports
Pricing model – Specify on-demand, reserved instances, or savings plans
Assumptions and exclusions – Customize the assumptions and exclusions in your cost analysis
Detailed cost data – Provide specific usage patterns for more accurate estimates

Now that our environment is set up, let’s create more cost analyses.

Create AWS Cost Analysis reports

In this section, we walk through the process of creating AWS cost analysis reports using Amazon Q CLI with the AWS Cost Analysis MCP server.

When you provide a prompt to Amazon Q CLI, the AWS Cost Analysis MCP server completes the following steps:

Interpret your requirements.
Retrieve pricing data from AWS pricing sources.
Generate a detailed cost analysis report.
Provide optimization recommendations.

This process happens seamlessly, so you can focus on describing what you want rather than how to create it.

AWS Cost Analysis reports typically include the following information:

Service costs – Breakdown of costs by AWS service
Unit pricing – Detailed unit pricing information
Usage quantities – Estimated usage quantities for each service
Calculation details – Step-by-step calculations showing how costs were derived
Assumptions – Clearly stated assumptions used in the analysis
Exclusions – Costs that were not included in the analysis
Recommendations – Cost optimization suggestions

Example 1: Analyze a serverless application

Let’s create a cost analysis for a simple serverless application. Use the following prompt:

Create a cost analysis for a serverless application using API Gateway, Lambda, and DynamoDB. Assume 1 million API calls per month, average Lambda execution time of 200ms with 512MB memory, and 10GB of DynamoDB storage with 5 million read requests and 1 million write requests per month. Convert estimation to a PDF format.

Upon entering your prompt, Amazon Q CLI will retrieve pricing data using the get_pricing_from_web or get_pricing_from_api tools, and will use generate_cost_report with awslabscost_analysis_mcp_server.

You should receive an output giving a detailed cost breakdown based on the prompt along with optimization recommendations.

The generated cost analysis shows the following information:

Amazon API Gateway costs for 1 million requests
AWS Lambda costs for compute time and requests
Amazon DynamoDB costs for storage, read, and write capacity
Total monthly cost estimate
Cost optimization recommendations

Example 2: Analyze multi-tier architectures

Multi-tier architectures separate applications into functional layers (presentation, application, and data) to improve scalability and security. This example analyzes costs for implementing such an architecture on AWS with components for each tier:

Create a cost analysis for a three-tier web application with a presentation tier (ALB and CloudFront), application tier (ECS with Fargate), and data tier (Aurora PostgreSQL). Include costs for 2 Fargate tasks with 1 vCPU and 2GB memory each, an Aurora db.r5.large instance with 100GB storage, an Application Load Balancer with 10

This time, we are formatting it into both PDF and DOCX.

The cost analysis shows the following information:

Presentation tier costs (Application Load Balancer and AWS CloudFront)
Application tier costs (Amazon Elastic Container Service (Amazon ECS) and AWS Fargate)
Data tier costs (Amazon Aurora PostgreSQL-Compatible Edition)
Detailed breakdown of each component’s pricing
Total monthly cost estimate
Cost optimization recommendations for each tier

Example 3: Compare deployment options

When deploying containers on AWS, choosing between Amazon ECS with Amazon Elastic Compute Cloud (Amazon EC2) or Fargate involves different cost structures and management overhead. This example compares these options to determine the most cost-effective solution for a specific workload:

Compare the costs between running a containerized application on ECS with EC2 launch type versus Fargate launch type. Assume 4 containers each needing 1 vCPU and 2GB memory, running 24/7 for a month. For EC2, use t3.medium instances. Provide a recommendation on which option is more cost-effective for this workload. Convert estimation to a HTML webpage.

This time, we are formatting it into a HTML webpage.

The cost comparison includes the following information:

Amazon ECS with Amazon EC2 launch type costs
Amazon ECS with Fargate launch type costs
Detailed breakdown of each option’s pricing components
Side-by-side comparison of total costs
Recommendations for the most cost-effective option
Considerations for when each option might be preferred

Real-world examples

Let’s explore some real-world architecture patterns and how to analyze their costs using Amazon Q CLI with the AWS Cost Analysis MCP server.

Ecommerce platform

Ecommerce platforms require scalable, resilient architectures with careful cost management. These systems typically use microservices to handle various functions independently while maintaining high availability. This example analyzes costs for a complete ecommerce solution with multiple components serving moderate traffic levels:

Create a cost analysis for an e-commerce platform with microservices architecture. Include components for product catalog, shopping cart, checkout, payment processing, order management, and user authentication. Assume moderate traffic of 500,000 monthly active users, 2 million page views per day, and 50,000 orders per month. Ensure the analysis follows AWS best practices for cost optimization. Convert estimation to a PDF format.

The cost analysis includes the following key components:

Frontend delivery costs (Amazon Simple Storage Service (Amazon S3) and CloudFront)
API Gateway and Lambda costs for serverless components
Container costs for microservices (Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon ECS)
Database costs (Amazon Relational Database Service (Amazon RDS) and DynamoDB)
Caching costs (Amazon ElastiCache)
Storage and data transfer costs
Monitoring and security costs
Total monthly cost estimate
Cost optimization recommendations for each component
Reserved instance and savings plan opportunities

Data analytics platform

Modern data analytics platforms need to efficiently ingest, store, process, and visualize large volumes of data while managing costs effectively. This example examines the AWS services and costs involved in building a complete analytics pipeline handling significant daily data volumes with multiple user access requirements:

Create a cost analysis for a data analytics platform processing 500GB of new data daily. Include components for data ingestion (Kinesis), storage (S3), processing (EMR), and visualization (QuickSight). Assume 50 users accessing dashboards daily and data retention of 90 days. Ensure the analysis follows AWS best practices for cost optimization and includes recommendations for cost-effective scaling. Convert estimation to a HTML webpage.

The cost analysis includes the following key components:

Data ingestion costs (Amazon Kinesis Data Streams and Amazon Data Firehose)
Storage costs (Amazon S3 with lifecycle policies)
Processing costs (Amazon EMR cluster)
Visualization costs (Amazon QuickSight)
Data transfer costs between services
Total monthly cost estimate
Cost optimization recommendations for each component
Scaling considerations and their cost implications

Clean up

If you no longer need to use the AWS Cost Analysis MCP server with Amazon Q CLI, you can remove it from your configuration:

Open your ~/.aws/amazonq/mcp.json file.
Remove or comment out the “awslabs.cost-analysis-mcp-server” entry.
Save the file.

This will prevent the server from being loaded when you start Amazon Q CLI in the future.

Conclusion

In this post, we explored how to use Amazon Q CLI with the AWS Cost Analysis MCP server to create detailed cost analyses that use accurate AWS pricing data. This approach offers significant advantages over traditional cost estimation methods:

Time savings – Generate complex cost analyses in minutes instead of hours
Accuracy – Make sure estimates use the latest AWS pricing information
Comprehensive – Include relevant cost components and considerations
Actionable – Receive specific optimization recommendations
Iterative – Quickly compare different scenarios through simple prompts
Validation – Check estimates against official AWS pricing

As you continue exploring AWS cost analysis, we encourage you to deepen your knowledge by learning more about the Model Context Protocol (MCP) to understand how it enhances the capabilities of Amazon Q. For hands-on cost estimation, the AWS Pricing Calculator offers an interactive experience to model and compare different deployment scenarios. To make sure your architectures follow financial best practices, the AWS Well-Architected Framework Cost Optimization Pillar provides comprehensive guidance on building cost-efficient systems. And to stay at the cutting edge of these tools, keep an eye on updates to the official AWS MCP servers—they’re constantly evolving with new features to make your cost analysis experience even more powerful and accurate.

About the Authors

Tailor responsible AI with new safeguard tiers in Amazon Bedrock Guardrails

June 26, 2025

by Koushik Kethamakka Amazon AWS

Amazon Bedrock Guardrails provides configurable safeguards to help build trusted generative AI applications at scale. It provides organizations with integrated safety and privacy safeguards that work across multiple foundation models (FMs), including models available in Amazon Bedrock, as well as models hosted outside Amazon Bedrock from other model providers and cloud providers. With the standalone ApplyGuardrail API, Amazon Bedrock Guardrails offers a model-agnostic and scalable approach to implementing responsible AI policies for your generative AI applications. Guardrails currently offers six key safeguards: content filters, denied topics, word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks (preview), to help prevent unwanted content and align AI interactions with your organization’s responsible AI policies.

As organizations strive to implement responsible AI practices across diverse use cases, they face the challenge of balancing safety controls with varying performance and language requirements across different applications, making a one-size-fits-all approach ineffective. To address this, we’ve introduced safeguard tiers for Amazon Bedrock Guardrails, so you can choose appropriate safeguards based on your specific needs. For instance, a financial services company can implement comprehensive, multi-language protection for customer-facing AI assistants while using more focused, lower-latency safeguards for internal analytics tools, making sure each application upholds responsible AI principles with the right level of protection without compromising performance or functionality.

In this post, we introduce the new safeguard tiers available in Amazon Bedrock Guardrails, explain their benefits and use cases, and provide guidance on how to implement and evaluate them in your AI applications.

Solution overview

Until now, when using Amazon Bedrock Guardrails, you were provided with a single set of the safeguards associated to specific AWS Regions and a limited set of languages supported. The introduction of safeguard tiers in Amazon Bedrock Guardrails provides three key advantages for implementing AI safety controls:

A tier-based approach that gives you control over which guardrail implementations you want to use for content filters and denied topics, so you can select the appropriate protection level for each use case. We provide more details about this in the following sections.
Cross-Region Inference Support (CRIS) for Amazon Bedrock Guardrails, so you can use compute capacity across multiple Regions, achieving better scaling and availability for your guardrails. With this, your requests get automatically routed during guardrail policy evaluation to the optimal Region within your geography, maximizing available compute resources and model availability. This helps maintain guardrail performance and reliability when demand increases. There’s no additional cost for using CRIS with Amazon Bedrock Guardrails, and you can select from specific guardrail profiles for controlling model versioning and future upgrades.
Advanced capabilities as a configurable tier option for use cases where more robust protection or broader language support are critical priorities, and where you can accommodate a modest latency increase.

Safeguard tiers are applied at the guardrail policy level, specifically for content filters and denied topics. You can tailor your protection strategy for different aspects of your AI application. Let’s explore the two available tiers:

Classic tier (default):
- Maintains the existing behavior of Amazon Bedrock Guardrails
- Limited language support: English, French, and Spanish
- Does not require CRIS for Amazon Bedrock Guardrails
- Optimized for lower-latency applications
Standard tier:
- Provided as a new capability that you can enable for existing or new guardrails
- Multilingual support for more than 60 languages
- Enhanced robustness against prompt typos and manipulated inputs
- Enhanced prompt attack protection covering modern jailbreak and prompt injection techniques, including token smuggling, AutoDAN, and many-shot, among others
- Enhanced topic detection with improved understanding and handling of complex topics
- Requires the use of CRIS for Amazon Bedrock Guardrails and might have a modest increase in latency profile compared to the Classic tier option

You can select each tier independently for content filters and denied topics policies, allowing for mixed configurations within the same guardrail, as illustrated in the following hierarchy. With this flexibility, companies can implement the right level of protection for each specific application.

Policy: Content filters
- Tier: Classic or Standard
Policy: Denied topics
- Tier: Classic or Standard
Other policies: Word filters, sensitive information filters, contextual grounding checks, and Automated Reasoning checks (preview)

To illustrate how these tiers can be applied, consider a global financial services company deploying AI in both customer-facing and internal applications:

For their customer service AI assistant, they might choose the Standard tier for both content filters and denied topics, to provide comprehensive protection across many languages.
For internal analytics tools, they could use the Classic tier for content filters prioritizing low latency, while implementing the Standard tier for denied topics to provide robust protection against sensitive financial information disclosure.

You can configure the safeguard tiers for content filters and denied topics in each guardrail through the AWS Management Console, or programmatically through the Amazon Bedrock SDK and APIs. You can use a new or existing guardrail. For information on how to create or modify a guardrail, see Create your guardrail.

Your existing guardrails are automatically set to the Classic tier by default to make sure you have no impact on your guardrails’ behavior.

Quality enhancements with the Standard tier

According to our tests, the new Standard tier improves harmful content filtering recall by more than 15% with a more than 7% gain in balanced accuracy compared to the Classic tier. A key differentiating feature of the new Standard tier is its multilingual support, maintaining strong performance with over 78% recall and over 88% balanced accuracy for the most common 14 languages.The enhancements in protective capabilities extend across several other aspects. For example, content filters for prompt attacks in the Standard tier show a 30% improvement in recall and 16% gain in balanced accuracy compared to the Classic tier, while maintaining a lower false positive rate. For denied topic detection, the new Standard tier delivers a 32% increase in recall, resulting in an 18% improvement in balanced accuracy.These substantial evolutions in detection capabilities for Amazon Bedrock Guardrails, combined with consistently low false positive rates and robust multilingual performance, also represent a significant advancement in content protection technology compared to other commonly available solutions. The multilingual improvements are particularly noteworthy, with the new Standard tier in Amazon Bedrock Guardrails showing consistent performance gains of 33–49% in recall across different language evaluations compared to other competitors’ options.

Benefits of safeguard tiers

Different AI applications have distinct safety requirements based on their audience, content domain, and geographic reach. For example:

Customer-facing applications often require stronger protection against potential misuse compared to internal applications
Applications serving global customers need guardrails that work effectively across many languages
Internal enterprise tools might prioritize controlling specific topics in just a few primary languages

The combination of the safeguard tiers with CRIS for Amazon Bedrock Guardrails also addresses various operational needs with practical benefits that go beyond feature differences:

Independent policy evolution – Each policy (content filters or denied topics) can evolve at its own pace without disrupting the entire guardrail system. You can configure these with specific guardrail profiles in CRIS for controlling model versioning in the models powering your guardrail policies.
Controlled adoption – You decide when and how to adopt new capabilities, maintaining stability for production applications. You can continue to use Amazon Bedrock Guardrails with your previous configurations without changes and only move to the new tiers and CRIS configurations when you consider it appropriate.
Resource efficiency – You can implement enhanced protections only where needed, balancing security requirements with performance considerations.
Simplified migration path – When new capabilities become available, you can evaluate and integrate them gradually by policy area rather than facing all-or-nothing choices. This also simplifies testing and comparison mechanisms such as A/B testing or blue/green deployments for your guardrails.

This approach helps organizations balance their specific protection requirements with operational considerations in a more nuanced way than a single-option system could provide.

Configure safeguard tiers on the Amazon Bedrock console

On the Amazon Bedrock console, you can configure the safeguard tiers for your guardrail in the Content filters tier or Denied topics tier sections by selecting your preferred tier.

Use of the new Standard tier requires setting up cross-Region inference for Amazon Bedrock Guardrails, choosing the guardrail profile of your choice.

Configure safeguard tiers using the AWS SDK

You can also configure the guardrail’s tiers using the AWS SDK. The following is an example to get started with the Python SDK:

import boto3
import json

bedrock = boto3.client(
    "bedrock",
    region_name="us-east-1"
)

# Create a guardrail with Standard tier for both Content Filters and Denied Topics
response = bedrock.create_guardrail(
    name="enhanced-safety-guardrail",
    # cross-Region is required for STANDARD tier
    crossRegionConfig={
        'guardrailProfileIdentifier': 'us.guardrail.v1:0'
    },
    # Configure Denied Topics with Standard tier
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "Financial Advice",
                "definition": "Providing specific investment advice or financial recommendations",
                "type": "DENY",
                "inputEnabled": True,
                "inputAction": "BLOCK",
                "outputEnabled": True,
                "outputAction": "BLOCK"
            }
        ],
        "tierConfig": {
            "tierName": "STANDARD"
        }
    },
    # Configure Content Filters with Standard tier
    contentPolicyConfig={
        "filtersConfig": [
            {
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "type": "SEXUAL"
            },
            {
                "inputStrength": "HIGH",
                "outputStrength": "HIGH",
                "type": "VIOLENCE"
            }
        ],
        "tierConfig": {
            "tierName": "STANDARD"
        }
    },
    blockedInputMessaging="I cannot respond to that request.",
    blockedOutputsMessaging="I cannot provide that information."
)

Within a given guardrail, the content filter and denied topic policies can be configured with its own tier independently, giving you granular control over how guardrails behave. For example, you might choose the Standard tier for content filtering while keeping denied topics in the Classic tier, based on your specific requirements.

For migrating existing guardrails’ configurations to use the Standard tier, add the sections highlighted in the preceding example for crossRegionConfig and tierConfig to your current guardrail definition. You can do this using the UpdateGuardrail API, or create a new guardrail with the CreateGuardrail API.

Evaluating your guardrails

To thoroughly evaluate your guardrails’ performance, consider creating a test dataset that includes the following:

Safe examples – Content that should pass through guardrails
Harmful examples – Content that should be blocked
Edge cases – Content that tests the boundaries of your policies
Examples in multiple languages – Especially important when using the Standard tier

You can also rely on openly available datasets for this purpose. Ideally, your dataset should be labeled with the expected response for each case for assessing accuracy and recall of your guardrails.

With your dataset ready, you can use the Amazon Bedrock ApplyGuardrail API as shown in the following example to efficiently test your guardrail’s behavior for user inputs without invoking FMs. This way, you can save the costs associated with the large language model (LLM) response generation.

import boto3
import json

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1"
)

# Test the guardrail with potentially problematic content
content = [
    {
        "text": {
            "text": "Your test prompt here"
        }
    }
]

response = bedrock_runtime.apply_guardrail(
    content=content,
    source="INPUT",
    guardrailIdentifier="your-guardrail-id",
    guardrailVersion="DRAFT"
)

print(json.dumps(response, indent=2, default=str))

Later, you can repeat the process for the outputs of the LLMs if needed. For this, you can use the ApplyGuardrail API if you want an independent evaluation for models in AWS or outside in another provider, or you can directly use the Converse API if you intend to use models in Amazon Bedrock. When using the Converse API, the inputs and outputs are evaluated with the same invocation request, optimizing latency and reducing coding overheads.

Because your dataset is labeled, you can directly implement a mechanism for assessing the accuracy, recall, and potential false negatives or false positives through the use of libraries like SKLearn Metrics:

# scoring script
# labels and preds store list of ground truth label and guardrails predictions

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()

recall = tp / (tp + fn) if (tp + fn) != 0 else 0
fpr = fp / (fp + tn) if (fp + tn) != 0 else 0
balanced_accuracy = 0.5 * (recall + 1 - fpr)

Alternatively, if you don’t have labeled data or your use cases have subjective responses, you can also rely on mechanisms such as LLM-as-a-judge, where you pass the inputs and guardrails’ evaluation outputs to an LLM for assessing a score based on your own predefined criteria. For more information, see Automate building guardrails for Amazon Bedrock using test-drive development.

Best practices for implementing tiers

We recommend considering the following aspects when configuring your tiers for Amazon Bedrock Guardrails:

Start with staged testing – Test both tiers with a representative sample of your expected inputs and responses before making broad deployment decisions.
Consider your language requirements – If your application serves users in multiple languages, the Standard tier’s expanded language support might be essential.
Balance safety and performance – Evaluate both the accuracy improvements and latency differences to make informed decisions. Consider if you can afford a few additional milliseconds of latency for improved robustness with the Standard tier or prefer a latency-optimized option for more straight forward evaluations with the Classic tier.
Use policy-level tier selection – Take advantage of the ability to select different tiers for different policies to optimize your guardrails. You can choose separate tiers for content filters and denied topics, while combining with the rest of the policies and features available in Amazon Bedrock Guardrails.
Remember cross-Region requirements – The Standard tier requires cross-Region inference, so make sure your architecture and compliance requirements can accommodate this. With CRIS, your request originates from the Region where your guardrail is deployed, but it might be served from a different Region from the ones included in the guardrail inference profile for optimizing latency and availability.

Conclusion

The introduction of safeguard tiers in Amazon Bedrock Guardrails represents a significant step forward in our commitment to responsible AI. By providing flexible, powerful, and evolving safety tools for generative AI applications, we’re empowering organizations to implement AI solutions that are not only innovative but also ethical and trustworthy. This capabilities-based approach enables you to tailor your responsible AI practices to each specific use case. You can now implement the right level of protection for different applications while creating a path for continuous improvement in AI safety and ethics.The new Standard tier delivers significant improvements in multilingual support and detection accuracy, making it an ideal choice for many applications, especially those serving diverse global audiences or requiring enhanced protection. This aligns with responsible AI principles by making sure AI systems are fair and inclusive across different languages and cultures. Meanwhile, the Classic tier remains available for use cases prioritizing low latency or those with simpler language requirements, allowing organizations to balance performance with protection as needed.

By offering these customizable protection levels, we’re supporting organizations in their journey to develop and deploy AI responsibly. This approach helps make sure that AI applications are not only powerful and efficient but also align with organizational values, comply with regulations, and maintain user trust.

To learn more about safeguard tiers in Amazon Bedrock Guardrails, refer to Detect and filter harmful content by using Amazon Bedrock Guardrails, or visit the Amazon Bedrock console to create your first tiered guardrail.

About the Authors

Koushik Kethamakka is a Senior Software Engineer at AWS, focusing on AI/ML initiatives. At Amazon, he led real-time ML fraud prevention systems for Amazon.com before moving to AWS to lead development of AI/ML services like Amazon Lex and Amazon Bedrock. His expertise spans product and system design, LLM hosting, evaluations, and fine-tuning. Recently, Koushik’s focus has been on LLM evaluations and safety, leading to the development of products like Amazon Bedrock Evaluations and Amazon Bedrock Guardrails. Prior to joining Amazon, Koushik earned his MS from the University of Houston.

Hang Su is a Senior Applied Scientist at AWS AI. He has been leading the Amazon Bedrock Guardrails Science team. His interest lies in AI safety topics, including harmful content detection, red-teaming, sensitive information detection, among others.

Shyam Srinivasan is on the Amazon Bedrock product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Aartika Sardana Chandras is a Senior Product Marketing Manager for AWS Generative AI solutions, with a focus on Amazon Bedrock. She brings over 15 years of experience in product marketing, and is dedicated to empowering customers to navigate the complexities of the AI lifecycle. Aartika is passionate about helping customers leverage powerful AI technologies in an ethical and impactful manner.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services, specializing in Amazon Bedrock security. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies and security principles allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value while maintaining robust security postures.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at Amazon Web Services. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Structured data response with Amazon Bedrock: Prompt Engineering and Tool Use

June 26, 2025

by Adam Nemeth Amazon AWS

Generative AI is revolutionizing industries by streamlining operations and enabling innovation. While textual chat interactions with GenAI remain popular, real-world applications often depend on structured data for APIs, databases, data-driven workloads, and rich user interfaces. Structured data can also enhance conversational AI, enabling more reliable and actionable outputs. A key challenge is that LLMs (Large Language Models) are inherently unpredictable, which makes it difficult for them to produce consistently structured outputs like JSON. This challenge arises because their training data mainly includes unstructured text, such as articles, books, and websites, with relatively few examples of structured formats. As a result, LLMs can struggle with precision when generating JSON outputs, which is crucial for seamless integration into existing APIs and databases. Models vary in their ability to support structured responses, including recognizing data types and managing complex hierarchies effectively. These capabilities can make a difference when choosing the right model.

This blog demonstrates how Amazon Bedrock, a managed service for securely accessing top AI models, can help address these challenges by showcasing two alternative options:

Prompt Engineering: A straightforward approach to shaping structured outputs using well-crafted prompts.
Tool Use with the Bedrock Converse API: An advanced method that enables better control, consistency, and native JSON schema integration.

We will use a customer review analysis example to demonstrate how Bedrock generates structured outputs, such as sentiment scores, with simplified Python code.

Building a prompt engineering solution

This section will demonstrate how to use prompt engineering effectively to generate structured outputs using Amazon Bedrock. Prompt engineering involves crafting precise input prompts to guide large language models (LLMs) in producing consistent and structured responses. It is a fundamental technique for developing Generative AI applications, particularly when structured outputs are required.Here are the five key steps we will follow:

Configure the Bedrock client and runtime parameters.
Create a JSON schema for structured outputs.
Craft a prompt and guide the model with clear instructions and examples.
Add a customer review as input data to analyse.
Invoke Bedrock, call the model, and process the response.

While we demonstrate customer review analysis to generate a JSON output, these methods can also be used with other formats like XML or CSV.

Step 1: Configure Bedrock

To begin, we’ll set up some constants and initialize a Python Bedrock client connection object using the Python Boto3 SDK for Bedrock runtime, which facilitates interaction with Bedrock:

The REGION specifies the AWS region for model execution, while the MODEL_ID identifies the specific Bedrock model. The TEMPERATURE constant controls the output randomness, where higher values increase creativity, and lower values maintain precision, such as when generating structured output. MAX_TOKENS determines the output length, balancing cost-efficiency and data completeness.

Step 2: Define the Schema

Defining a schema is essential for facilitating structured and predictable model outputs, maintaining data integrity, and enabling seamless API integration. Without a well-defined schema, models may generate inconsistent or incomplete responses, leading to errors in downstream applications. The JSON standard schema used in the code below serves as a blueprint for structured data generation, guiding the model on how to format its output with explicit instructions.

Let’s create a JSON schema for customer reviews with three required fields: reviewId (string, max 50 chars), sentiment (number, -1 to 1), and summary (string, max 200 chars).

Step 3: Craft the Prompt text

To generate consistent, structured, and accurate responses, prompts must be clear and well-structured, as LLMs rely on precise input to produce reliable outputs. Poorly designed prompts can lead to ambiguity, errors, or formatting issues, disrupting structured workflows, so we follow these best practices:

Clearly outline the AI’s role and objectives to avoid ambiguity.
Divide tasks into smaller, manageable numbered steps for clarity.
Indicate that a JSON schema will be provided (see Step 5 below) to maintain a consistent and valid structure.
Use one-shot prompting with a sample output to guide the model; add more examples if needed for consistency, but avoid too many, as they may limit the model’s ability to handle new inputs.
Define how to handle missing or invalid data.

Step 4: Integrate Input Data

For demonstration purposes, we’ll include a review text in the prompt as a Python variable:

Separating the input data with <input> tags improve readability and clarity, making it straightforward to identify and reference. This hardcoded input simulates real-world data integration. For production use, you might dynamically populate input data from APIs or user submissions.

Step 5: Call Bedrock

In this section, we construct a Bedrock request by defining a body object that includes the JSON schema, prompt, and input review data from previous steps. This structured request makes sure the model receives clear instructions, adheres to a predefined schema, and processes sample input data correctly. Once the request is prepared, we invoke Amazon Bedrock to generate a structured JSON response.

We reuse the MAX_TOKENS, TEMPERATURE, and MODEL_ID constants defined in Step 1. The body object has essential inference configurations like anthropic_version for model compatibility and the messages array, which includes a single message to provide the model with task instructions, the schema, and the input data. The role defines the “speaker” in the interaction context, with user value representing the program sending the request. Alternatively, we could simplify the input by combining instructions, schema, and data into one text prompt, which is straightforward to manage but less modular.

Finally, we use the client.invoke_model method to send the request. After invoking, the model processes the request, and the JSON data must be properly (not explained here) extracted from the Bedrock response. For example:

Tool Use with the Amazon Bedrock Converse API

In the previous chapter, we explored a solution using Bedrock Prompt Engineering. Now, let’s look at an alternative approach for generating structured responses with Bedrock.

We will extend the previous solution by using the Amazon Bedrock Converse API, a consistent interface designed to facilitate multi-turn conversations with Generative AI models. The API abstracts model-specific configurations, including inference parameters, simplifying integration.

A key feature of the Converse API is Tool Use (also known as Function Calling), which enables the model to execute external tools, such as calling an external API. This method supports standard JSON schema integration directly into tool definitions, facilitating output alignment with predefined formats. Not all Bedrock models support Tool Use, so make sure you check which models are compatible with these feature.

Building on the previously defined data, the following code provides a straightforward example of Tool Use tailored to our curstomer review use case:

In this code the tool_list defines a custom customer review analysis tool with its input schema and purpose, while the messages provide the earlier defined instructions and input data. Unlike in the previous prompt engineering example we used the earlier defined JSON schema in the definition of a tool. Finally, the client.converse call combines these components, specifying the tool to use and inference configurations, resulting in outputs tailored to the given schema and task. After exploring Prompt Engineering and Tool Use in Bedrock solutions for structured response generation, let’s now evaluate how different foundation models perform across these approaches.

Test Results: Claude Models on Amazon Bedrock

Understanding the capabilities of foundation models in structured response generation is essential for maintaining reliability, optimizing performance, and building scalable, future-proof Generative AI applications with Amazon Bedrock. To evaluate how well models handle structured outputs, we conducted extensive testing of Anthropic’s Claude models, comparing prompt-based and tool-based approaches across 1,000 iterations per model. Each iteration processed 100 randomly generated items, providing broad test coverage across different input variations.The examples shown earlier in this blog are intentionally simplified for demonstration purposes, where Bedrock performed seamlessly with no issues. To better assess the models under real-world challenges, we used a more complex schema that featured nested structures, arrays, and diverse data types to identify edge cases and potential issues. The outputs were validated for adherence to the JSON format and schema, maintaining consistency and accuracy. The following diagram summarizes the results, showing the number of successful, valid JSON responses for each model across the two demonstrated approaches: Prompt Engineering and Tool Use.

The results demonstrated that all models achieved over 93% success across both approaches, with Tool Use methods consistently outperforming prompt-based ones. While the evaluation was conducted using a highly complex JSON schema, simpler schemas result in significantly fewer issues, often nearly none. Future updates to the models are expected to further enhance performance.

Final Thoughts

In conclusion, we demonstrated two methods for generating structured responses with Amazon Bedrock: Prompt Engineering and Tool Use with the Converse API. Prompt Engineering is flexible, works with Bedrock models (including those without Tool Use support), and handles various schema types (e.g., Open API schemas), making it a great starting point. However, it can be fragile, requiring exact prompts and struggling with complex needs. On the other hand, Tool Use offers greater reliability, consistent results, seamless API integration, and runtime validation of JSON schema for enhanced control.

For simplicity, we did not demonstrate a few areas in this blog. Other techniques for generating structured responses include using models with built-in support for configurable response formats, such as JSON, when invoking models, or leveraging constraint decoding techniques with third-party libraries like LMQL. Additionally, generating structured data with GenAI can be challenging due to issues like invalid JSON, missing fields, or formatting errors. To maintain data integrity and handle unexpected outputs or API failures, effective error handling, thorough testing, and validation are essential.

To try the Bedrock techniques demonstrated in this blog, follow the steps to Run example Amazon Bedrock API requests through the AWS SDK for Python (Boto3). With pay-as-you-go pricing, you’re only charged for API calls, so little to no cleanup is required after testing. For more details on best practices, refer to the Bedrock prompt engineering guidelines and model-specific documentation, such as Anthropic’s best practices.

Structured data is key to leveraging Generative AI in real-world scenarios like APIs, data-driven workloads, and rich user interfaces beyond text-based chat. Start using Amazon Bedrock today to unlock its potential for reliable structured responses.

About the authors

Adam Nemeth is a Senior Solutions Architect at AWS, where he helps global financial customers embrace cloud computing through architectural guidance and technical support. With over 24 years of IT expertise, Adam previously worked at UBS before joining AWS. He lives in Switzerland with his wife and their three children.

Dominic Searle is a Senior Solutions Architect at Amazon Web Services, where he has had the pleasure of working with Global Financial Services customers as they explore how Generative AI can be integrated into their technology strategies. Providing technical guidance, he enjoys helping customers effectively leverage AWS Services to solve real business problems.