Build and deploy AI inference workflows with new enhancements to the Amazon SageMaker Python SDK

Amazon SageMaker Inference has been a popular tool for deploying advanced machine learning (ML) and generative AI models at scale. As AI applications become increasingly complex, customers want to deploy multiple models as a coordinated group that collectively processes inference requests for an application. In addition, with the evolution of generative AI applications, many use cases now require inference workflows—sequences of interconnected models operating in predefined logical flows. This trend drives a growing need for more sophisticated inference offerings.

To address this need, we are introducing a new capability in the SageMaker Python SDK that revolutionizes how you build and deploy inference workflows on SageMaker. We use Amazon Search as an example to showcase how this feature helps customers build inference workflows. This new Python SDK capability provides a streamlined and simplified experience that abstracts away the underlying complexities of packaging and deploying groups of models and their collective inference logic, allowing you to focus on what matters most—your business logic and model integrations.

In this post, we provide an overview of the user experience, detailing how to set up and deploy these workflows with multiple models using the SageMaker Python SDK. We walk through examples of building complex inference workflows, deploying them to SageMaker endpoints, and invoking them for real-time inference. We also show how customers like Amazon Search plan to use SageMaker Inference workflows to provide more relevant search results to Amazon shoppers.

Whether you are building a simple two-step process or a complex, multimodal AI application, this new feature provides the tools you need to bring your vision to life. This tool aims to make it easy for developers and businesses to create and manage complex AI systems, helping them build more powerful and efficient AI applications.

In the following sections, we dive deeper into details of the SageMaker Python SDK, walk through practical examples, and showcase how this new capability can transform your AI development and deployment process.

Key improvements and user experience

The SageMaker Python SDK now includes new features for creating and managing inference workflows. These additions aim to address common challenges in developing and deploying inference workflows:

  • Deployment of multiple models – The core of this new experience is the deployment of multiple models as inference components within a single SageMaker endpoint. With this approach, you can create a more unified inference workflow. By consolidating multiple models into one endpoint, you can reduce the number of endpoints that need to be managed. This consolidation can also simplify operational tasks, improve resource utilization, and potentially reduce costs.
  • Workflow definition with workflow mode – The new workflow mode extends the existing Model Builder capabilities. It allows for the definition of inference workflows using Python code. Users familiar with the ModelBuilder class might find this feature to be an extension of their existing knowledge. This mode enables creating multi-step workflows, connecting models, and specifying the data flow between different models in the workflows. The goal is to reduce the complexity of managing these workflows and enable you to focus more on the logic of the resulting compound AI system.
  • Development and deployment options – A new deployment option has been introduced for the development phase. This feature is designed to allow for quicker deployment of workflows to development environments. The intention is to enable faster testing and refinement of workflows. This could be particularly relevant when experimenting with different configurations or adjusting models.
  • Invocation flexibility – The SDK now provides options for invoking individual models or entire workflows. You can choose to call a specific inference component used in a workflow or the entire workflow. This flexibility can be useful in scenarios where access to a specific model is needed, or when only a portion of the workflow needs to be executed.
  • Dependency management – You can use SageMaker Deep Learning Containers (DLCs) or the SageMaker distribution that comes preconfigured with various model serving libraries and tools. These are intended to serve as a starting point for common use cases.

To get started, use the SageMaker Python SDK to deploy your models as inference components. Then, use the workflow mode to create an inference workflow, represented as Python code using the container of your choice. Deploy the workflow container as another inference component on the same endpoint as the models or on a dedicated endpoint. You can run the workflow by invoking the inference component that represents the workflow. The user experience is entirely code-based, using the SageMaker Python SDK. This approach allows you to define, deploy, and manage inference workflows using the SDK abstractions offered by this feature and Python programming. The workflow mode provides flexibility to specify complex sequences of model invocations and data transformations, and the option to deploy as components or endpoints caters to various scaling and integration needs.

Solution overview

The following diagram illustrates a reference architecture using the SageMaker Python SDK.

The improved SageMaker Python SDK introduces a more intuitive and flexible approach to building and deploying AI inference workflows. Let’s explore the key components and classes that make up the experience:

  • ModelBuilder simplifies the process of packaging individual models as inference components. It handles model loading, dependency management, and container configuration automatically.
  • The CustomOrchestrator class provides a standardized way to define custom inference logic that orchestrates multiple models in the workflow. Users implement the handle() method to specify this logic and can use an orchestration library or none at all (plain Python).
  • A single deploy() call handles the deployment of the components and workflow orchestrator.
  • The Python SDK supports invocation against the custom inference workflow or individual inference components.
  • The Python SDK supports both synchronous and streaming inference.

CustomOrchestrator is an abstract base class that serves as a template for defining custom inference orchestration logic. It standardizes the structure of entry point-based inference scripts, making it straightforward for users to create consistent and reusable code. The handle method in the class is an abstract method that users implement to define their custom orchestration logic.

from abc import ABC, abstractmethod


class CustomOrchestrator(ABC):
    """
    Templated class used to standardize the structure of an entry point based inference script.
    """

    @abstractmethod
    def handle(self, data, context=None):
        """Abstract method defining an entry point for the model server."""
        return NotImplemented

With this templated class, users can integrate their custom workflow code and then point to this code in the model builder using a file path or directly using a class or method name. Using this class together with the ModelBuilder class enables a more streamlined workflow for AI inference:

  1. Users define their custom workflow by implementing the CustomOrchestrator class.
  2. The custom CustomOrchestrator is passed to ModelBuilder using the ModelBuilder inference_spec parameter.
  3. ModelBuilder packages the CustomOrchestrator along with the model artifacts.
  4. The packaged model is deployed to a SageMaker endpoint (for example, using a TorchServe container).
  5. When invoked, the SageMaker endpoint uses the custom handle() function defined in the CustomOrchestrator to handle the input payload.

In the following sections, we provide two examples of custom workflow orchestrators implemented with plain Python code. For simplicity, the examples use two inference components.

We explore how to create a simple workflow that deploys two large language models (LLMs) on SageMaker Inference endpoints along with a simple Python orchestrator that calls the two models. We create an IT customer service workflow where one model processes the initial request and another suggests solutions. You can find the example notebook in the GitHub repo.

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with least-privilege permissions to manage resources created. For details, refer to Create an AWS account. You might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, we host multiple models on the same SageMaker endpoint, so we use two ml.g5.24xlarge SageMaker hosting instances.
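
If you’re not sure whether your account already has capacity for these instances, you can check your current SageMaker hosting quotas programmatically before deploying. The following is a minimal sketch using the Service Quotas API; the region and the quota name filter are assumptions, so adjust them to match how the quota appears in your account.

import boto3

# List SageMaker quotas that mention the hosting instance type used in this post.
# The name filter below is an assumption about how the quota is labeled; adjust as needed.
quotas_client = boto3.client("service-quotas", region_name="us-east-1")

paginator = quotas_client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        name = quota["QuotaName"]
        if "ml.g5.24xlarge" in name and "endpoint" in name.lower():
            print(f"{name}: {quota['Value']} (quota code: {quota['QuotaCode']})")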

Python inference orchestration

First, let’s define our custom orchestration class that inherits from CustomOrchestrator. The workflow is structured around a custom inference entry point that handles the request data, processes it, and retrieves predictions from the configured model endpoints. See the following code:

import json

import boto3


class PythonCustomInferenceEntryPoint(CustomOrchestrator):
    def __init__(self, region_name, endpoint_name, component_names):
        self.region_name = region_name
        self.endpoint_name = endpoint_name
        self.component_names = component_names
        # SageMaker Runtime client used to invoke the endpoint's inference components
        self.client = boto3.client("sagemaker-runtime", region_name=region_name)

    def preprocess(self, data):
        payload = {
            "inputs": data.decode("utf-8")
        }
        return json.dumps(payload)

    def _invoke_workflow(self, data):
        # First model (Llama) inference
        payload = self.preprocess(data)
        
        llama_response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=payload,
            ContentType="application/json",
            InferenceComponentName=self.component_names[0]
        )
        llama_generated_text = json.loads(llama_response.get('Body').read())['generated_text']
        
        # Second model (Mistral) inference
        parameters = {
            "max_new_tokens": 50
        }
        payload = {
            "inputs": llama_generated_text,
            "parameters": parameters
        }
        mistral_response = self.client.invoke_endpoint(
            EndpointName=self.endpoint_name,
            Body=json.dumps(payload),
            ContentType="application/json",
            InferenceComponentName=self.component_names[1]
        )
        return {"generated_text": json.loads(mistral_response.get('Body').read())['generated_text']}
    
    def handle(self, data, context=None):
        return self._invoke_workflow(data)

This code performs the following functions:

  • Defines the orchestration that sequentially calls two models using their inference component names
  • Processes the response from the first model before passing it to the second model
  • Returns the final generated response

This plain Python approach provides flexibility and control over the request-response flow, enabling seamless cascading of outputs across multiple model components.
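
Because the orchestrator is plain Python, you can also exercise its handle() method locally against an already-deployed endpoint before packaging it with ModelBuilder. The following is a minimal sketch; the region, endpoint name, and inference component names are placeholders for the values you use in your own deployment.

# Minimal local smoke test of the orchestration logic (placeholder names; assumes the
# endpoint and inference components referenced here have already been deployed)
orchestrator_logic = PythonCustomInferenceEntryPoint(
    region_name="us-east-1",
    endpoint_name="llama-mistral-endpoint",
    component_names=["llama-ic", "mistral-ic"],
)
response = orchestrator_logic.handle(b"My laptop will not connect to the office VPN. What should I try first?")
print(response["generated_text"])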

Build and deploy the workflow

To deploy the workflow, we first create our inference components and then build the custom workflow. One inference component will host a Meta Llama 3.1 8B model, and the other will host a Mistral 7B model.

from sagemaker.serve import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

# Create a ModelBuilder instance for Llama 3.1 8B.
# Pre-benchmarked ResourceRequirements will be taken from JumpStart, as Llama-3.1-8b is a supported model.
llama_model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-8b",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    inference_component_name=llama_ic_name,
    instance_type="ml.g5.24xlarge"
)

# Create a ModelBuilder instance for the Mistral 7B model.
mistral_mb = ModelBuilder(
    model="huggingface-llm-mistral-7b",
    instance_type="ml.g5.24xlarge",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    inference_component_name=mistral_ic_name,
    resource_requirements=ResourceRequirements(
        requests={
            "memory": 49152,
            "num_accelerators": 2,
            "copies": 1
        }
    )
)

Now we can tie it all together by creating one more ModelBuilder for the custom workflow, to which we pass the modelbuilder_list containing the ModelBuilder objects we just created for each inference component. Then we call the build() function to prepare the workflow for deployment.

from sagemaker.session import Session

# Create workflow ModelBuilder
orchestrator = ModelBuilder(
    inference_spec=PythonCustomInferenceEntryPoint(
        region_name=region,
        endpoint_name=llama_mistral_endpoint_name,
        component_names=[llama_ic_name, mistral_ic_name],
    ),
    dependencies={
        "auto": False,
        "custom": [
            "cloudpickle",
            "graphene",
            # Define other dependencies here.
        ],
    },
    sagemaker_session=Session(),
    role_arn=role,
    resource_requirements=ResourceRequirements(
        requests={
           "memory": 4096,
           "num_accelerators": 1,
           "copies": 1,
           "num_cpus": 2
        }
    ),
    name=custom_workflow_name, # Endpoint name for your custom workflow
    schema_builder=SchemaBuilder(sample_input={"inputs": "test"}, sample_output="Test"),
    modelbuilder_list=[llama_model_builder, mistral_mb] # Inference Component ModelBuilders created in Step 2
)
# call the build function to prepare the workflow for deployment
orchestrator.build()

In the preceding code snippet, you can comment out the section that defines the resource_requirements to have the custom workflow deployed on a separate endpoint instance, which can be a dedicated CPU instance to handle the custom workflow payload.
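
For reference, the following is a minimal sketch of that variation: the same orchestrator ModelBuilder with resource_requirements omitted, so the custom workflow is hosted on its own endpoint instance. All of the variables referenced here are assumed to be the ones defined earlier in this example.

# Sketch: orchestrator ModelBuilder without resource_requirements, so the custom
# workflow is deployed to its own endpoint instance instead of being co-hosted
# with the model inference components. Variables are assumed from the earlier snippets.
standalone_orchestrator = ModelBuilder(
    inference_spec=PythonCustomInferenceEntryPoint(
        region_name=region,
        endpoint_name=llama_mistral_endpoint_name,
        component_names=[llama_ic_name, mistral_ic_name],
    ),
    dependencies={"auto": False, "custom": ["cloudpickle", "graphene"]},
    sagemaker_session=Session(),
    role_arn=role,
    name=custom_workflow_name,
    schema_builder=SchemaBuilder(sample_input={"inputs": "test"}, sample_output="Test"),
    modelbuilder_list=[llama_model_builder, mistral_mb],
)
standalone_orchestrator.build()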

By calling the deploy() function, you deploy the custom workflow and the inference components to your desired instance type, in this example ml.g5.24xlarge. If you choose to deploy the custom workflow to a separate instance, by default, it will use the ml.c5.xlarge instance type. You can set inference_workflow_instance_type and inference_workflow_initial_instance_count to configure the instances required to host the custom workflow.

predictors = orchestrator.deploy(
    instance_type="ml.g5.24xlarge",
    initial_instance_count=1,
    accept_eula=True, # Required for Llama3
    endpoint_name=llama_mistral_endpoint_name
    # inference_workflow_instance_type="ml.t2.medium", # default
    # inference_workflow_initial_instance_count=1 # default
)

Invoke the endpoint

After you deploy the workflow, you can invoke the endpoint using the predictor object:

from sagemaker.serializers import JSONSerializer
predictors[-1].serializer = JSONSerializer()
predictors[-1].predict("Tell me a story about ducks.")

You can also invoke each inference component in the deployed endpoint. For example, we can test the Llama inference component with a synchronous invocation, and Mistral with streaming:

from sagemaker.predictor import Predictor

# Create a predictor for the inference component of the Llama model
llama_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=llama_ic_name)
llama_predictor.content_type = "application/json"

# Example payload for the Llama model (adjust the prompt and parameters as needed)
payload = {
    "inputs": "What are common causes of a slow office Wi-Fi connection?",
    "parameters": {"max_new_tokens": 100}
}
llama_predictor.predict(json.dumps(payload))

When handling the streaming response, we need to read each line of the output separately. The following example code demonstrates this streaming handling by checking for newline characters to separate and print each token in real time:

# Create a predictor for the inference component of the Mistral model
mistral_predictor = Predictor(endpoint_name=llama_mistral_endpoint_name, component_name=mistral_ic_name)
mistral_predictor.content_type = "application/json"

# Example prompt and generation parameters (adjust as needed)
prompt = "Suggest three troubleshooting steps for a laptop that won't power on."
parameters = {"max_new_tokens": 50}

body = json.dumps({
    "inputs": prompt,
    # Specify the parameters as needed
    "parameters": parameters
})

for line in mistral_predictor.predict_stream(body):
    decoded_line = line.decode('utf-8')
    if '\n' in decoded_line:
        # Split by newline to handle multiple tokens in the same line
        tokens = decoded_line.split('\n')
        for token in tokens[:-1]:  # Print all tokens except the last one with a newline
            print(token)
        # Print the last token without a newline, because it might be followed by more tokens
        print(tokens[-1], end='')
    else:
        # Print the token without a newline if it doesn't contain '\n'
        print(decoded_line, end='')

So far, we have walked through the example code to demonstrate how to build complex inference workflows using Python orchestration, deploy them to SageMaker endpoints, and invoke them for real-time inference. The Python SDK automatically handles the following:

  • Model packaging and container configuration
  • Dependency management and environment setup
  • Endpoint creation and component coordination

Whether you’re building a simple workflow of two models or a complex multimodal application, the new SDK provides the building blocks needed to bring your inference workflows to life with minimal boilerplate code.

Customer story: Amazon Search

Amazon Search is a critical component of the Amazon shopping experience, processing an enormous volume of queries across billions of products in diverse categories. At the core of this system are sophisticated matching and ranking workflows, which determine the order and relevance of search results presented to customers. These workflows execute large deep learning models in predefined sequences, often sharing models across different workflows to improve price-performance and accuracy. This approach makes sure that whether a customer is searching for electronics, fashion items, books, or other products, they receive the most pertinent results tailored to their query.

The SageMaker Python SDK enhancement offers valuable capabilities that align well with Amazon Search’s requirements for these ranking workflows. It provides a standard interface for developing and deploying complex inference workflows crucial for effective search result ranking. The enhanced Python SDK enables efficient reuse of shared models across multiple ranking workflows while maintaining the flexibility to customize logic for specific product categories. Importantly, it allows individual models within these workflows to scale independently, providing optimal resource allocation and performance based on varying demand across different parts of the search system.

Amazon Search is exploring the broad adoption of these Python SDK enhancements across their search ranking infrastructure. This initiative aims to further refine and improve search capabilities, enabling the team to build, version, and catalog workflows that power search ranking more effectively across different product categories. The ability to share models across workflows and scale them independently offers new levels of efficiency and adaptability in managing the complex search ecosystem.

Vaclav Petricek, Sr. Manager of Applied Science at Amazon Search, highlighted the potential impact of these SageMaker Python SDK enhancements: “These capabilities represent a significant advancement in our ability to develop and deploy sophisticated inference workflows that power search matching and ranking. The flexibility to build workflows using Python, share models across workflows, and scale them independently is particularly exciting, as it opens up new possibilities for optimizing our search infrastructure and rapidly iterating on our matching and ranking algorithms as well as new AI features. Ultimately, these SageMaker Inference enhancements will allow us to more efficiently create and manage the complex algorithms powering Amazon’s search experience, enabling us to deliver even more relevant results to our customers.”

The following diagram illustrates a sample solution architecture used by Amazon Search.

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if it is no longer required. You can follow the cleanup section of the demo notebook or use the following code to delete the models and endpoint created by the demo:

mistral_predictor.delete_predictor()
llama_predictor.delete_predictor()
workflow_predictor = predictors[-1]  # the last predictor returned by deploy() corresponds to the custom workflow
workflow_predictor.delete_predictor()
llama_predictor.delete_endpoint()

Conclusion

The new SageMaker Python SDK enhancements for inference workflows mark a significant advancement in the development and deployment of complex AI inference workflows. By abstracting the underlying complexities, these enhancements empower inference customers to focus on innovation rather than infrastructure management. This feature bridges sophisticated AI applications with the robust SageMaker infrastructure, enabling developers to use familiar Python-based tools while harnessing the powerful inference capabilities of SageMaker.

Early adopters, including Amazon Search, are already exploring how these capabilities can drive major improvements in AI-powered customer experiences across diverse industries. We invite all SageMaker users to explore this new functionality, whether you’re developing classic ML models, building generative AI applications or multi-model workflows, or tackling multi-step inference scenarios. The enhanced SDK provides the flexibility, ease of use, and scalability needed to bring your ideas to life. As AI continues to evolve, SageMaker Inference evolves with it, providing you with the tools to stay at the forefront of innovation. Start building your next-generation AI inference workflows today with the enhanced SageMaker Python SDK.


About the authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Osho Gupta is a Senior Software Developer at AWS SageMaker. He is passionate about ML infrastructure space, and is motivated to learn & advance underlying technologies that optimize Gen AI training & inference performance. In his spare time, Osho enjoys paddle boarding, hiking, traveling, and spending time with his friends & family.

Joseph Zhang is a software engineer at AWS. He started his AWS career at EC2 before eventually transitioning to SageMaker, and now works on developing GenAI-related features. Outside of work he enjoys both playing and watching sports (go Warriors!), spending time with family, and making coffee.

Gary Wang is a Software Developer at AWS SageMaker. He is passionate about AI/ML operations and building new things. In his spare time, Gary enjoys running, hiking, trying new food, and spending time with his friends and family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Vaclav Petricek is a Senior Applied Science Manager at Amazon Search, where he led teams that built Amazon Rufus and now leads science and engineering teams that work on the next generation of Natural Language Shopping. He is passionate about shipping AI experiences that make people’s lives better. Vaclav loves off-piste skiing, playing tennis, and backpacking with his wife and three children.

Wei Li is a Senior Software Dev Engineer in Amazon Search. She is passionate about Large Language Model training and inference technologies, and loves integrating these solutions into Search Infrastructure to enhance natural language shopping experiences. During her leisure time, she enjoys gardening, painting, and reading.

Brian Granger is a Senior Principal Technologist at Amazon Web Services and a professor of physics and data science at Cal Poly State University in San Luis Obispo, CA. He works at the intersection of UX design and engineering on tools for scientific computing, data science, machine learning, and data visualization. Brian is a co-founder and leader of Project Jupyter, co-founder of the Altair project for statistical visualization, and creator of the PyZMQ project for ZMQ-based message passing in Python. At AWS he is a technical and open source leader in the AI/ML organization. Brian also represents AWS as a board member of the PyTorch Foundation. He is a winner of the 2017 ACM Software System Award and the 2023 NASA Exceptional Public Achievement Medal for his work on Project Jupyter. He has a Ph.D. in theoretical physics from the University of Colorado.


AI Testing and Evaluation: Learnings from genome editing

illustration of R. Alta Charo, Kathleen Sullivan, and Daniel Kluttz for the Microsoft Research Podcast

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In this episode, Alta Charo, emerita professor of law and bioethics at the University of Wisconsin–Madison, joins Sullivan for a conversation on the evolving landscape of genome editing and its regulatory implications. Drawing on decades of experience in biotechnology policy, Charo emphasizes the importance of distinguishing between hazards and risks and describes the field’s approach to regulating applications of technology rather than the technology itself. The discussion also explores opportunities and challenges in biotech’s multi-agency oversight model and the role of international coordination. Later, Daniel Kluttz, a partner general manager in Microsoft’s Office of Responsible AI, joins Sullivan to discuss how insights from genome editing could inform more nuanced and robust governance frameworks for emerging technologies like AI.

Transcript

[MUSIC]

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan.

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use.

[MUSIC ENDS]

Today I’m excited to welcome R. Alta Charo, the Warren P. Knowles Professor Emerita of Law and Bioethics at the University of Wisconsin–Madison, to explore testing and risk assessment in genome editing.

Professor Charo has been at the forefront of biotechnology policy and governance for decades, advising former President Obama’s transition team on issues of medical research and public health, as well as serving as a senior policy advisor at the Food and Drug Administration. She consults on gene therapy and genome editing for various companies and organizations and has held positions on a number of advisory committees, including for the National Academy of Sciences. Her committee work has spanned women’s health, stem cell research, genome editing, biosecurity, and more.

After our conversation with Professor Charo, we’ll hear from Daniel Kluttz, a partner general manager in Microsoft’s Office of Responsible AI, about what these insights from biotech regulation could mean for AI governance and risk assessment and his team’s work governing sensitive AI uses and emerging technologies.

Alta, thank you so much for being here today. I’m a follower of your work and have really been looking forward to our conversation.


ALTA CHARO: It’s my pleasure. Thanks for having me.

SULLIVAN: Alta, I’d love to begin by stepping back in time a bit before you became a leading figure in bioethics and legal policy. You’ve shared that your interest in science was really inspired by your brothers’ interest in the topic and that your upbringing really helped shape your perseverance and resilience. Can you talk to us about what put you on the path to law and policy?

CHARO: Well, I think it’s true that many of us are strongly influenced by our families and certainly my family had, kind of, a science-y, techy orientation. My father was a refugee, you know, escaping the Nazis, and when he finally was able to start working in the United States, he took advantage of the G.I. Bill to learn how to repair televisions and radios, which were really just coming in in the 1950s. So he was, kind of, technically oriented.

My mother retrained from being a talented amateur artist to becoming a math teacher, and not surprisingly, both my brothers began to aim toward things like engineering and chemistry and physics. And our form of entertainment was to watch PBS or Star Trek. [LAUGHTER]

And so the interest comes from that background coupled with, in the 1960s, this enormous surge of interest in the so-called nature-versus-nurture debate about the degree to which we are destined by our biology or shaped by our environments. It was a heady debate, and one that perfectly combined the two interests in politics and science.

SULLIVAN: For listeners who are brand new to your field in genomic editing, can you give us what I’ll call a “90-second survey” of the space in perhaps plain language and why it’s important to have a framework for ensuring its responsible use.

CHARO: Well, you know, genome editing is both very old and very new. At base, what we’re talking about is a way to either delete sections of the genome, our collection of genes, or to add things or to alter what’s there. The goal is simply to be able to take what might not be healthy and make it healthy, whether it’s a plant, an animal, or a human.

Many people have compared it to a word processor, where you can edit text by swapping things in and out. You could change the letter g to the letter h in every word, and in our genomes, you can do similar kinds of things.

But because of this, we have a responsibility to make sure that whatever we change doesn’t become dangerous and that it doesn’t become socially disruptive. Now the earliest forms of genome editing were very inefficient, and so we didn’t worry that much. But with the advances that were spearheaded by people like Jennifer Doudna and Emmanuelle Charpentier, who won the Nobel Prize for their work in this area, genome editing has become much easier to do.

It’s become more efficient. It doesn’t require as much sophisticated laboratory equipment. It’s moved from being something that only a few people can do to something that we’re going to be seeing in our junior high school biology labs. And that means you have to pay attention to who’s doing it, why are they doing it, what are they releasing, if anything, into the environment, what are they trying to sell, and is it honest and is it safe?

SULLIVAN: How would you describe the risks, and are there, you know, sort of, specifically inherent risks in the technology itself, or do those risks really emerge only when it’s applied in certain contexts, like CRISPR in agriculture or CRISPR for human therapies?

CHARO: Well, to answer that, I’m going to do something that may seem a little picky, even pedantic. [LAUGHTER] But I’m going to distinguish between hazards and risks. So there are certain intrinsic hazards. That is, there are things that can go wrong.

You want to change one particular gene or one particular portion of a gene, and you might accidentally change something else, a so-called off-target effect. Or you might change something in a gene expecting a certain effect but not necessarily anticipating that there’s going to be an interaction between what you changed and what was there, a gene-gene interaction, that might have an unanticipated kind of result, a side effect essentially.

So there are some intrinsic hazards, but risk is a hazard coupled with the probability that it’s going to actually create something harmful. And that really depends upon the application.

If you are doing something that is making a change in a human being that is going to be a lifelong change, that enhances the significance of that hazard. It amplifies what I call the risk because if something goes wrong, then its consequences are greater.

It may also be that in other settings, what you’re doing is going to have a much lower risk because you’re working with a more familiar substance, your predictive power is much greater, and it’s not going into a human or an animal or into the environment. So I think that you have to say that the risk and the benefits, by the way, all are going to depend upon the particular application.

SULLIVAN: Yeah, I think on this point of application, there’s many players involved in that, right. Like, we often hear about this puzzle of who’s actually responsible for ensuring safety and a reasonable balance between risks and benefits or hazards and benefits, to quote you. Is it the scientists, the biotech companies, government agencies? And then if you could touch upon, as well, maybe how does the nature of genome editing risks … how do those responsibilities get divvied up?

CHARO: Well, in the 1980s, we had a very significant policy discussion about whether we should regulate the technology—no matter how it’s used or for whatever purpose—or if we should simply fold the technology in with all the other technologies that we currently have and regulate its applications the way we regulate applications generally. And we went for the second, the so-called coordinated framework.

So what we have in the United States is a system in which if you use genome editing in purely laboratory-based work, then you will be regulated the way we regulate laboratories.

There’s also, at most universities because of the way the government works with this, something called Institutional Biosafety Committees, IBCs. You want to do research that involves recombinant DNA and modern biotechnology, including genome editing but not limited to it, you have to go first to your IBC, and they look and see what you’re doing to decide if there’s a danger there that you have not anticipated that requires special attention.

If what you’re doing is going to get released into the environment or it’s going to be used to change an animal that’s going to be in the environment, then there are agencies that oversee the safety of our environment, predominantly the Environmental Protection Agency and the U.S. Department of Agriculture.

If you’re working with humans and you’re doing medical therapies, like you’re doing the gene therapies that just have been developed for things like sickle cell anemia, then you have to go through a very elaborate regulatory process that’s overseen by the Food and Drug Administration and also seen locally at the research stages overseen by institutional review boards that make sure the people who are being recruited into research understand what they’re getting into, that they’re the right people to be recruited, etc.

So we do have this kind of Jenga game …

SULLIVAN: [LAUGHS] Yeah, sounds like it.

CHARO: … of regulatory agencies. And on top of all that, most of this involves professionals who’ve had to be licensed in some way. There may be state laws specifically on licensing. If you are dealing with things that might cross national borders, there may be international treaties and agreements that cover this.

And, of course, the insurance industry plays a big part because they decide whether or not what you’re doing is safe enough to be insured. So all of these things come together in a way that is not at all easy to understand if you’re not, kind of, working in the field. But the bottom-line thing to remember, the way to really think about it is, we don’t regulate genome editing; we regulate the things that use genome editing.

SULLIVAN: Yeah, that makes a lot of sense. Actually, maybe just following up a little bit on this notion of a variety of different, particularly like government agencies being involved. You know, in this multi-stakeholder model, where do you see gaps today that need to be filled, some of the pros and cons to keep in mind, and, you know, just as we think about distributing these systems at a global level, like, what are some of the considerations you are keeping in mind on that front?

CHARO: Well, certainly there are times where the way the statutes were written that govern the regulation of drugs or the regulation of foods did not anticipate this tremendous capacity we now have in the area of biotechnology generally or genome editing in particular. And so you can find that there are times where it feels a little bit ambiguous, and the agencies have to figure out how to apply their existing rules.

So an example. If you’re going to make alterations in an animal, right, we have a system for regulating drugs, including veterinary drugs. But we didn’t have something that regulated genome editing of animals. But in a sense, genome editing of an animal is the same thing as using a veterinary drug. You’re trying to affect the animal’s physical constitution in some fashion.

And it took a long time within the FDA to, sort of, work out how the regulation of veterinary drugs would apply if you think about the genetic construct that’s being used to alter the animal as the same thing as injecting a chemically based drug. And on that basis, they now know here’s the regulatory path—here are the tests you have to do; here are the permissions you have to do; here’s the surveillance you have to do after it goes on the market.

Even there, sometimes, it was confusing. What happens when it’s not the kind of animal you’re thinking about when you think about animal drugs? Like, we think about pigs and dogs, but what about mosquitoes?

Because there, you’re really thinking more about pests, and if you’re editing the mosquito so that it can’t, for example, transmit dengue fever, right, it feels more like a public health thing than it is a drug for the mosquito itself, and it, kind of, fell in between the agencies that possibly had jurisdiction. And it took a while for the USDA, the Department of Agriculture, and the Food and Drug Administration to work out an agreement about how they would share this responsibility. So you do get those kinds of areas in which you have at least ambiguity.

We also have situations where frankly the fact that some things can move across national borders means you have to have a system for harmonizing or coordinating national rules. If you want to, for example, genetically engineer mosquitoes that can’t transmit dengue, mosquitoes have a tendency to fly. [LAUGHTER] And so … they can’t fly very far. That’s good. That actually makes it easier to control.

But if you’re doing work that’s right near a border, then you have to be sure that the country next to you has the same rules for whether it’s permitted to do this and how to surveil what you’ve done in order to be sure that you got the results you wanted to get and no other results. And that also is an area where we have a lot of work to be done in terms of coordinating across government borders and harmonizing our rules.

SULLIVAN: Yeah, I mean, you’ve touched on this a little bit, but there is such this striking balance between advancing technology, ensuring public safety, and sometimes, I think it feels just like you’re walking a tightrope where, you know, if we clamp down too hard, we’ll stifle innovation, and if we’re too lax, we risk some of these unintended consequences. And on a global scale like you just mentioned, as well. How has the field of genome editing found its balance?

CHARO: It’s still being worked out, frankly, but it’s finding its balance application by application. So in the United States, we have two very different approaches on regulation of things that are going to go into the market.

Some things can’t be marketed until they’ve gotten an approval from the government. So you come up with a new drug, you can’t sell that until it’s gone through FDA approval.

On the other hand, for most foods that are made up of familiar kinds of things, you can go on the market, and it’s only after they’re on the market that the FDA can act to withdraw it if a problem arises. So basically, we have either pre-market controls: you can’t go on without permission. Or post-market controls: we can take you off the market if a problem occurs.

How do we decide which one is appropriate for a particular application? It’s based on our experience. New drugs typically are both less familiar than existing things on the market and also have a higher potential for injury if they, in fact, are not effective or they are, in fact, dangerous and toxic.

If you have foods, even bioengineered foods, that are basically the same as foods that are already here, it can go on the market with notice but without a prior approval. But if you create something truly novel, then it has to go through a whole long process.

And so that is the way that we make this balance. We look at the application area. And we’re just now seeing in the Department of Agriculture a new approach on some of the animal editing, again, to try and distinguish between things that are simply a more efficient way to make a familiar kind of animal variant and those things that are genuinely novel and to have a regulatory process that is more rigid the more unfamiliar it is and the more that we see a risk associated with it.

SULLIVAN: I know we’re at the end of our time here and maybe just a quick kind of lightning-round of a question. For students, young scientists, lawyers, or maybe even entrepreneurs listening who are inspired by your work, what’s the single piece of advice you give them if they’re interested in policy, regulation, the ethical side of things in genomics or other fields?

CHARO: I’d say be a bio-optimist and read a lot of science fiction. Because it expands your imagination about what the world could be like. Is it going to be a world in which we’re now going to be growing our buildings instead of building them out of concrete?

Is it going to be a world in which our plants will glow in the evening so we don’t need to be using batteries or electrical power from other sources but instead our environment is adapting to our needs?

You know, expand your imagination with a sense of optimism about what could be and see ethics and regulation not as an obstacle but as a partner to bringing these things to fruition in a way that’s responsible and helpful to everyone.

[TRANSITION MUSIC]

SULLIVAN: Wonderful. Well, Alta, this has been just an absolute pleasure. So thank you.

CHARO: It was my pleasure. Thank you for having me.

SULLIVAN: Now, I’m happy to bring in Daniel Kluttz. As a partner general manager in Microsoft’s Office of Responsible AI, Daniel leads the group’s Sensitive Uses and Emerging Technologies program.

Daniel, it’s great to have you here. Thanks for coming in.

DANIEL KLUTTZ: It’s great to be here, Kathleen.

SULLIVAN: Yeah. So maybe before we unpack Alta Charo’s insights, I’d love to just understand the elevator pitch here. What exactly is [the] Sensitive Uses and Emerging Tech program, and what was the impetus for establishing it?

KLUTTZ: Yeah. So the Sensitive Uses and Emerging Technologies program sits within our Office of Responsible AI at Microsoft. And inherent in the name, there are two real core functions. There’s the sensitive uses and emerging technologies. What does that mean?

Sensitive uses, think of that as Microsoft’s internal consulting and oversight function for our higher-risk, most impactful AI system deployments. And so my team is a team of multidisciplinary experts who engages in sort of a white-glove-treatment sort of way with product teams at Microsoft that are designing, building, and deploying these higher-risk AI systems, and where that sort of consulting journey culminates is in a set of bespoke requirements tailored to the use case of that given system that really implement and apply our more standardized, generalized requirements that apply across the board.

Then the emerging technologies function of my team faces a little bit further out, trying to look around corners to see what new and novel and emerging risks are coming out of new AI technologies with the idea that we work with our researchers, our engineering partners, and, of course, product leaders across the company to understand where Microsoft is going with those emerging technologies, and we’re developing sort of rapid, quick-fire early-steer guidance that implements our policies ahead of that formal internal policymaking process, which can take a bit of time. So it’s designed to, sort of, both afford that innovation speed that we like to optimize for at Microsoft but also integrate our responsible AI commitments and our AI principles into emerging product development.

SULLIVAN: That segues really nicely, actually, as we met with Professor Charo and she was, you know, talking about the field of genome editing and the governing at the application level. I’d love to just understand how similar or not is that to managing the risks of AI in our world?

KLUTTZ: Yeah. I mean, Professor Charo’s comments were music to my ears because, you know, where we make our bread and butter, so to speak, in our team is in applying to use cases. AI systems, especially in this era of generative AI, are almost inherently multi-use, dual use. And so what really matters is how you’re going to apply that more general-purpose technology. Who’s going to use it? In what domain is it going to be deployed? And then tailor that oversight to those use cases. Try to be risk proportionate.

Professor Charo talked a little bit about this, but if it’s something that’s been done before and it’s just a new spin on an old thing, maybe we’re not so concerned about how closely we need to oversee and gate that application of that technology, whereas if it’s something new and novel or some new risk that might be posed by that technology, we take a little bit closer look and we are overseeing that in a more sort of high-touch way.

SULLIVAN: Maybe following up on that, I mean, how do you define sensitive use or maybe like high-impact application, and once that’s labeled, what happens? Like, what kind of steps kick in from there?

KLUTTZ: Yeah. So we have this Sensitive Uses program that’s been at Microsoft since 2019. I came to Microsoft in 2019 when we were starting this program in the Office of Responsible AI, and it had actually been incubated in Microsoft Research with our Aether community of colleagues who are experts in sociotechnical approaches to responsible AI, as well. Once we put it in the Office of Responsible AI, I came over. I came from academia. I was a researcher myself …

SULLIVAN: At Berkeley, right?

KLUTTZ: At Berkeley. That’s right. Yep. Sociologist by training and a lawyer in a past life. [LAUGHTER] But that has helped sort of bridge those fields for me.

But Sensitive Uses, we force all of our teams when they’re envisioning their system design to think about, could the reasonably foreseeable use or misuse of the system that they’re developing in practice result in three really major, sort of, risk types. One is, could that deployment result in a consequential impact on someone’s legal position or life opportunity? Another category we have is, could that foreseeable use or misuse result in significant psychological or physical injury or harm? And then the third really ties in with a longstanding commitment we’ve had to human rights at Microsoft. And so could that system in its reasonably foreseeable use or misuse result in human rights impacts and injurious consequences to folks along different dimensions of human rights?

Once you decide, we have a process to reporting that project into my office, and we will triage that project, working with the product team, for example, and our Responsible AI Champs community, which are folks who are dispersed throughout the ecosystem at Microsoft and educated in our responsible AI program, and then determine, OK, is it in scope for our program? If it is, say, OK, we’re going to go along for that ride with you, and then we get into that whole sort of consulting arrangement that then culminates in this set of bespoke use-case-based requirements applying our AI principles.

SULLIVAN: That’s super fascinating. What are some of the approaches in the governance of genome editing are you maybe seeing happening in AI governance or maybe just, like, bubbling up in conversations around it?

KLUTTZ: Yeah, I mean, I think we’ve learned a lot from fields like genome editing that Professor Charo talked about and others. And again, it gets back to this, sort of, risk-proportionate-based approach. It’s a balancing test. It’s a tradeoff of trying to, sort of, foster innovation and really look for the beneficial uses of these technologies. I appreciated her speaking about that. What are the intended uses of the system, right? And then getting to, OK, how do we balance trying to, again, foster that innovation in a very fast-moving space, a pretty complex space, and a very unsettled space contrasting to other, sort of, professional fields or technological fields that have a long history and are relatively settled from an oversight and regulatory standpoint? This one is not, and for good reason. It is still developing.

And I think, you know, there are certain oversight and policy regimes that exist today that can be applied. Professor Charo talked about this, as well, where, you know, maybe you have certain policy and oversight regimes that, depending on how the application of that technology is applied, applies there versus some horizontal, overarching regulatory sort of framework. And I think that applies from an internal governance standpoint, as well.

SULLIVAN: Yeah. It’s a great point. So what isn’t being explored from genome editing that, you know, maybe we think could be useful to AI governance, or as we think about the evolving frameworks …

KLUTTZ: Yeah.

SULLIVAN: … what maybe we should be taking into account from what Professor Charo shared with us?

KLUTTZ: So one of the things I’ve thought about and took from Professor Charo’s discussion was she had just this amazing way of framing up how genome editing regulation is done. And she said, you know, we don’t regulate genome editing; we regulate the things that use genome editing. And while it’s not a one-to-one analogy with the AI space because we do have this sort of very general model level distinction versus application layer and even platform layer distinctions, I think it’s fair to say, you know, we don’t regulate AI applications writ large. We regulate the things that use AI in a very similar way. And that’s how we think of our internal policy and oversight process at Microsoft, as well.

And maybe there are things that we regulated and oversaw internally at the first instance and the first time we saw it come through, and it graduates into more of a programmatic framework for how we manage that. So one good example of that is some of our higher-risk AI systems that we offer out of Azure at the platform level. When I say that, I mean APIs that you call that developers can then build their own applications on top of. We were really deep in evaluating and assessing mitigations on those platform systems in the first instance, but we also graduated them into what we call our Limited Access AI services program.

And some of the things that Professor Charo discussed really resonated with me. You know, she had this moment where she was mentioning how, you know, you want to know who’s using your tools and how they’re being used. And it’s the same concepts. We want to have trust in our customers, we want to understand their use cases, and we want to apply technical controls that, sort of, force those use cases or give us signal post-deployment that use cases are being done in a way that may give us some level of concern, to reach out and understand what those use cases are.

SULLIVAN: Yeah, you’re hitting on a great point. And I love this kind of layered approach that we’re taking and that Alta highlighted, as well. Maybe to double-click a little bit just on that post-market control and what we’re tracking, kind of, once things are out and being used by our customers. How do we take some of that deployment data and bring it back in to maybe even better inform upfront governance or just how we think about some of the frameworks that we’re operating in?

KLUTTZ: It’s a great question. The number one thing is for us at Microsoft, we want to know the voice of our customer. We want our customers to talk to us. We don’t want to just understand telemetry and data. But it’s really getting out there and understanding from our customers and not just our customers. I would say our stakeholders is maybe a better term because that includes civil society organizations. It includes governments. It includes all of these non, sort of, customer actors that we care about and that we’re trying to sort of optimize for, as well. It includes end users of our enterprise customers. If we can gather data about how our products are being used and trying to understand maybe areas that we didn’t foresee how customers or users might be using those things, and then we can tune those systems to better align with what both customers and users want but also our own AI principles and policies and programs.

SULLIVAN: Daniel, before coming to Microsoft, you led social science research and sociotechnical applications of AI-driven tech at Berkeley. What do you think some of the biggest challenges are in defining and maybe even just, kind of, measuring at, like, a societal level some of the impacts of AI more broadly?

KLUTTZ: Measuring social phenomenon is a difficult thing. And one of the things that, as social scientists, you’re very interested in is scientifically observing and measuring social phenomena. Well, that sounds great. It sounds also very high level and jargony. What do we mean by that? You know, it’s very easy to say that you’re collecting data and you’re measuring, I don’t know, trust in AI, right? That’s a very fuzzy concept.

SULLIVAN: Right. Definitely.

KLUTTZ: It is a concept that we want to get to, but we have to unpack that, and we have to develop what we call measurable constructs. What are the things that we might observe that could give us an indication toward what is a very fuzzy and general concept. And there’s challenges with that everywhere. And I’m extremely fortunate to work at Microsoft with some of the world’s leading sociotechnical researchers and some of these folks who are thinking about—you know, very steeped in measurement theory, literally PhDs in these fields—how to both measure and allow for a scalable way to do that at a place the size of Microsoft. And that is trying to develop frameworks that are scalable and repeatable and put into our platform that then serves our product teams. Are we providing, as a platform, a service to those product teams that they can plug in and do their automated evaluations at scale as much as possible and then go back in over the top and do some of your more qualitative targeted testing and evaluations.

SULLIVAN: Yeah, makes a lot of sense. Before we close out, if you’re game for it, maybe we do a quick lightning round. Just 30-second answers here. Favorite real-world sensitive use case you’ve ever reviewed.

KLUTTZ: Oh gosh. Wow, this is where I get to be the social scientist.

SULLIVAN: [LAUGHS] Yes.

KLUTTZ: It’s like, define favorite, Kathleen. [LAUGHS] Most memorable, most painful.

SULLIVAN: Let’s do most memorable.

KLUTTZ: We’ll do most memorable.

SULLIVAN: Yeah.

KLUTTZ: You know, I would say the most memorable project I worked on was when we rolled out the new Bing Chat, which is no longer called Bing Chat, because that was the first really big cross-company effort to deploy GPT-4, which was, you know, the next step up in AI innovation from our partners at OpenAI. And I really value working hand in hand with engineering teams and with researchers and that was us at our best and really sort of turbocharged the model that we have.

SULLIVAN: Wonderful. What’s one of the most overused phrases that you have in your AI governance meetings?

KLUTTZ: Gosh. [LAUGHS] If I hear “We need to get aligned; we need to align on this more” …

SULLIVAN: [LAUGHS] Right.

KLUTTZ: But, you know, it’s said for a reason. And I think it sort of speaks to that clever nature. That’s one that comes to mind.

SULLIVAN: That’s great. And then maybe, maybe last one. What are you most excited about in the next, I don’t know, let’s say three months? This world is moving so fast!

KLUTTZ: You know, the pace of innovation, as you just said, is just staggering. It is unbelievable. And sometimes it can feel overwhelming in my space. But what I am most excited about is how we are building up this Emerging … I mentioned this Emerging Technologies program in my team as a, sort of, formal program is relatively new. And I really enjoy being able to take a step back and think a little bit more about the future and a little bit more holistically. And I love working with engineering teams and sort of strategic visionaries who are thinking about what we’re doing a year from now or five years from now, or even 10 years from now, and I get to be a part of those conversations. And that really gives me energy and helps me … helps keep me grounded and not just dealing with the day to day, and, you know, various fire drills that you may run. It’s thinking strategically and having that foresight about what’s to come. And it’s exciting.

SULLIVAN: Great. Well, Daniel, just thanks so much for being here. I had such a wonderful discussion with you, and I think the thoughtfulness in our discussion today I hope resonates with our listeners. And again, thanks to Alta for setting the stage and sharing her really amazing, insightful thoughts here, as well. So thank you.

[MUSIC]

KLUTTZ: Thank you, Kathleen. I appreciate it. It’s been fun.

SULLIVAN: And to our listeners, thanks for tuning in. You can find resources related to this podcast in the show notes. And if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI.

See you next time! 

[MUSIC FADES]

The post AI Testing and Evaluation: Learnings from genome editing appeared first on Microsoft Research.

Read More

Context extraction from image files in Amazon Q Business using LLMs

Context extraction from image files in Amazon Q Business using LLMs

To effectively convey complex information, organizations increasingly rely on visual documentation through diagrams, charts, and technical illustrations. Although text documents are well-integrated into modern knowledge management systems, the rich information contained in diagrams, charts, technical schematics, and visual documentation often remains inaccessible to search and AI assistants. This creates significant gaps in organizational knowledge bases, forcing teams to interpret visual data manually and preventing automation systems from using critical visual information for comprehensive insights and decision-making. While Amazon Q Business already handles embedded images within documents, the custom document enrichment (CDE) feature extends these capabilities significantly by processing standalone image files (for example, JPGs and PNGs).

In this post, we look at a step-by-step implementation for using the CDE feature within an Amazon Q Business application. We walk you through an AWS Lambda function configured within CDE to process various image file types, and we showcase an example scenario of how this integration enhances the ability of Amazon Q Business to provide comprehensive insights. By following this practical guide, you can significantly expand your organization’s searchable knowledge base, enabling more complete answers and insights that incorporate both textual and visual information sources.

Example scenario: Analyzing regional educational demographics

Consider a scenario where you’re working for a national educational consultancy that has charts, graphs, and demographic data across different AWS Regions stored in an Amazon Simple Storage Service (Amazon S3) bucket. The following image shows student distribution by age range across various cities using a bar chart. The insights in visualizations like this are valuable for decision-making but traditionally locked within image formats in your S3 buckets and other storage.

With Amazon Q Business and CDE, we show you how to enable natural language queries against such visualizations. For example, your team could ask questions such as “Which city has the highest number of students in the 13–15 age range?” or “Compare the student demographics between City 1 and City 4” directly through the Amazon Q Business application interface.

Distribution Chart

You can bridge this gap using the Amazon Q Business CDE feature to:

  1. Detect and process image files during the document ingestion process
  2. Use Amazon Bedrock with AWS Lambda to interpret the visual information
  3. Extract structured data and insights from charts and graphs
  4. Make this information searchable using natural language queries

Solution overview

In this solution, we walk you through how to implement a CDE-based solution for your educational demographic data visualizations. The solution empowers organizations to extract meaningful information from image files using the CDE capability of Amazon Q Business. When Amazon Q Business encounters the S3 path during ingestion, CDE rules automatically trigger a Lambda function. The Lambda function identifies the image files and calls the Amazon Bedrock API, which uses multimodal large language models (LLMs) to analyze and extract contextual information from each image. The extracted text is then seamlessly integrated into the knowledge base in Amazon Q Business. End users can then quickly search for valuable data and insights from images based on their actual context. By bridging the gap between visual content and searchable text, this solution helps organizations unlock valuable insights previously hidden within their image repositories.

The following figure shows the high-level architecture diagram used for this solution.

Arch Diagram

For this use case, we use Amazon S3 as our data source. However, this same solution is adaptable to other data source types supported by Amazon Q Business, or it can be implemented with custom data sources as needed.

To complete the solution, follow these high-level implementation steps:

  1. Create an Amazon Q Business application and sync with an S3 bucket.
  2. Configure the Amazon Q Business application CDE for the Amazon S3 data source.
  3. Extract context from the images.

Prerequisites

The following prerequisites are needed for implementation:

  1. An AWS account.
  2. At least one Amazon Q Business Pro user that has admin permissions to set up and configure Amazon Q Business. For pricing information, refer to Amazon Q Business pricing.
  3. AWS Identity and Access Management (IAM) permissions to create and manage IAM roles and policies.
  4. A supported data source to connect, such as an S3 bucket containing your public documents.
  5. Access to an Amazon Bedrock LLM in the required AWS Region.

Create an Amazon Q Business application and sync with an S3 bucket

To create an Amazon Q Business application and connect it to your S3 bucket, complete the following steps. These steps provide a general overview of how to create an Amazon Q Business application and synchronize it with an S3 bucket. For more comprehensive, step-by-step guidance, follow the detailed instructions in the blog post Discover insights from Amazon S3 with Amazon Q S3 connector.

  1. Initiate your application setup through either the AWS Management Console or AWS Command Line Interface (AWS CLI).
  2. Create an index for your Amazon Q Business application.
  3. Use the built-in Amazon S3 connector to link your application with documents stored in your organization’s S3 buckets.
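If you prefer to script the first two steps, the following minimal sketch uses the AWS SDK for Python (Boto3) to create the application and index. It is an illustrative outline only: the qbusiness parameter names shown are assumptions to confirm against the current API reference, additional identity and IAM role settings are typically required in practice, and the S3 connector itself is easiest to add by following the linked blog post.

import boto3

qbusiness = boto3.client('qbusiness')

# Create the Amazon Q Business application (display name is illustrative)
app = qbusiness.create_application(displayName='image-insights-app')

# Create an index to hold the documents ingested from the S3 connector
index = qbusiness.create_index(
    applicationId=app['applicationId'],
    displayName='image-insights-index'
)

print(app['applicationId'], index['indexId'])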

Configure the Amazon Q Business application CDE for the Amazon S3 data source

With the CDE feature of Amazon Q Business, you can make the most of your Amazon S3 data sources by using its capabilities to modify, enhance, and filter documents during the ingestion process, ultimately making enterprise content more discoverable and valuable. When connecting Amazon Q Business to S3 repositories, you can use CDE to seamlessly transform your raw data, applying modifications that significantly improve search quality and information accessibility. This functionality extends to extracting context from binary files such as images through integration with Amazon Bedrock, enabling organizations to unlock insights from previously inaccessible content formats. By implementing CDE for Amazon S3 data sources, businesses can maximize the utility of their enterprise data within Amazon Q Business, creating a more comprehensive and intelligent knowledge base that responds effectively to user queries.

To configure the Amazon Q Business application CDE for the Amazon S3 data source, complete the following steps:

  1. Select your application and navigate to Data sources.
  2. Choose your existing Amazon S3 data source or create a new one. Verify that Audio/Video under Multi-media content configuration is not enabled.
  3. In the data source configuration, locate the Custom Document Enrichment section.
  4. Configure the pre-extraction rules to trigger a Lambda function when specific S3 bucket conditions are satisfied. Check the following screenshot for an example configuration.

Reference Settings
Pre-extraction rules are executed before Amazon Q Business processes files from your S3 bucket.

Extract context from the images

To extract insights from an image file, the Lambda function makes an Amazon Bedrock API call using Anthropic’s Claude 3.7 Sonnet model. You can modify the code to use other Amazon Bedrock models based on your use case.

Constructing the prompt is a critical piece of the code. We recommend trying various prompts to get the desired output for your use case. Amazon Bedrock also offers prompt optimization, which you can use to enhance your use case-specific input.

Examine the following Lambda function code snippets, written in Python, to understand the Amazon Bedrock model setup along with a sample prompt to extract insights from an image.

In the following code snippet, we start by importing the relevant Python libraries, defining constants, and initializing AWS SDK for Python (Boto3) clients for Amazon S3 and the Amazon Bedrock runtime. For more information, refer to the Boto3 documentation.

import boto3
import logging
import json
from typing import List, Dict, Any
from botocore.config import Config

MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
MAX_TOKENS = 2000
MAX_RETRIES = 2
FILE_FORMATS = ("jpg", "jpeg", "png")

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')
bedrock = boto3.client('bedrock-runtime', config=Config(read_timeout=3600, region_name='us-east-1'))

The prompt passed to the Amazon Bedrock model, Anthropic’s Claude 3.7 Sonnet in this case, is broken into two parts: prompt_prefix and prompt_suffix. The prompt breakdown makes it more readable and manageable. Additionally, the Amazon Bedrock prompt caching feature can be used to reduce response latency as well as input token cost. You can modify the prompt to extract information based on your specific use case as needed.

prompt_prefix = """You are an expert image reader tasked with generating detailed descriptions for various """
"""types of images. These images may include technical diagrams,"""
""" graphs and charts, categorization diagrams, data flow and process flow diagrams,"""
""" hierarchical and timeline diagrams, infographics, """
"""screenshots and product diagrams/images from user manuals. """
""" The description of these images needs to be very detailed so that user can ask """
""" questions based on the image, which can be answered by only looking at the descriptions """
""" that you generate.
Here is the image you need to analyze:

<image>
"""

prompt_suffix = """
</image>

Please follow these steps to analyze the image and generate a comprehensive description:

1. Image type: Classify the image as one of technical diagrams, graphs and charts, categorization diagrams, data flow and process flow diagrams, hierarchical and timeline diagrams, infographics, screenshots and product diagrams/images from user manuals, or other.

2. Items:
   Carefully examine the image and extract all entities, texts, and numbers present. List these elements in <image_items> tags.

3. Detailed Description:
   Using the information from the previous steps, provide a detailed description of the image. This should include the type of diagram or chart, its main purpose, and how the various elements interact or relate to each other.  Capture all the crucial details that can be used to answer any followup questions. Write this description in <image_description> tags.

4. Data Estimation (for charts and graphs only):
   If the image is a chart or graph, capture the data in the image in CSV format to be able to recreate the image from the data. Ensure your response captures all relevant details from the chart that might be necessary to answer any follow up questions from the chart.
   If exact values cannot be inferred, provide an estimated range for each value in <estimation> tags.
   If no data is present, respond with "No data found".

Present your analysis in the following format:

<analysis>
<image_type>
[Classify the image type here]
</image_type>

<image_items>
[List all extracted entities, texts, and numbers here]
</image_items>

<image_description>
[Provide a detailed description of the image here]
</image_description>

<data>
[If applicable, provide estimated number ranges for chart elements here]
</data>
</analysis>

Remember to be thorough and precise in your analysis. If you're unsure about any aspect of the image, state your uncertainty clearly in the relevant section.
"""

The lambda_handler function is the main entry point for the Lambda function. When CDE invokes the function, it passes the data source’s information in the event object. In this case, the S3 bucket and the S3 object key are retrieved from the event object, and the file format is derived from the object key. Further processing happens only if the file_format matches the expected file types. For production-ready code, implement proper error handling for unexpected errors.

def lambda_handler(event, context):
    logger.info("Received event: %s" % json.dumps(event))
    s3Bucket = event.get("s3Bucket")
    s3ObjectKey = event.get("s3ObjectKey")
    metadata = event.get("metadata")
    file_format = s3ObjectKey.lower().split('.')[-1]
    new_key = 'cde_output/' + s3ObjectKey + '.txt'
    if (file_format in FILE_FORMATS):
        afterCDE = generate_image_description(s3Bucket, s3ObjectKey, file_format)
        s3.put_object(Bucket = s3Bucket, Key = new_key, Body=afterCDE)
    return {
        "version" : "v0",
        "s3ObjectKey": new_key,
        "metadataUpdates": []
    }
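To make the expected input shape concrete, here is a minimal sketch of invoking the handler locally with the fields the code reads. The bucket name and object key are hypothetical placeholders; in a real deployment, CDE supplies the event and a Lambda context object.

# Hypothetical CDE-style event for local testing (bucket and key are placeholders)
sample_event = {
    "s3Bucket": "example-demographics-bucket",
    "s3ObjectKey": "charts/student_age_distribution.png",
    "metadata": {}
}

# Invoke the handler directly; in production, CDE supplies the event and context
result = lambda_handler(sample_event, None)
print(result["s3ObjectKey"])  # cde_output/charts/student_age_distribution.png.txt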

The generate_image_description function calls two other functions: first to construct the message that is passed to the Amazon Bedrock model and second to invoke the model. It returns the final text output extracted from the image file by the model invocation.

def generate_image_description(s3Bucket: str, s3ObjectKey: str, file_format: str) -> str:
    """
    Generate a description for an image stored in Amazon S3.
    Inputs:
        s3Bucket: str - Name of the S3 bucket containing the image
        s3ObjectKey: str - Key of the image object
        file_format: str - Image file extension (for example, png or jpg)
    Output:
        str - Generated image description
    """
    messages = _llm_input(s3Bucket, s3ObjectKey, file_format)
    response = _invoke_model(messages)
    return response['output']['message']['content'][0]['text']

The _llm_input function takes in the S3 object’s details along with the file type and builds the message in the format expected by the model invoked through Amazon Bedrock (mapping the jpg extension to the jpeg format name used by the Converse API).

def _llm_input(s3Bucket: str, s3ObjectKey: str, file_format: str) -> List[Dict[str, Any]]:
    s3_response = s3.get_object(Bucket = s3Bucket, Key = s3ObjectKey)
    image_content = s3_response['Body'].read()
    # The Converse API image block expects "jpeg" rather than the "jpg" file extension
    image_format = "jpeg" if file_format == "jpg" else file_format
    message = {
        "role": "user",
        "content": [
            {"text": prompt_prefix},
            {
                "image": {
                    "format": image_format,
                    "source": {
                        "bytes": image_content
                    }
                }
            },
            {"text": prompt_suffix}
        ]
    }
    return [message]
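As mentioned earlier, Amazon Bedrock prompt caching can reduce response latency and input token cost for the static portion of the prompt. The following variant of _llm_input is a minimal sketch of marking the reusable prefix as a cache checkpoint; it assumes prompt caching is available for the chosen model in your Region, and the cachePoint block shape should be verified against the Converse API documentation.

def _llm_input_cached(s3Bucket: str, s3ObjectKey: str, file_format: str) -> List[Dict[str, Any]]:
    """Variant of _llm_input that adds a cache checkpoint after the static prompt prefix."""
    s3_response = s3.get_object(Bucket=s3Bucket, Key=s3ObjectKey)
    image_content = s3_response['Body'].read()
    image_format = "jpeg" if file_format == "jpg" else file_format
    message = {
        "role": "user",
        "content": [
            {"text": prompt_prefix},
            {"cachePoint": {"type": "default"}},  # assumed Converse API cache checkpoint shape
            {"image": {"format": image_format, "source": {"bytes": image_content}}},
            {"text": prompt_suffix}
        ]
    }
    return [message]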

The _invoke_model function calls the Converse API using the Amazon Bedrock runtime client, which returns the response generated by the model. Within inferenceConfig, maxTokens limits the length of the response, and a temperature of 0 makes the response more deterministic (less random).

def _invoke_model(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Call the Bedrock model with retry logic.
    Input:
        messages: List[Dict[str, Any]] - Prepared messages for the model
    Output:
        Dict[str, Any] - Model response
    """
    for attempt in range(MAX_RETRIES):
        try:
            response = bedrock.converse(
                modelId=MODEL_ID,
                messages=messages,
                inferenceConfig={
                    "maxTokens": MAX_TOKENS,
                    "temperature": 0,
                }
            )
            return response
        except Exception as e:
            print(e)
    
    raise Exception(f"Failed to call model after {MAX_RETRIES} attempts")

Putting all the preceding code pieces together, the full Lambda function code is shown in the following block:

# Example Lambda function for image processing
import boto3
import logging
import json
from typing import List, Dict, Any
from botocore.config import Config

MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
MAX_TOKENS = 2000
MAX_RETRIES = 2
FILE_FORMATS = ("jpg", "jpeg", "png")

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')
bedrock = boto3.client('bedrock-runtime', config=Config(read_timeout=3600, region_name='us-east-1'))

prompt_prefix = """You are an expert image reader tasked with generating detailed descriptions for various """
"""types of images. These images may include technical diagrams,"""
""" graphs and charts, categorization diagrams, data flow and process flow diagrams,"""
""" hierarchical and timeline diagrams, infographics, """
"""screenshots and product diagrams/images from user manuals. """
""" The description of these images needs to be very detailed so that user can ask """
""" questions based on the image, which can be answered by only looking at the descriptions """
""" that you generate.
Here is the image you need to analyze:

<image>
"""

prompt_suffix = """
</image>

Please follow these steps to analyze the image and generate a comprehensive description:

1. Image type: Classify the image as one of technical diagrams, graphs and charts, categorization diagrams, data flow and process flow diagrams, hierarchical and timeline diagrams, infographics, screenshots and product diagrams/images from user manuals, or other.

2. Items:
   Carefully examine the image and extract all entities, texts, and numbers present. List these elements in <image_items> tags.

3. Detailed Description:
   Using the information from the previous steps, provide a detailed description of the image. This should include the type of diagram or chart, its main purpose, and how the various elements interact or relate to each other.  Capture all the crucial details that can be used to answer any followup questions. Write this description in <image_description> tags.

4. Data Estimation (for charts and graphs only):
   If the image is a chart or graph, capture the data in the image in CSV format to be able to recreate the image from the data. Ensure your response captures all relevant details from the chart that might be necessary to answer any follow up questions from the chart.
   If exact values cannot be inferred, provide an estimated range for each value in <estimation> tags.
   If no data is present, respond with "No data found".

Present your analysis in the following format:

<analysis>
<image_type>
[Classify the image type here]
</image_type>

<image_items>
[List all extracted entities, texts, and numbers here]
</image_items>

<image_description>
[Provide a detailed description of the image here]
</image_description>

<data>
[If applicable, provide estimated number ranges for chart elements here]
</data>
</analysis>

Remember to be thorough and precise in your analysis. If you're unsure about any aspect of the image, state your uncertainty clearly in the relevant section.
"""

def _llm_input(s3Bucket: str, s3ObjectKey: str, file_format: str) -> List[Dict[str, Any]]:
    s3_response = s3.get_object(Bucket = s3Bucket, Key = s3ObjectKey)
    image_content = s3_response['Body'].read()
    # The Converse API image block expects "jpeg" rather than the "jpg" file extension
    image_format = "jpeg" if file_format == "jpg" else file_format
    message = {
        "role": "user",
        "content": [
            {"text": prompt_prefix},
            {
                "image": {
                    "format": image_format,
                    "source": {
                        "bytes": image_content
                    }
                }
            },
            {"text": prompt_suffix}
        ]
    }
    return [message]

def _invoke_model(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """
    Call the Bedrock model with retry logic.
    Input:
        messages: List[Dict[str, Any]] - Prepared messages for the model
    Output:
        Dict[str, Any] - Model response
    """
    for attempt in range(MAX_RETRIES):
        try:
            response = bedrock.converse(
                modelId=MODEL_ID,
                messages=messages,
                inferenceConfig={
                    "maxTokens": MAX_TOKENS,
                    "temperature": 0,
                }
            )
            return response
        except Exception as e:
            print(e)
    
    raise Exception(f"Failed to call model after {MAX_RETRIES} attempts")

def generate_image_description(s3Bucket: str, s3ObjectKey: str, file_format: str) -> str:
    """
    Generate a description for an image stored in Amazon S3.
    Inputs:
        s3Bucket: str - Name of the S3 bucket containing the image
        s3ObjectKey: str - Key of the image object
        file_format: str - Image file extension (for example, png or jpg)
    Output:
        str - Generated image description
    """
    messages = _llm_input(s3Bucket, s3ObjectKey, file_format)
    response = _invoke_model(messages)
    return response['output']['message']['content'][0]['text']

def lambda_handler(event, context):
    logger.info("Received event: %s" % json.dumps(event))
    s3Bucket = event.get("s3Bucket")
    s3ObjectKey = event.get("s3ObjectKey")
    metadata = event.get("metadata")
    file_format = s3ObjectKey.lower().split('.')[-1]
    new_key = 'cde_output/' + s3ObjectKey + '.txt'
    if (file_format in FILE_FORMATS):
        afterCDE = generate_image_description(s3Bucket, s3ObjectKey, file_format)
        s3.put_object(Bucket = s3Bucket, Key = new_key, Body=afterCDE)
    return {
        "version" : "v0",
        "s3ObjectKey": new_key,
        "metadataUpdates": []
    }

We strongly recommend testing and validating code in a nonproduction environment before deploying it to production. In addition to Amazon Q pricing, this solution will incur charges for AWS Lambda and Amazon Bedrock. For more information, refer to AWS Lambda pricing and Amazon Bedrock pricing.

After the Amazon S3 data is synced with the Amazon Q index, you can prompt the Amazon Q Business application to get the extracted insights as shown in the following section.

Example prompts and results

The following question and answer pairs refer to the Student Age Distribution graph at the beginning of this post.

Q: Which City has the highest number of students in the 13-15 age range?

Natural Language Query Response

Q: Compare the student demographics between City 1 and City 4?

Natural Language Query Response

In the original graph, the bars representing student counts lacked explicit numerical labels, which could make precise data interpretation challenging. However, with Amazon Q Business and its integration capabilities, this limitation can be overcome. By using Amazon Q Business with Amazon Bedrock LLMs through the CDE feature to process these visualizations, we’ve enabled a more interactive and insightful analysis experience. The service effectively extracts the contextual information embedded in the graph, even when explicit labels are absent. This powerful combination means that end users can ask questions about the visualization and receive responses based on the underlying data. Rather than being limited by what’s explicitly labeled in the graph, users can now explore deeper insights through natural language queries. This capability demonstrates how Amazon Q Business transforms static visualizations into queryable knowledge assets, enhancing the value of your existing data visualizations without requiring additional formatting or preparation work.

Best practices for Amazon S3 CDE configuration

When setting up CDE for your Amazon S3 data source, consider these best practices:

  • Use conditional rules to only process specific file types that need transformation.
  • Monitor Lambda execution with Amazon CloudWatch to track processing errors and performance.
  • Set appropriate timeout values for your Lambda functions, especially when processing large files (see the sketch after this list).
  • Consider incremental syncing to process only new or modified documents in your S3 bucket.
  • Use document attributes to track which documents have been processed by CDE.
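For the monitoring and timeout recommendations above, the following minimal Boto3 sketch raises the function timeout and memory and creates a CloudWatch alarm on Lambda errors. The function name, alarm name, and thresholds are hypothetical examples to adapt to your environment.

import boto3

lambda_client = boto3.client('lambda')
cloudwatch = boto3.client('cloudwatch')

FUNCTION_NAME = 'cde-image-enrichment'  # hypothetical function name

# Allow more time and memory for large image files
lambda_client.update_function_configuration(
    FunctionName=FUNCTION_NAME,
    Timeout=300,       # seconds
    MemorySize=1024    # MB
)

# Alarm when the function reports any errors in a 5-minute period
cloudwatch.put_metric_alarm(
    AlarmName=f'{FUNCTION_NAME}-errors',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': FUNCTION_NAME}],
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='notBreaching'
)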

Cleanup

Complete the following steps to clean up your resources:

  1. Go to the Amazon Q Business application and select Remove and unsubscribe for users and groups.
  2. Delete the Amazon Q Business application.
  3. Delete the Lambda function.
  4. Empty and delete the S3 bucket. For instructions, refer to Deleting a general purpose bucket.

Conclusion

This solution demonstrates how combining Amazon Q Business, custom document enrichment, and Amazon Bedrock can transform static visualizations into queryable knowledge assets, significantly enhancing the value of existing data visualizations without additional formatting work. By using these powerful AWS services together, organizations can bridge the gap between visual information and actionable insights, enabling users to interact with different file types in more intuitive ways.

Explore What is Amazon Q Business? and Getting started with Amazon Bedrock in the documentation to implement this solution for your specific use cases and unlock the potential of your visual data.

About the Authors

Amit Chaudhary Amit Chaudhary is a Senior Solutions Architect at Amazon Web Services. His focus area is AI/ML, and he helps customers with generative AI, large language models, and prompt engineering. Outside of work, Amit enjoys spending time with his family.

Nikhil Jha Nikhil Jha is a Senior Technical Account Manager at Amazon Web Services. His focus areas include AI/ML, building Generative AI resources, and analytics. In his spare time, he enjoys exploring the outdoors with his family.

Read More

Build AWS architecture diagrams using Amazon Q CLI and MCP

Build AWS architecture diagrams using Amazon Q CLI and MCP

Creating professional AWS architecture diagrams is a fundamental task for solutions architects, developers, and technical teams. These diagrams serve as essential communication tools for stakeholders, documentation of compliance requirements, and blueprints for implementation teams. However, traditional diagramming approaches present several challenges:

  • Time-consuming process – Creating detailed architecture diagrams manually can take hours or even days
  • Steep learning curve – Learning specialized diagramming tools requires significant investment
  • Inconsistent styling – Maintaining visual consistency across multiple diagrams is difficult
  • Outdated AWS icons – Keeping up with the latest AWS service icons and best practices is challenging
  • Difficult maintenance – Updating diagrams as architectures evolve can become increasingly burdensome

Amazon Q Developer CLI with the Model Context Protocol (MCP) offers a streamlined approach to creating AWS architecture diagrams. By using generative AI through natural language prompts, architects can now generate professional diagrams in minutes rather than hours, while adhering to AWS best practices.

In this post, we explore how to use Amazon Q Developer CLI with the AWS Diagram MCP and the AWS Documentation MCP servers to create sophisticated architecture diagrams that follow AWS best practices. We discuss techniques for basic diagrams and real-world diagrams, with detailed examples and step-by-step instructions.

Solution overview

Amazon Q Developer CLI is a command line interface that brings the generative AI capabilities of Amazon Q directly to your terminal. Developers can interact with Amazon Q through natural language prompts, making it an invaluable tool for various development tasks.

Developed by Anthropic as an open protocol, the Model Context Protocol (MCP) provides a standardized way to connect AI models to virtually any data source or tool. Using a client-server architecture (as illustrated in the following diagram), the MCP helps developers expose their data through lightweight MCP servers while building AI applications as MCP clients that connect to these servers.

The MCP uses a client-server architecture containing the following components:

  • Host – A program or AI tool that requires access to data through the MCP protocol, such as Anthropic’s Claude Desktop, an integrated development environment (IDE), AWS MCP CLI, or other AI applications
  • Client – Protocol clients that maintain one-to-one connections with servers
  • Server – Lightweight programs that expose capabilities or tools through the standardized MCP
  • Data sources – Local data sources such as databases and file systems, or external systems available over the internet through APIs (web APIs) that MCP servers can connect with

mcp-architectinfo

As announced in April 2025, MCP enables Amazon Q Developer to connect with specialized servers that extend its capabilities beyond what’s possible with the base model alone. MCP servers act as plugins for Amazon Q, providing domain-specific knowledge and functionality. The AWS Diagram MCP server specifically enables Amazon Q to generate architecture diagrams using the Python diagrams package, with access to the complete AWS icon set and architectural best practices.

Prerequisites

To implement this solution, you must have an AWS account with appropriate permissions and follow the steps below.

Set up your environment

Before you can start creating diagrams, you need to set up your environment with Amazon Q CLI, the AWS Diagram MCP server, and AWS Documentation MCP server. This section provides detailed instructions for installation and configuration.

Install Amazon Q Developer CLI

Amazon Q Developer CLI is available as a standalone installation. Complete the following steps to install it:

  1. Download and install Amazon Q Developer CLI. For instructions, see Using Amazon Q Developer on the command line.
  2. Verify the installation by running the following command: q --version
    You should see output similar to the following: Amazon Q Developer CLI version 1.x.x
  3. Configure Amazon Q CLI with your AWS credentials: q login
  4. Choose the login method suitable for you.

Set up MCP servers

Complete the following steps to set up your MCP servers:

  1. Install uv using the following command: pip install uv
  2. Install Python 3.10 or newer: uv python install 3.10
  3. Install GraphViz for your operating system.
  4. Add the servers to your ~/.aws/amazonq/mcp.json file:
{
  "mcpServers": {
    "awslabs.aws-diagram-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-diagram-mcp-server"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    },
    "awslabs.aws-documentation-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    }
  }
}

Now, Amazon Q CLI automatically discovers MCP servers in the ~/.aws/amazonq/mcp.json file.

Understanding MCP server tools

The AWS Diagram MCP server provides several powerful tools:

  • list_icons – Lists available icons from the diagrams package, organized by provider and service category
  • get_diagram_examples – Provides example code for different types of diagrams (AWS, sequence, flow, class, and others)
  • generate_diagram – Creates a diagram from Python code using the diagrams package

The AWS Documentation MCP server provides the following useful tools:

  • search_documentation – Searches AWS documentation using the official AWS Documentation Search API
  • read_documentation – Fetches and converts AWS documentation pages to markdown format
  • recommend – Gets content recommendations for AWS documentation pages

These tools work together to help you create accurate architecture diagrams that follow AWS best practices.

Test your setup

Let’s verify that everything is working correctly by generating a simple diagram:

  1. Start the Amazon Q CLI chat interface and verify the output shows the MCP servers being loaded and initialized: q chat
  2. In the chat interface, enter the following prompt:
    Please create a diagram showing an EC2 instance in a VPC connecting to an external S3 bucket. Include essential networking components (VPC, subnets, Internet Gateway, Route Table), security elements (Security Groups, NACLs), and clearly mark the connection between EC2 and S3. Label everything appropriately concisely and indicate that all resources are in the us-east-1 region. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.
  3. Amazon Q CLI will ask you to trust the tool that is being used; enter t to trust it. Amazon Q CLI will generate and display a simple diagram showing the requested architecture. Your diagram should look similar to the following screenshot, though there might be variations in layout, styling, or specific details because it’s created using generative AI. The core architectural components and relationships will be represented, but the exact visual presentation might differ slightly with each generation.

    If you see the diagram, your environment is set up correctly. If you encounter issues, verify that Amazon Q CLI can access the MCP servers by making sure you installed the necessary tools and the servers are in the ~/.aws/amazonq/mcp.json file.

Configuration options

The AWS Diagram MCP server supports several configuration options to customize your diagramming experience:

  • Output directory – By default, diagrams are saved in a generated-diagrams directory in your current working directory. You can specify a different location in your prompts.
  • Diagram format – The default output format is PNG, but you can request other formats like SVG in your prompts.
  • Styling options – You can specify colors, shapes, and other styling elements in your prompts.

Now that our environment is set up, let’s create more diagrams.

Create AWS architecture diagrams

In this section, we walk through the process of creating multiple AWS architecture diagrams using Amazon Q CLI with the AWS Diagram MCP server and the AWS Documentation MCP server to make sure our requirements follow best practices.

When you provide a prompt to Amazon Q CLI, the AWS Diagram and Documentation MCP servers complete the following steps:

  1. Interpret your requirements.
  2. Check for best practices on the AWS documentation.
  3. Generate Python code using the diagrams package.
  4. Execute the code to create the diagram.
  5. Return the diagram as an image.

This process happens seamlessly, so you can focus on describing what you want rather than how to create it.

AWS architecture diagrams typically include the following components:

  • Nodes – AWS services and resources
  • Edges – Connections between nodes showing relationships or data flow
  • Clusters – Logical groupings of nodes, such as virtual private clouds (VPCs), subnets, and Availability Zones
  • Labels – Text descriptions for nodes and connections
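The AWS Diagram MCP server generates and runs this kind of code for you, but it helps to see what the underlying diagrams package output looks like. The following minimal sketch, with illustrative service choices, names, and labels, shows nodes, clusters, and a labeled edge; running it locally requires the diagrams package and the GraphViz installation described earlier.

from diagrams import Diagram, Cluster, Edge
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

# Nodes are AWS services, clusters group related nodes, and edges show data flow
with Diagram("Simple Web Application", filename="simple_web_app", show=False):
    lb = ELB("Application Load Balancer")

    with Cluster("VPC"):
        with Cluster("Private subnets"):
            web_tier = [EC2("web-1"), EC2("web-2")]
        db = RDS("app-db")

    lb >> web_tier                        # traffic from the load balancer to the web tier
    web_tier >> Edge(label="SQL") >> db   # labeled edge from the web tier to the database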

Example 1: Create a web application architecture

Let’s create a diagram for a simple web application hosted on AWS. Enter the following prompt:

Create a diagram for a simple web application with an Application Load Balancer, two EC2 instances, and an RDS database. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram

After you enter your prompt, Amazon Q CLI will search AWS documentation for best practices using the search_documentation tool from the awslabs.aws-documentation-mcp-server MCP server.
search documentaion-img


Following the search of the relevant AWS documentation, it will read the documentation using the read_documentation tool from the awslabs.aws-documentation-mcp-server MCP server.
read_document-img

Amazon Q CLI will then list the needed AWS service icons using the list_icons tool, and will use the generate_diagram tool from the awslabs.aws-diagram-mcp-server MCP server.
list and generate on cli

You should receive an output with a description of the diagram created based on the prompt along with the location of where the diagram was saved.
final-output-1stexample

Amazon Q CLI will generate and display the diagram.

The generated diagram shows the following key components:

Example 2: Create a multi-tier architecture

Multi-tier architectures separate applications into functional layers (presentation, application, and data) to improve scalability and security. We use the following prompt to create our diagram:

Create a diagram for a three-tier web application with a presentation tier (ALB and CloudFront), application tier (ECS with Fargate), and data tier (Aurora PostgreSQL). Include VPC with public and private subnets across multiple AZs. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram shows the following key components:

  • A presentation tier in public subnets
  • An application tier in private subnets
  • A data tier in isolated private subnets
  • Proper security group configurations
  • Traffic flow between tiers

Example 3: Create a serverless architecture

We use the following prompt to create a diagram for a serverless architecture:

Create a diagram for a serverless web application using API Gateway, Lambda, DynamoDB, and S3 for static website hosting. Include Cognito for user authentication and CloudFront for content delivery. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

Example 4: Create a data processing diagram

We use the following prompt to create a diagram for a data processing pipeline:

Create a diagram for a data processing pipeline with components organized in clusters for data ingestion, processing, storage, and analytics. Include Kinesis, Lambda, S3, Glue, and QuickSight. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram organizes components into distinct clusters:

Real-world examples

Let’s explore some real-world architecture patterns and how to create diagrams for them using Amazon Q CLI with the AWS Diagram MCP server.

Ecommerce platform

Ecommerce platforms require scalable, resilient architectures to handle variable traffic and maintain high availability. We use the following prompt to create an example diagram:

Create a diagram for an e-commerce platform with microservices architecture. Include components for product catalog, shopping cart, checkout, payment processing, order management, and user authentication. Ensure the architecture follows AWS best practices for scalability and security. Check for AWS documentation to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

  • Amazon API Gateway as the entry point for client applications, providing a secure and scalable interface
  • Microservices implemented as containers in ECS with Fargate, enabling flexible and scalable processing
  • Amazon RDS databases for product catalog, shopping cart, and order data, providing reliable structured data storage
  • Amazon ElastiCache for product data caching and session management, improving performance and user experience
  • Amazon Cognito for authentication, ensuring secure access control
  • Amazon Simple Queue Service and Amazon Simple Notification Service for asynchronous communication between services, enabling decoupled and resilient architecture
  • Amazon CloudFront for content delivery and static assets from S3, optimizing global performance
  • Amazon Route53 for DNS management, providing reliable routing
  • AWS WAF for web application security, protecting against common web exploits
  • AWS Lambda functions for serverless microservice implementation, offering cost-effective scaling
  • AWS Secrets Manager for secure credential storage, enhancing security posture
  • Amazon CloudWatch for monitoring and observability, providing insights into system performance and health

Intelligent document processing solution

We use the following prompt to create a diagram for an intelligent document processing (IDP) architecture:

Create a diagram for an intelligent document processing (IDP) application on AWS. Include components for document ingestion, OCR and text extraction, intelligent data extraction (using NLP and/or computer vision), human review and validation, and data output/integration. Ensure the architecture follows AWS best practices for scalability and security, leveraging services like S3, Lambda, Textract, Comprehend, SageMaker (for custom models, if applicable), and potentially Augmented AI (A2I). Check for AWS documentation related to intelligent document processing best practices to ensure it adheres to AWS best practices before you create the diagram.

The diagram includes the following key components:

Clean up

If you no longer need to use the AWS Diagram MCP and AWS Documentation MCP servers with Amazon Q CLI, you can remove them from your configuration:

  1. Open your ~/.aws/amazonq/mcp.json file.
  2. Remove or comment out the MCP server entries.
  3. Save the file.

This will prevent the servers from being loaded when you start Amazon Q CLI in the future.

Conclusion

In this post, we explored how to use Amazon Q CLI with the AWS Documentation MCP and AWS Diagram MCP servers to create professional AWS architecture diagrams that adhere to AWS best practices referenced from official AWS documentation. This approach offers significant advantages over traditional diagramming methods:

  • Time savings – Generate complex diagrams in minutes instead of hours
  • Consistency – Make sure diagrams follow the same style and conventions
  • Best practices – Automatically incorporate AWS architectural guidelines
  • Iterative refinement – Quickly modify diagrams through simple prompts
  • Validation – Check architectures against official AWS documentation and recommendations

As you continue your journey with AWS architecture diagrams, we encourage you to deepen your knowledge by learning more about the Model Context Protocol (MCP) to understand how it enhances the capabilities of Amazon Q. When seeking inspiration for your own designs, the AWS Architecture Center offers a wealth of reference architectures that follow best practices. For creating visually consistent diagrams, be sure to visit the AWS Icons page, where you can find the complete official icon set. And to stay at the cutting edge of these tools, keep an eye on updates to the official AWS MCP Servers—they’re constantly evolving with new features to make your diagramming experience even better.


About the Authors

Joel Asante, an Austin-based Solutions Architect at Amazon Web Services (AWS), works with GovTech (Government Technology) customers. With a strong background in data science and application development, he brings deep technical expertise to creating secure and scalable cloud architectures for his customers. Joel is passionate about data analytics, machine learning, and robotics, leveraging his development experience to design innovative solutions that meet complex government requirements. He holds 13 AWS certifications and enjoys family time, fitness, and cheering for the Kansas City Chiefs and Los Angeles Lakers in his spare time.

Dunieski Otano is a Solutions Architect at Amazon Web Services based out of Miami, Florida. He works with World Wide Public Sector MNO (Multi-International Organizations) customers. His passion is Security, Machine Learning and Artificial Intelligence, and Serverless. He works with his customers to help them build and deploy high available, scalable, and secure solutions. Dunieski holds 14 AWS certifications and is an AWS Golden Jacket recipient. In his free time, you will find him spending time with his family and dog, watching a great movie, coding, or flying his drone.

Varun Jasti is a Solutions Architect at Amazon Web Services, working with AWS Partners to design and scale artificial intelligence solutions for public sector use cases to meet compliance standards. With a background in Computer Science, his work covers broad range of ML use cases primarily focusing on LLM training/inferencing and computer vision. In his spare time, he loves playing tennis and swimming.

Read More

Instruction-Following Pruning for Large Language Models

With the rapid scaling of large language models (LLMs), structured pruning has become a widely used technique to learn efficient, smaller models from larger ones, delivering superior performance compared to training similarly sized models from scratch. In this paper, we move beyond the traditional static pruning approach of determining a fixed pruning mask for a model, and propose a dynamic approach to structured pruning. In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction. Our approach, termed…Apple Machine Learning Research

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

In recent years, there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image…Apple Machine Learning Research

Evaluating Long Range Dependency Handling in Code Generation LLMs

As language models support larger and larger context sizes, evaluating their ability to make effective use of that context becomes increasingly important. We analyze the ability of several code generation models to handle long range dependencies using a suite of multi-step key retrieval tasks in context windows up to 8k tokens in length. The tasks progressively increase in difficulty and allow more nuanced evaluation of model capabilities than tests like the popular needle-in-the-haystack test. We find that performance degrades significantly for many models (up to 2x) when a function…Apple Machine Learning Research

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented…Apple Machine Learning Research

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Egocentric Video Question Answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges like frequent camera movement. This paper systematically evaluates both proprietary and open-source Multimodal Large Language Models (MLLMs) on QaEgo4Dv2—a refined dataset of egocentric videos derived from QaEgo4D. Four popular MLLMs (GPT-4o, Gemini-1.5-Pro, Video-LLaVa-7B and Qwen2-VL-7B-Instruct) are assessed using zero-shot and fine-tuned approaches for both OpenQA and CloseQA settings. We introduce QaEgo4Dv2 to mitigate annotation noise…Apple Machine Learning Research