Prompting for the best price-performance

In the drive to remain competitive, businesses today are turning to AI to help them minimize cost and maximize efficiency. It’s incumbent on them to find the most suitable AI model—the one that will help them achieve more while spending less. For many businesses, the migration from OpenAI’s model family to Amazon Nova represents not only a shift in model but a strategic move toward scalability, efficiency, and broader multimodal capabilities.

In this blog, we discuss how to optimize prompting in Amazon Nova for the best price-performance.

Why migrate from OpenAI to Amazon Nova?

OpenAI’s models remain powerful, but their operational costs can be prohibitive when scaled. Consider these figures from Artificial Analysis:

Model       | Input token cost (per million tokens) | Output token cost (per million tokens) | Context window    | Output speed (tokens per second) | Latency (seconds to first token)
GPT-4o      | ~$2.50  | ~$10.00 | Up to 128K tokens | ~63  | ~0.49
GPT-4o Mini | ~$0.15  | ~$0.60  | Up to 128K tokens | ~90  | ~0.43
Nova Micro  | ~$0.035 | ~$0.14  | Up to 128K tokens | ~195 | ~0.29
Nova Lite   | ~$0.06  | ~$0.24  | Up to 300K tokens | ~146 | ~0.29
Nova Pro    | ~$0.80  | ~$3.20  | Up to 300K tokens | ~90  | ~0.34

For high-volume applications—like global customer support or large-scale document analysis—these cost differences are substantial. Not only does Amazon Nova Pro offer over three times the cost-efficiency of GPT-4o, its longer context window also enables it to handle more extensive and complex inputs.
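As a rough illustration of how these rates compound at scale, the following sketch computes an approximate monthly bill for a hypothetical workload of 1 billion input tokens and 200 million output tokens (the volumes are assumptions for illustration; the per-token prices come from the preceding table):

# Hypothetical monthly volumes, expressed in millions of tokens (assumed for illustration only)
input_millions = 1_000   # 1 billion input tokens
output_millions = 200    # 200 million output tokens

# Approximate per-million-token prices from the table above (USD)
prices = {
    "GPT-4o":      {"input": 2.50,  "output": 10.00},
    "Nova Pro":    {"input": 0.80,  "output": 3.20},
    "GPT-4o Mini": {"input": 0.15,  "output": 0.60},
    "Nova Micro":  {"input": 0.035, "output": 0.14},
}

for model, p in prices.items():
    monthly_cost = input_millions * p["input"] + output_millions * p["output"]
    print(f"{model:12s} ~${monthly_cost:,.0f} per month")

# GPT-4o       ~$4,500 per month
# Nova Pro     ~$1,440 per month (roughly 3x cheaper than GPT-4o)
# GPT-4o Mini  ~$270 per month
# Nova Micro   ~$63 per month (roughly 4.3x cheaper than GPT-4o Mini)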

Breaking down the Amazon Nova suite

Amazon Nova isn’t a single model—it’s a suite designed for various needs:

  • Amazon Nova Pro – A robust multimodal model that can process text, images, and video. It excels at tasks such as document analysis and deep data visualization. Benchmark comparisons show Amazon Nova Pro matching or even surpassing GPT-4o on complex reasoning tasks, according to section 2.1.1 of the Nova technical report and model card.
  • Amazon Nova Lite – Offers a balanced mix of multimodal processing and speed. Amazon Nova Lite is ideal for applications such as document summarization, translation, and even basic visual search, delivering quality outputs at lower latency and cost compared to GPT-4o Mini. You can find these benchmark results in section 2.1.2 of the Nova technical report and model card.
  • Amazon Nova Micro – A text-only model engineered for ultra-low latency. With output speed of up to 195 tokens per second, Amazon Nova Micro is perfect for real-time applications such as chat-based assistants and automated FAQs. Its token costs are dramatically lower than those of GPT-4o Mini—roughly 4.3 times cheaper on a per-token basis.

The lower per-token costs and higher output per second of Amazon Nova give you the flexibility to simplify prompts for real-time applications so you can balance quality, speed, and cost for your use case.

Understanding the foundations

To make the best decision about which model family fits your needs, it’s important to understand the differences in prompt engineering best practices between OpenAI and Amazon Nova. Each model family has its own strengths, but some principles apply to both: accurate, high-quality output comes from clear instructions, structured prompts, and iterative refinement. Whether you’re using strong output directives or clearly defined use cases, the goal is to reduce ambiguity and improve response quality.

The OpenAI approach

OpenAI uses a layered messaging system for prompt engineering, where system, developer, and user prompts work in harmony to control tone, safety, and output format. Their approach emphasizes:

  • Hierarchical message roles – Setting the model’s role and behavior using system messages makes sure that the overarching safety and style guidelines (set in system prompts) are preserved
  • Instruction placement and delimiters – Directives are placed at the beginning, with clear separation between context, examples, and queries
  • Selective chain-of-thought – Detailed, step-by-step reasoning is used when it benefits complex tasks
  • Formatting and structure – Using strong directives such as DO, MUST, and DO NOT to provide consistent outputs (for example, in JSON)
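The following minimal sketch illustrates these conventions with the OpenAI Python SDK; the system message, delimiters, and output directive are illustrative assumptions rather than text from an OpenAI guide:

from openai import OpenAI

client = OpenAI()

# System message sets the role, style guidelines, and a strong output directive
system_prompt = (
    "You are a customer support assistant. "
    "You MUST respond in JSON only. DO NOT include text outside the JSON object."
)

# Delimiters separate context from the actual query
user_prompt = """### Context ###
The customer ordered item #1234 on March 1st and chose standard shipping.

### Question ###
Where is my order?"""

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(completion.choices[0].message.content)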

The Amazon Nova approach

  • Define the prompt use case
    • Task – What exactly the model should do
    • Role – Which role the model should assume
    • Response style – The structure or tone of the output
    • Instructions – Guidelines the model must follow
  • Chain-of-thought
    • Explicitly state the response constraints – Provide clear and strong instructions to limit the model’s response
    • Structured thinking – Ask the model to think step by step to encourage structured reasoning
  • Formatting and structure
    • Use delimiters to section your prompts, for example, ##Task##, ##Context##, or ##Example##
    • Specify the output format, for example, JSON, YAML, or Markdown
    • Use strong instructions and caps, such as DO, DO NOT, or MUST
    • Prefill the response to guide the model, for example, start with "{" or "json…" (see the sketch after this list)
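As a minimal sketch of the prefilling technique with the Amazon Bedrock Converse API (the prompt content is illustrative, and the sketch assumes the model accepts an assistant-turn prefill):

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

messages = [
    {"role": "user", "content": [{"text": "##TASK##\nList three benefits of Amazon Nova as a JSON object."}]},
    # Prefill the assistant turn so the model continues directly from "{" and returns JSON
    {"role": "assistant", "content": [{"text": "{"}]},
]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"temperature": 0.2, "topP": 0.9},
)

# The response continues from the prefilled "{", so prepend it when parsing the JSON
print("{" + response["output"]["message"]["content"][0]["text"])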

Evolving your prompt strategy: Migrating from OpenAI to Amazon Nova

Transitioning to Amazon Nova isn’t merely a change in API endpoints—it requires retooling your prompt engineering to align with the strengths of Amazon Nova. You need to reframe your use case definition. Begin by breaking down your current GPT-4o or GPT-4o Mini prompt into its core elements of task, role, response style, and instructions and success criteria. Make sure to structure these elements clearly to provide a blueprint for the model.

To understand how to migrate an existing OpenAI prompt to work optimally for Amazon Nova Pro, consider the following example using the meeting notes summarizer. Here is the GPT-4o system prompt:

You will be provided with meeting notes, and your task is to summarize the meeting 
as follows: 
-Overall summary of discussion 
-Action items (what needs to be done and who is doing it) 
-If applicable, a list of topics that need to be discussed more fully in the next meeting.

The user prompt is the meeting notes that need to be summarized:

Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters...

GPT produces this helpful response:

Overall Summary of Discussion: The team discussed the recent mission to Planet Zog, 
the space pirate issue in Sector 7, the annual Intergalactic Bake-Off, planning for 
the upcoming charity fundraiser, and a proposed team-building retreat. The mission to 
Planet Zog was successful, but communication with the Zogians was challenging. The team 
also discussed the need for a better strategy to deal with space pirates. The team placed 
second in the Intergalactic Bake-Off and is planning for the upcoming charity fundraiser. 
A team-building retreat to the Moon Resort and Spa was proposed.

Action Items:
1. Dr. Quasar to start working on a Zogian-English dictionary.
2. Captain Stardust to speak with Admiral Starbeam about increasing patrols in Sector 7.
3. Dr. Quasar to finalize the prototype of the new cloaking technology.
4. All team members to start gathering supplies and preparing games for the Intergalactic 
Charity Bazaar.
5. Captain Stardust to check the budget for the proposed team-building retreat to the Moon 
Resort and Spa.

Topics for Next Meeting:
1. Update on the Zogian-English dictionary.
2. Progress report on the cloaking technology.
3. Results of increased patrols in Sector 7.
4. Final preparations for the Intergalactic Charity Bazaar.

To meet or exceed the quality of the response from GPT-4o, here is what an Amazon Nova Pro prompt might look like. The prompt uses the same best practices discussed in this post, starting with the system prompt. We used a temperature of 0.2 and a topP of 0.9 here:

You are an experienced executive assistant skilled in meeting note analysis and 
summarization. Your primary responsibilities include distilling complex discussions
into clear, actionable summaries.
Follow these instructions:

##INSTRUCTIONS##
1. Read and understand the meeting notes found in ##NOTES##
2. Put all of your outputs in a section called ##OUTPUTS## in markdown formatting
3. Summarize the meeting notes in 5 sentences or less. Put this in a section called 
"Overall Summary".
4. Numerically list any action items for specific people and what needs to be completed. 
Put this list in a section called "Action Items".
5. If applicable, list the topics that need to be discussed more fully in the next meeting. 
Put this in a section called "Topics for Next Meeting".

Here’s the user prompt, using prefilled responses:

##NOTES##
Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters
Attendees:
- Captain Stardust
- Dr. Quasar
- Lady Nebula
- Sir Supernova
- Ms. Comet
Meeting called to order by Captain Stardust at 2:05 PM
1. Introductions and welcome to our newest team member, Ms. Comet
2. Discussion of our recent mission to Planet Zog
- Captain Stardust: "Overall, a success, but communication with the Zogians was difficult. 
We need to improve our language skills."
- Dr. Quasar: "Agreed. I'll start working on a Zogian-English dictionary right away."
- Lady Nebula: "The Zogian food was out of this world, literally! We should consider having 
a Zogian food night on the ship."
3. Addressing the space pirate issue in Sector 7
- Sir Supernova: "We need a better strategy for dealing with these pirates. They've already 
plundered three cargo ships this month."
- Captain Stardust: "I'll speak with Admiral Starbeam about increasing patrols in that area."
- Dr. Quasar: "I've been working on a new cloaking technology that could help our ships avoid 
detection by the pirates. I'll need a few more weeks to finalize the prototype."
4. Review of the annual Intergalactic Bake-Off
- Lady Nebula: "I'm happy to report that our team placed second in the competition! Our Martian Mud 
Pie was a big hit!"
- Ms. Comet: "Let's aim for first place next year. I have a secret recipe for Jupiter Jello that I 
think could be a winner."
5. Planning for the upcoming charity fundraiser
- Captain Stardust: "We need some creative ideas for our booth at the Intergalactic Charity Bazaar."
- Sir Supernova: "How about a 'Dunk the Alien' game? We can have people throw water balloons at a 
volunteer dressed as an alien."
- Dr. Quasar: "I can set up a 'Name That Star' trivia game with prizes for the winners."
- Lady Nebula: "Great ideas, everyone. Let's start gathering the supplies and preparing the games."
6. Upcoming team-building retreat
- Ms. Comet: "I would like to propose a team-building retreat to the Moon Resort and Spa. It's a great 
opportunity to bond and relax after our recent missions."
- Captain Stardust: "Sounds like a fantastic idea. I'll check the budget and see if we can make it happen."
7. Next meeting agenda items
- Update on the Zogian-English dictionary (Dr. Quasar)
- Progress report on the cloaking technology (Dr. Quasar)
- Results of increased patrols in Sector 7 (Captain Stardust)
- Final preparations for the Intergalactic Charity Bazaar (All)
Meeting adjourned at 3:15 PM. Next meeting scheduled for March 19th, 2050 at 2:00 PM in Conference Room 3B, 
Intergalactic Headquarters.

##OUTPUTS##

The following example shows that the Amazon Nova response meets and exceeds the accuracy of the OpenAI example, formats the output in Markdown, and has found clear owners for each action item:

### Overall Summary
The meeting covered the successful mission to Planet Zog, the ongoing issue with 
space pirates in Sector 7, the results of the annual Intergalactic Bake-Off, plans 
for the upcoming charity fundraiser, and a proposed team-building retreat. Action items 
were assigned, and the next meeting's agenda was set.

### Action Items
1. **Dr. Quasar**: Begin working on a Zogian-English dictionary.
2. **Captain Stardust**: Speak with Admiral Starbeam about increasing patrols in Sector 7.
3. **Dr. Quasar**: Finalize the prototype for the new cloaking technology within a few weeks.
4. **Team**: Start gathering supplies and preparing games for the Intergalactic Charity Bazaar.
5. **Captain Stardust**: Check the budget for the proposed team-building retreat to the Moon Resort and Spa.

### Topics for Next Meeting
1. Update on the Zogian-English dictionary (Dr. Quasar)
2. Progress report on the cloaking technology (Dr. Quasar)
3. Results of increased patrols in Sector 7 (Captain Stardust)
4. Final preparations for the Intergalactic Charity Bazaar (All)

A few updates to the prompt achieve comparable or better results from Amazon Nova Pro at a much lower inference cost.
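For reference, here is a sketch of how the prompt above might be sent to Amazon Nova Pro through the Amazon Bedrock Converse API with the settings described earlier; the model ID and the truncated string contents are assumptions to keep the example short:

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

system_prompt = "You are an experienced executive assistant ..."  # full system prompt from above
meeting_notes = "Meeting Date: March 5th, 2050 ..."               # full meeting notes from above

response = client.converse(
    modelId="us.amazon.nova-pro-v1:0",  # assumed Nova Pro model ID; check the model list in your Region
    system=[{"text": system_prompt}],
    messages=[{
        "role": "user",
        "content": [{"text": f"##NOTES##\n{meeting_notes}\n\n##OUTPUTS##"}],
    }],
    inferenceConfig={"temperature": 0.2, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])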

Employ Amazon Nova extended context

Amazon Nova Lite and Amazon Nova Pro can support up to 300,000 input tokens, which means that you can include more context in your prompt if needed. Expand your background data and detailed instructions accordingly—if your original OpenAI prompt was optimized for 128,000 tokens, adjust it to use the Amazon Nova extended window.

Tailor output constraints

If your GPT prompt required strict formatting (for example, “Respond in JSON only”), make sure that your Amazon Nova prompt includes these directives. Additionally, if your task involves multimodal inputs, specify when to include images or video references.

Function calling

The rise of generative AI agents has made function calling, or tool calling, one of the most important abilities of a given large language model (LLM). A model’s ability to correctly pick the right tool for the job, in a low-latency manner, is often the difference between success and failure of an agentic system.

Both OpenAI and Amazon Nova models share similarities in function calling, in particular their support for structured API calls. Both model families support tool selection through defined tool schemas, which we discuss later in this post. They also both provide a mechanism to decide when to invoke these tools or not.

OpenAI’s function calling uses flexible JSON schemas to define and structure API interactions. The models support a wide range of schema configurations, which give developers the ability to quickly implement external function calls through straightforward JSON definitions tied to their API endpoints.

Here is an example of a function:

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Montevideo, Uruguay"
                }
            },
            "required": [
                "location"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the weather like in Punta del Este today?"}],
    tools=tools
)
Similar to OpenAI’s approach, Amazon Nova can call tools when passed a configuration schema, as shown in the following code example. Amazon Nova relies heavily on greedy decoding when calling tools, so it’s advised to set temperature, topP, and topK to 1. This gives the model the highest accuracy in tool selection. These greedy decoding parameters and other examples of tool use are covered in detail in Tool use (function calling) with Amazon Nova.

The following is an example of function calling without using additionalModelRequestFields:

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_recipe",
            "description": "Structured recipe generation system",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "recipe": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "ingredients": {
                                    "type": "array",
                                    "items": {
                                        "type": "object",
                                        "properties": {
                                            "item": {"type": "string"},
                                            "amount": {"type": "number"},
                                            "unit": {"type": "string"}
                                        }
                                    }
                                },
                                "instructions": {
                                    "type": "array",
                                    "items": {"type": "string"}
                                }
                            },
                            "required": ["name", "ingredients", "instructions"]
                        }
                    }
                }
            }
        }
    }]
}

# Base configuration without topK=1
input_text = "I need a recipe for chocolate lava cake"
messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

# Inference parameters
inf_params = {"topP": 1, "temperature": 1}

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params
)
# Typically produces less structured or incomplete output
 

The following example shows how function calling accuracy can be improved by using additionalModelRequestFields:

# Enhanced configuration with topK=1
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Produces more accurate and structured function call

To maximize Amazon Nova function calling potential and improve accuracy, always use additionalModelRequestFields with topK=1. This forces the model to select the single most probable token and prevents random token selection. This increases deterministic output generation and improves function call precision by about 30–40%.

The following code examples further explain how to conduct tool calling successfully. The first scenario shows recipe generation without an explicit tool. The example doesn’t use topK, which typically results in responses that are less structured:

input_text = """
I'm looking for a decadent chocolate dessert that's quick to prepare. 
Something that looks fancy but isn't complicated to make.
"""

messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"topP": 1, "temperature": 1}
)
# Generates a conversational recipe description
# Less structured, more narrative-driven response

The second scenario shows recipe generation with a structured tool. We add topK set to 1, which produces more structured output:

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig={"topP": 1, "temperature": 1},
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Generates a highly structured, JSON-compliant recipe
# Includes precise ingredient measurements
# Provides step-by-step instructions

Overall, OpenAI offers more flexible, broader schema support. Amazon Nova provides more precise, controlled output generation and is the best choice when working with high-stakes, structured data scenarios, as demonstrated in Amazon Nova’s performance on the IFEval benchmark discussed in section 2.1.1 of the technical report and model card. We recommend using Amazon Nova for applications requiring predictable, structured responses because its function calling methodology provides superior control and accuracy.

Conclusion

The evolution from OpenAI’s models to Amazon Nova represents a significant shift in using AI. It shows a transition toward models that deliver similar or superior performance at a fraction of the cost, with expanded capabilities in multimodal processing and extended context handling.

Whether you’re using the robust, enterprise-ready Amazon Nova Pro, the agile and economical Amazon Nova Lite, or the versatile Amazon Nova Micro, the benefits are clear:

  • Cost savings – With token costs up to four times lower, businesses can scale applications more economically
  • Enhanced response performance – Faster response times (up to 195 tokens per second) make real-time applications more viable
  • Expanded capabilities – A larger context window and multimodal support unlock new applications, from detailed document analysis to integrated visual content

By evolving your prompt strategy—redefining use cases, exploiting the extended context, and iteratively refining instructions—you can smoothly migrate your existing workflows from OpenAI’s GPT-4o and GPT-4o Mini models to the innovative world of Amazon Nova.


About the Authors

Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio excels at guiding customers through their generative AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.

Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available

Organizations deploying generative AI applications need robust ways to evaluate their performance and reliability. When we launched LLM-as-a-judge (LLMaaJ) and Retrieval Augmented Generation (RAG) evaluation capabilities in public preview at AWS re:Invent 2024, customers used them to assess their foundation models (FMs) and generative AI applications, but asked for more flexibility beyond Amazon Bedrock models and knowledge bases. Today, we’re excited to announce the general availability of these evaluation features in Amazon Bedrock Evaluations, along with significant enhancements that make them fully environment-agnostic.

The general availability release introduces “bring your own inference responses” capabilities for both RAG evaluation and model evaluation. This means you can now evaluate a RAG system or model—whether running on Amazon Bedrock, other cloud providers, or on premises—as long as you provide evaluation data in the required format. You can even evaluate the final responses of a fully built application. We’ve also added new citation metrics for the already-powerful RAG evaluation suite, including citation precision and citation coverage, to help you better assess how accurately your RAG system uses retrieved information. These metrics provide deeper insights into whether responses are properly grounded in the source documents and if they’re using the most relevant retrieved passages.

In this post, we explore these new features in detail, showing you how to evaluate both RAG systems and models with practical examples. We demonstrate how to use the comparison capabilities to benchmark different implementations and make data-driven decisions about your AI deployments. Whether you’re using Amazon Bedrock or other environments for your generative AI applications, these evaluation tools can help you optimize performance and promote quality across your entire generative AI portfolio.

What’s new: RAG evaluation

In this section, we discuss what’s new with RAG evaluation, including the introduction of Bring Your Own Inference responses for evaluating external RAG systems and new citation metrics that measure precision and coverage of RAG citations.

Bring Your Own Inference responses

The RAG evaluation capabilities have been significantly expanded to allow evaluation of outputs from RAG systems hosted anywhere through the evaluation environment in Amazon Bedrock. With Bring Your Own Inference (BYOI) responses, you can now evaluate retrieval and generation results from a variety of sources, including other FM providers, custom-built RAG systems, or deployed open-weights solutions, by providing the outputs in the required format. Additionally, the introduction of new citation metrics alongside our previously released quality and responsible AI metrics provides deeper insights into how well RAG systems use their knowledge bases and source documents.

When evaluating retrieval results from a knowledge base, we allow one knowledge base to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn currently has a prompt and referenceResponses. When using BYOI, you must provide your retrieval results in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). You must provide a knowledgeBaseIdentifier for every output. Additionally, for every retrieval result you bring, you can provide a name and additional metadata in the form of key-value pairs. The following is the input JSONL format for RAG evaluation jobs (Retrieve). Fields marked with ? are optional.

In the public preview period that ended March 20, 2025, the following dataset format contained a key called referenceContexts. This field has now been changed to referenceResponses to align with the Retrieve and Generate evaluation. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages or chunks retrieved from the knowledge base.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
       "output"?: {
        "knowledgeBaseIdentifier": string
        "retrievedResults": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        }
      }
    }
  ]
}

When evaluating retrieval and generation results from a knowledge base or RAG system, we allow one knowledge base (RAG system) to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn has a prompt and referenceResponses. When using BYOI, you must provide the generated text, retrieved passages (for example, retrieval results), and citations for the generated text in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). The citations field will be used by the citation metrics. If you don’t have citations, you can enter dummy data in those fields and make sure to not select the citation precision or citation coverage metrics for the evaluation job. You must also provide a knowledgeBaseIdentifier for every output you bring. This knowledgeBaseIdentifier should be the same for all lines in the dataset and also at the job level. The following is the input JSONL format for RAG evaluation jobs (Retrieve and Generate). Fields marked with ? are optional.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
      "output"?: {
        "text": string
        "modelIdentifier"?: string
        "knowledgeBaseIdentifier": string
        "retrievedPassages": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        },
        "citations": [
          {
            "generatedResponsePart": {
              "textResponsePart": {
                "span": {
                  "start": number
                  "end": number
                  
                },
                "text": string
              }
            },
            "retrievedReferences": [
              {
                "name"?: string
                "content": {
                  "text": string
                },
                "metadata"?: {
                  [key: string]: string
                }
              }
            ]
          }
        ]
      }
    }
  ]
}
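To make the format concrete, the following sketch builds one record matching this schema and appends it to a JSONL file; the field values are placeholders, and only the structure comes from the format above:

import json

record = {
    "conversationTurns": [{
        "prompt": {"content": [{"text": "What is Amazon's SEC file number?"}]},
        "referenceResponses": [{"content": [{"text": "Amazon's SEC file number is 000-22513."}]}],
        "output": {
            "text": "Amazon's SEC file number is 000-22513.",
            "knowledgeBaseIdentifier": "third-party-RAG",
            "retrievedPassages": {
                "retrievalResults": [
                    {"content": {"text": "Commission File No. 000-22513"}}
                ]
            },
            "citations": [{
                "generatedResponsePart": {
                    "textResponsePart": {
                        "span": {"start": 0, "end": 38},
                        "text": "Amazon's SEC file number is 000-22513."
                    }
                },
                "retrievedReferences": [
                    {"content": {"text": "Commission File No. 000-22513"}}
                ]
            }]
        }
    }]
}

# One JSON object per line (JSONL)
with open("rag_byoi_dataset.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")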

Citation metrics

Citation precision measures how accurately a RAG system cites its sources by evaluating whether the cited passages actually contain information used in the response. This metric helps identify when responses include unnecessary or irrelevant citations—a common issue where models might cite passages that weren’t actually used to generate the response content. The metric produces a score between 0–1, where 1 indicates perfect citation precision (all cited passages were relevant and used in the response), and 0 indicates complete imprecision (none of the cited passages should have been used). This helps teams identify and fix cases where their RAG systems are citing passages indiscriminately rather than selectively citing only the passages that contributed to the response.

Citation coverage evaluates how well a response’s content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited. This metric works by comparing the faithfulness of the response to both the cited passages and the full set of retrieved passages, then calculating their ratio. In cases where the response is completely unfaithful to the retrieved passages (for example, hallucination), the metric indicates that citation coverage assessment isn’t applicable. The metric provides a score from 0–1, where scores closer to 0 indicate that the model failed to cite relevant supporting passages, and scores closer to 1 suggest proper citation of information from the retrieved passages. Importantly, a perfect score of 1 doesn’t necessarily mean all information in the response is cited—rather, it indicates that all information that could have been cited (based on the retrieved passages) was properly cited. This helps teams identify when their RAG systems are missing opportunities to cite relevant source material.
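Conceptually, the coverage score behaves like the ratio described above. The following sketch is an illustration of that idea only, not the implementation Amazon Bedrock uses:

def citation_coverage(faithfulness_to_cited: float, faithfulness_to_retrieved: float):
    """Illustrative only: both inputs are faithfulness scores in [0, 1]."""
    if faithfulness_to_retrieved == 0:
        # Response is unfaithful to the retrieved passages (e.g., hallucination);
        # the metric is reported as not applicable in this case.
        return None
    return min(faithfulness_to_cited / faithfulness_to_retrieved, 1.0)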

What’s new: Model evaluation

In this section, we discuss what’s new with model evaluation, including expanded Bring Your Own Inference capabilities that allow you to evaluate model responses from external providers and deployed solutions through both LLMaaJ and human evaluation workflows.

Bring Your Own Inference responses

The model evaluation capabilities have been significantly expanded to allow evaluation of outputs from models hosted anywhere through the evaluation environment in Amazon Bedrock. With BYOI responses, you can now evaluate model responses from other FM providers or deployed solutions by providing the outputs in the required format. BYOI is available for both LLMaaJ and human evaluation workflows.

You don’t have to limit these evaluations to foundation models. Because BYOI evaluation takes in the prompt and the output, it can evaluate the final response of a full application if you choose to bring that into your dataset.

When using LLMaaJ, only one model can be evaluated per evaluation job. Consequently, you must provide just a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLMaaJ with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
     "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}
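As a sketch, you might convert existing prompt and response pairs into this format and upload the file to Amazon S3. The bucket, prefix, and example pair below are assumptions:

import json
import boto3

pairs = [
    ("If books cost $10.21 each and you buy 15 with a 20% discount, what do you pay?",
     "You pay $122.52 in total."),
]

with open("byoi_dataset.jsonl", "w") as f:
    for prompt, response in pairs:
        record = {
            "prompt": prompt,
            "modelResponses": [{
                "response": response,
                "modelIdentifier": "third-party-model",  # must match the source name used in the job
            }],
        }
        f.write(json.dumps(record) + "\n")

s3 = boto3.client("s3")
s3.upload_file("byoi_dataset.jsonl", "<YOUR_S3_BUCKET_NAME>", "<YOUR_BUCKET_PREFIX>/byoi_dataset.jsonl")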

When using human evaluation, up to two models can be evaluated per evaluation job. Consequently, you can provide up to two entries in the modelResponses list, with each response requiring a unique identifier. The modelResponses field is required for BYOI jobs but not needed for non-BYOI jobs. The following is the input JSONL format for human evaluation with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        },
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

Feature overview

The LLMaaJ evaluation workflow with BYOI enables you to systematically assess model outputs from your preferred source using Amazon Bedrock model evaluation capabilities. The process follows a straightforward flow, starting with selecting an evaluator model and configuring BYOI, then choosing appropriate metrics, uploading your evaluation datasets, running the assessment, and finally analyzing the detailed results. This comprehensive evaluation pipeline (as illustrated in the following diagram) helps provide consistent quality across your AI implementations, regardless of where they are deployed.

For RAG system evaluation, the workflow incorporates additional components and metrics specific to context retrieval analysis. The process begins with choosing an evaluator model, then selecting between retrieval-only or retrieve-and-generate evaluation modes. With BYOI support, you can evaluate RAG outputs from any source while using powerful LLMaaJ metrics to assess the quality of your retrievals or full end-to-end retrieve-and-generate flow of your RAG system. This end-to-end evaluation framework (see the following diagram) provides deep insights into how effectively your RAG system uses its retrieved context and generates accurate, well-supported responses.

Prerequisites

To use the LLMaaJ model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

Dataset description and preparation

For demonstrating an LLMaaJ evaluation job with BYOI, we created a shopping math problems dataset with a third-party model. The dataset contains 30 discount calculation problems, each with a prompt, reference answer, and model response following the required BYOI format. Each record uses the "third-party-model" identifier in the modelResponses array, enabling the LLMaaJ evaluator to assess response quality against your chosen metrics.

The dataset we used for a RAG evaluation job with BYOI was created using Amazon’s 10-K SEC filing information. Each record in the dataset contains questions about Amazon’s corporate information (such as SEC file numbers, operating segments, and fiscal year reporting), reference responses, and third-party model-generated answers with their corresponding retrieved passages and citations. The dataset follows the required BYOI format, with each record using "third-party-RAG" as the knowledge base identifier. This structure enables evaluation of both retrieve and generate quality metrics and citation accuracy, demonstrating how organizations can assess RAG systems regardless of where they’re deployed.

Start an LLMaaJ evaluation job with BYOI using the Amazon Bedrock console

In this first example, we use the prompt and inference responses dataset published in our amazon-bedrock-samples repository. You can use LLMaaJ with BYOI responses on Amazon Bedrock model evaluation to assess model performance through a user-friendly AWS Management Console interface. Follow these steps to start an evaluation job:

  1. On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
  2. On the Evaluations page, choose the Models tab.
  3. In the Model evaluations section, choose Create and choose Automatic: Model as a judge.
  4. On the Create automatic evaluation page, enter a name and description and under Evaluation model, select a model. This model will be used as a judge to evaluate your inference responses.
  5. Optionally, choose Tags, and create your own tags.
  6. Under Inference source, choose Bring your own inference responses as the inference source.
  7. For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset.

    {
        "prompt": "If books cost \$10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
        "referenceResponse": "The total price will be \$122.52. Original price: \$153.15 minus 20% discount (\$30.630000000000003)",
        "category": "Shopping Math",
        "modelResponses": [{
            "response": "To determine the total cost of buying 15 books with a 20% discount, we can follow these steps:nn1. Calculate the total cost of the books before the discount.n2. Determine the discount amount.n3. Subtract the discount amount from the total cost before the discount.nn**Step 1: Calculate the total cost of the books before the discount.**nnThe cost of one book is \$10.21. Therefore, the cost of 15 books is:n\[ 15 \times 10.21 = 153.15 \]nn**Step 2: Determine the discount amount.**nnThe discount is 20%, so we need to find 20% of \$153.15. To do this, we multiply \$153.15 by 0.20:n\[ 153.15 \times 0.20 = 30.63 \]nn**Step 3: Subtract the discount amount from the total cost before the discount.**nnThe total cost after the discount is:n\[ 153.15 - 30.63 = 122.52 \]nnTherefore, the total amount you will pay is:n\[ \boxed{122.52} \]",
            "modelIdentifier": "third-party-model"
        }]
    }

  8. Under Metrics, select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).
  9. Under Datasets, for Choose a prompt dataset and Evaluation results, enter the S3 URIs directly or use the Browse S3 option.
  10. Under Amazon Bedrock IAM role – Permissions, select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job.
  11. Specify the folder in the S3 bucket where the results of the model evaluation will be stored and choose Create.

When you choose Create, an input dataset validator checks that the format of your input prompt dataset is correct and helps you correct any formatting errors.

You will be able to see the evaluation job is In Progress. Wait for the job status to change to Complete. This may take minutes or hours depending on how long your prompts and responses are and if you used only a few prompts or hundreds of prompts.

  12. When the job is complete, choose the job to see its details and metrics summary.
  13. To view generation metrics details, scroll down in the model evaluation report and choose one of the individual metrics (like helpfulness or correctness) to see its detailed breakdown.
  14. To see each record’s prompt input, generation output (from bring your own inference responses), ground truth (which is optional), and individual scores, choose a metric and choose Prompt details.
  15. Hover over an individual score to view its detailed explanation.

Start an LLMaaJ evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating an LLMaaJ model evaluation job with your own inference responses, use the following steps.

  1. Set up the required configurations:
    import boto3
    import json
    import random
    from datetime import datetime
    import botocore
    
    # AWS Configuration
    REGION = "us-east-1"
    ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
    PREFIX = "<YOUR_BUCKET_PREFIX>"
    dataset_custom_name = "<YOUR_BYOI_DATASET_NAME>" # without the ".jsonl" extension
    
    # Initialize AWS clients
    bedrock_client = boto3.client('bedrock', region_name=REGION)
    s3_client = boto3.client('s3', region_name=REGION)

  2. Use the following function to create an LLMaaJ model evaluation job with your own inference responses. The precomputedInferenceSource parameter is used when you’re bringing your own pre-generated inference responses. The inferenceSourceIdentifier value must match the model identifier you used in your evaluation dataset. This identifier represents the model that generated the responses you want to evaluate.
    def create_llm_judge_evaluation(
        client,
        job_name: str,
        role_arn: str,
        input_s3_uri: str,
        output_s3_uri: str,
        evaluator_model_id: str,
        dataset_name: str = None,
        task_type: str = "General" # must be General for LLMaaJ
    ):    
        # All available LLM-as-judge metrics
        llm_judge_metrics = [
            "Builtin.Correctness",
            "Builtin.Completeness", 
            "Builtin.Faithfulness",
            "Builtin.Helpfulness",
            "Builtin.Coherence",
            "Builtin.Relevance",
            "Builtin.FollowingInstructions",
            "Builtin.ProfessionalStyleAndTone",
            "Builtin.Harmfulness",
            "Builtin.Stereotyping",
            "Builtin.Refusal"
        ]
    
        # Configure dataset
        dataset_config = {
            "name": dataset_name or "CustomDataset",
            "datasetLocation": {
                "s3Uri": input_s3_uri
            }
        }
    
        try:
            response = client.create_evaluation_job(
                jobName=job_name,
                roleArn=role_arn,
                applicationType="ModelEvaluation",
                evaluationConfig={
                    "automated": {
                        "datasetMetricConfigs": [
                            {
                                "taskType": task_type,
                                "dataset": dataset_config,
                                "metricNames": llm_judge_metrics
                            }
                        ],
                        "evaluatorModelConfig": {
                            "bedrockEvaluatorModels": [
                                {
                                    "modelIdentifier": evaluator_model_id
                                }
                            ]
                        }
                    }
                },
                inferenceConfig={
                    "models": [
                        {
                            "precomputedInferenceSource": {
                                "inferenceSourceIdentifier": "third-party-model"
                            }
                        }
                    ]
                },
                outputDataConfig={
                    "s3Uri": output_s3_uri
                }
            )
            return response
            
        except Exception as e:
            print(f"Error creating evaluation job: {str(e)}")
            raise

  3. Use the following code to create an LLMaaJ evaluation job, specifying the evaluator model and configuring all necessary parameters such as input dataset location, output path, and task type. The job configuration creates an evaluation task that will assess your model inference responses, with results being stored in the designated S3 bucket for later analysis.
    # Job Configuration
    evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
    job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    # S3 Paths
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}"
    
    # Create evaluation job
    try:
        llm_as_judge_response = create_llm_judge_evaluation(
            client=bedrock_client,
            job_name=job_name,
            role_arn=ROLE_ARN,
            input_s3_uri=input_data,
            output_s3_uri=output_path,
            evaluator_model_id=evaluator_model,
            task_type="General"
        )
        print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
    except Exception as e:
        print(f"✗ Failed to create evaluation job: {str(e)}")
        raise

  4. Monitor the progress of your evaluation job:
    # Get job ARN based on job type
    evaluation_job_arn = llm_as_judge_response['jobArn']
    
    # Check job status
    check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
    print(f"Job Status: {check_status['status']}")

Start a RAG evaluation job with BYOI using the Amazon Bedrock console

When dealing with RAG systems, it’s important to evaluate both retrieval quality and the combined retrieval and generation flow. We already provide several important metrics for those workflows, and now you can also evaluate the citation quality of your RAG systems. You can include citations when using Amazon Bedrock Knowledge Bases. In this section, we explore how you can use a RAG evaluation job to surface two important citation metrics. If your custom RAG system uses citations, you can also plug your citation information into your input dataset with BYOI and use the citation precision and citation coverage metrics.

Best practice is to use both citation metrics, citation precision and citation coverage, together to give a complete view of the citation quality.

To get started using the console, complete the following steps:

  1. On the Amazon Bedrock console, under Inference and Assessments in the navigation pane, choose Evaluations.
  2. Choose the RAG tab.
  3. Choose Create. This will take you to a new page where you can set up your RAG evaluation job.
  4. Under Evaluation details, enter a name and description and choose an Evaluator model. In the example, we chose Claude 3 Haiku for demonstration, but we provide a choice of several judge models for flexibility. This model will be used as a judge to evaluate your inference responses.
  5. Under Inference source, choose Bring your own inference responses as the inference source.
  6. For Source name, enter a name that is the same as the knowledgeBaseIdentifier in your prompt and inference response dataset. For example, the following is a snippet of one record in our evaluation dataset:
    {
        "conversationTurns": [{
            "prompt": {
                "content": [{
                    "text": "What is Amazon's SEC file number?"
                }]
            },
            "referenceResponses": [{
                "content": [{
                    "text": "Amazon's SEC file number is 000-22513."
                }]
            }],
            "output": {
                "text": "Amazon's SEC file number is 000-22513.",
                "modelIdentifier": "third-party-model",
                "knowledgeBaseIdentifier": "third-party-RAG",
                "retrievedPassages": {
                    "retrievalResults": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }, {
                        "content": {
                            "text": "AMAZON.COM, INC. (Exact name of registrant as specified in its charter)"
                        }
                    }]
                },
                "citations": [{
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 0,
                                "end": 11
                            },
                            "text": "Amazon's SEC"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 12,
                                "end": 22
                            },
                            "text": "file number"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 23,
                                "end": 33
                            },
                            "text": "is 000-22513"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }]
            }
        }]
    }

Therefore, we use the source name third-party-RAG.

  7. Under Metrics, select the citation metrics Citation precision and Citation coverage. You can also select other metrics.
  8. Under Dataset and evaluation results S3 location, choose Browse S3 to enter the S3 URI for the evaluation input file, and specify the output location.
  9. Under Amazon Bedrock IAM role – Permissions, create a new service role or use an existing one.
  10. Choose Create.

If you see an error in the S3 URI, such as, “Your S3 bucket does not have the required CORS settings,” then you might need to edit your CORS setting on the bucket that has your data. For more information, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.

You can monitor your job; the status will be In progress while the job is running.

  11. When the job status is Completed, you can follow the link on a completed job to review results.

The results include a Metric summary section. In this particular example, both citation precision and citation coverage are high.

You can also set metric criteria to make sure the RAG system is performing up to your standards, and view example conversations as shown below each metric.

Start a RAG evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating a RAG evaluation job with your own inference responses, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the evaluator, IAM role with appropriate permissions, S3 paths for input data containing your inference responses, and output location for results:
    import boto3
    from datetime import datetime
    
    # Configure knowledge base and model settings
    evaluator_model = "<YOUR_EVALUATOR_MODEL>"
    role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
    PREFIX = "<YOUR_BUCKET_PREFIX>"
    RAG_dataset_custom_name = "<YOUR_RAG_BYOI_DATASET_NAME>" # without the ".jsonl" extension
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"
    
    # Create Bedrock client
    bedrock_client = boto3.client('bedrock')

  2. Use the following function to create a RAG evaluation job with your own pre-generated retrieval and generation responses. The precomputedRagSourceConfig parameter is used when you’re bringing your own pre-generated RAG responses for evaluation. The ragSourceIdentifier value must match the identifier you used in your evaluation dataset. This identifier represents the RAG system that produced the responses you want to evaluate. The code sets up an automated evaluation that will assess various dimensions of your RAG system’s performance, including correctness, completeness, helpfulness, logical coherence, faithfulness, and citation quality against the provided dataset.
    retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Evaluate retrieval and generation",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [
                {
                    "precomputedRagSourceConfig": {
                        "retrieveAndGenerateSourceConfig": {
                            "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                        }
                    }
                }
            ]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",  
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Helpfulness",
                        "Builtin.LogicalCoherence",
                        "Builtin.Faithfulness",
                        "Builtin.CitationPrecision",
                        "Builtin.CitationCoverage"
                    ]
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )

  3. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions.
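The following sketch shows that final step; the layout of the result keys under the output prefix is an assumption, so inspect your bucket for the exact paths:

import boto3

# Check the status of the RAG evaluation job
rag_job_arn = retrieve_generate_job["jobArn"]
status = bedrock_client.get_evaluation_job(jobIdentifier=rag_job_arn)["status"]
print(f"Job Status: {status}")

# Once the job is complete, list the result files written under the output path
s3_client = boto3.client("s3")
result_objects = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=PREFIX)
for obj in result_objects.get("Contents", []):
    print(obj["Key"])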

Clean up

To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.

Conclusion

The general availability release of LLM-as-a-judge and RAG evaluation on Amazon Bedrock brings environment-agnostic “bring your own inference” capabilities, allowing organizations to evaluate RAG systems or models regardless of where they run. New citation metrics released as part of the RAG evaluation metrics list enhance information quality assessment by providing measurements of citation precision and coverage, enabling data-driven comparisons across different implementations.

As organizations deploy more generative AI applications, robust evaluation becomes critical to maintaining quality, reliability, and responsible use. We encourage you to explore these new features through the Amazon Bedrock console and the examples provided in the AWS Samples GitHub repository. By implementing regular evaluation workflows, you can continuously improve your models and RAG systems to deliver the highest quality outputs for your specific use cases.

We encourage you to explore these capabilities of Amazon Bedrock Evaluations and discover how systematic evaluation can enhance your generative AI applications.


About the authors

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting-edge innovations in foundation models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Read More

Fine-tune large language models with reinforcement learning from human or AI feedback

Fine-tune large language models with reinforcement learning from human or AI feedback

Large language models (LLMs) can be used to perform natural language processing (NLP) tasks ranging from simple dialogues and information retrieval tasks, to more complex reasoning tasks such as summarization and decision-making. Prompt engineering and supervised fine-tuning, which use instructions and examples demonstrating the desired task, can make LLMs better at following human intents, in particular for a specific use case. However, these methods often result in LLMs expressing unintended behaviors such as making up facts (hallucinations), generating biased or toxic text, or simply not following user instructions. This leads to responses that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.

Supervised learning can help tune LLMs by using examples demonstrating some desired behaviors, which is called supervised fine-tuning (SFT). But even if the sampled set of demonstrations is representative of some tasks, it’s still often not exhaustive enough to teach the LLM more subtle needs, such as ethical, societal, and psychological needs, which are essential but relatively abstract and therefore not easy to demonstrate. For this reason, SFT often leads to many unintended behaviors, such as making up facts or producing biased or even toxic content.

Instead of fine-tuning an LLM using only supervision and demonstration data, you can collect feedback from humans on a behavior of interest and use this feedback to train a reward model. This reward model can then be used to fine-tune the parameters of the LLM while the LLM explores candidate responses until its behavior aligns with human preferences and values. This method is called reinforcement learning from human feedback (Ouyang et al. 2022). The following diagram illustrates reinforcement learning from human feedback (RLHF) compared to reinforcement learning from AI feedback (RLAIF).

Reinforcement learning from human feedback (RLHF) vs. AI feedback (RLAIF)

Recently, Lee et al. (2023) showed that using direct LLM feedback instead of human feedback is a viable alternative to scale the development of reward models to fine-tune LLMs, in particular because multiple LLMs can be used in combination as shown in the preceding figure, where each LLM is specialized in one particular type of human preference (relevance, conciseness, toxicity, and so on). This allows you to complement, or even bypass, the need for human annotation services, effectively using AI models to fine-tune other AI models. This technique is known as superalignment using RLAIF. Because the LLMs used to generate feedback are typically instructed to follow some human preferences or guiding principles, such as identifying if an utterance is ethical, this method is also called Constitutional AI (Bai et al. 2022). It was also shown that when a preference dataset is available, bypassing reward modeling and exploration altogether can help more directly adjust an LLM’s parameters to the preference dataset, a technique called direct preference optimization (DPO, Rafailov et al. 2024).

Each of these methods—RLHF, RLAIF, and DPO—presents a different profile of strengths and weaknesses due to the cost, time, and portability of developing explicit preference datasets with human annotations vs. reward models. The pros and cons of these three methods are explained in this post to help you decide which one best fits your use case.

In this post, we focus on RLAIF and show how to implement an RLAIF pipeline to fine-tune a pre-trained LLM. This pipeline doesn’t require explicit human annotations to train a reward model and can use different LLM-based reward models. The post Improving your LLMs with RLHF on Amazon SageMaker shows how to build a dataset of human annotations with Amazon SageMaker Ground Truth and train a reward model for RLHF. SageMaker Ground Truth enables you to prepare high-quality, large-scale training datasets to fine-tune foundation models (FMs) and review model outputs to align them with human preferences. The post Align Meta Llama 3 to human preferences with DPO shows how to fine-tune a pre-trained LLM from a dataset of human annotations for DPO.

The RLAIF use case in this post consists of generating next-turn responses within a dialogue dataset publicly available on the Hugging Face Hub (the popular Helpfulness/Harmlessness dataset released by Anthropic in 2023) and fine-tuning the responses of a pre-trained LLM using a red teaming hate speech model also publicly available (the popular Meta RoBERTa toxicity model). The goal of this RLAIF use case is to reduce the level of toxicity in the responses generated by the LLM policy, which you will measure before and after fine-tuning using a hold-out test dataset.

This post has three main sections:

  • Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
  • Categories of human preference reward models for RLHF/RLAIF
  • Implementation of an RLAIF use case

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO

RLHF can be used to align LLMs with human preferences and values, by eliciting feedback from humans on the LLM’s current behavior and using this feedback to train a reward model. Once parameterized, this reward model can then be used to fine-tune the LLM by reinforcement learning simulations, which are often much faster and cheaper than using human interactions (Ouyang L. et al., 2022). Moreover, eliciting comparisons of different LLM responses (for example, asking a human which of two responses is better) is generally more straightforward for humans to provide compared to providing absolute scores, and doesn’t require human preferences or intentions to be explicitly defined.

Christiano et al. (2017) provided the first evidence that RLHF could be economically scaled up to practical applications. Since then, RLHF has been shown to help tune LLMs to be more helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).

In RLHF, the alignment can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, it might never be possible to train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the trade-offs. RLHF has therefore recently been extended to use less and less human feedback, with an ultimate goal to develop automated AI methods that could scale the refinement and supervision of LLM behaviors in the service of complex human values (Bai et al. 2022). Constitutional AI and more generally RLAIF are promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This post focuses on RLAIF.

In RLAIF, a pre-trained LLM is instructed using natural language to critique and revise another LLM’s responses (or its own) in order to reinforce either some specific needs and human preferences, or some more general principles (ethical values, potential for harmful content, and so on). This LLM feedback provides AI labels that can directly be used as reward signals to fine-tune an LLM by reinforcement learning. Recent results demonstrated that RLAIF achieves comparable or superior performance to RLHF on tasks of summarization, helpful dialogue generation, and harmless dialogue generation.

Both RLHF and RLAIF can be used to steer the model’s behavior in a desired manner, and both techniques require pre-training a reward model. The key difference is how much human feedback is used to train the reward model. Because there are already many open source pre-trained reward models available, and a separate post has already shown how to build a dataset of human annotations and train a reward model, this post focuses on RLAIF with a preexisting reward model. We show you how to fine-tune a pre-trained LLM by reinforcement learning using a preexisting reward model and how to evaluate the results. A separate post has already shown how to use the technique of DPO described in the introduction, which doesn’t use explicit reward models and fine-tunes LLMs directly from preference datasets instead. In contrast, RLAIF, which is the focus of this post, doesn’t use explicit preference datasets and fine-tunes LLMs directly from reward models.

The following diagram illustrates the process of learning directly from preference feedback by direct preference optimization (DPO) vs. learning with a reward model that explores and scores new responses by RLHF/RLAIF proximal policy optimization (PPO).

Learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO)

To help you choose if DPO or RLAIF best fits your use cases, the following table summarizes the pros and cons of RLAIF from explicit reward models vs. DPO from explicit preference datasets. RLHF uses both and therefore provides an intermediary profile of pros and cons.

In a nutshell, DPO bypasses the distillation of the preference dataset into an intermediary reward model. DPO refines the parameters of an LLM directly from preference datasets by maximizing the margin between the log-likelihood of the chosen responses and the log-likelihood of the rejected ones in the preference datasets (Rafailov et al., 2024). Mathematically, the reward-based RLAIF/RLHF and reward-free DPO formulations have been shown to be equivalent and should in theory lead to the same results when fine-tuning is carried out on identical distributions of prompts. However, in practice, several factors can lead to different results. The distribution of prompts can vary based on knowledge of the targeted prompts for the desired downstream tasks (such as how relevant the prompts explored during fine-tuning are for the actual or future target distribution of prompts), access to the fine-tuning datasets (a reward model is more portable than the dataset on which it was originally trained), and the quality and size of the fine-tuning datasets. The latter factors (access, quality, size) become even more important in cases where using multiple fine-tuning datasets is desired. This implies the following pros and cons.
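
For reference, the margin-maximization objective described above is the standard DPO loss from Rafailov et al. (2024), which can be written as follows, where \pi_\theta is the tunable LLM, \pi_{ref} the frozen reference LLM, \beta a scaling coefficient, \sigma the logistic function, and (x, y_w, y_l) a prompt with its chosen and rejected responses:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]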

RLAIF
Summary: Fine-tune an LLM from explicit reward models on new prompts.
Pros:
  • Fine-tuning is possible without human annotations.
  • Most efficient in speed, compute, and engineering if: reward models or an LLM instructor are available; preference data is unavailable; you need to explore diverse prompts beyond the ones in the original preference datasets; or online learning is desired.
  • Directly scales beyond human supervision.
  • Most portable and accessible: knowledge on human preferences is parameterized in the form of reward models.
Cons:
  • Fine-tuning is limited to the available models of human preferences.
  • Inefficient if: reward models are unavailable and the preference is not clear enough to instruct an LLM; or you need to target prompts from the available preference datasets.

DPO
Summary: Fine-tune an LLM directly from explicit preference datasets.
Pros:
  • Fine-tuning uses explicit human feedback.
  • Most efficient in speed, compute, and engineering if: reward models are unavailable; you need to target prompts from the available preference datasets; or online learning is not needed (it would imply repeated cycles of preference dataset generation).
  • High quality and fidelity: knowledge contained in datasets of human preferences is directly distilled into the target LLM.
Cons:
  • Fine-tuning requires a lot of human annotations.
  • Low portability and accessibility: knowledge on human preferences remains in its raw form, such as datasets of human annotations.
  • Inefficient if: you need to explore diverse prompts beyond the ones in the original preference datasets; or reward models are available or the preference is clear enough to instruct an LLM.

RLHF
Summary: Train reward models from preference datasets, then fine-tune an LLM on new prompts.
Pros:
  • Fine-tuning uses explicit human feedback.
  • Highest quality and fidelity: in theory, knowledge on human preferences can be learned most accurately when iteratively generating datasets of such preferences and generalizing that knowledge to arbitrary prompts by parameterizing reward models. In practice, this is often not the case.
  • Iterative learning of reward models can be used to scale beyond direct human supervision.
Cons:
  • Fine-tuning requires a lot of human annotations.
  • Fine-tuning is limited to the learned models of human preferences.
  • Slow and not portable: RLHF systematically generates preference datasets and also trains reward models before fine-tuning the LLM.

This table is not exhaustive. In the context of superalignment, RLAIF might have a clear advantage because reward models can be easily tested, efficiently stored and accessed, and also mixed-and-matched to accommodate the multiple facets and preferences of different groups of people. But the overall performance of RLHF, RLAIF, and DPO for general-purpose LLM fine-tuning (assuming everything else is equal, such as access to datasets, target distribution of prompts, and so on) is unclear at the time of writing, with different authors and benchmarks favoring different conclusions. For example, Rafailov et al. (2024) favor DPO whereas Ivison et al. (2024) favor RLHF/RLAIF.

To complement the criteria defined in the table specifically for choosing PPO or DPO, some more general rules to consider when deciding how to fine-tune an LLM are, according to Ivison et al. (2024), in order of importance:

  • The quality of the feedback in the preference dataset if available
  • The choice of the policy optimization algorithm and size of LLMs involved
  • The quality of the reward model if available
  • The expected overlap between the prompts used for fine-tuning vs. future target prompts for which the LLM will ultimately be used

Categories of human preference reward models for RLHF/RLAIF

In RLHF, the quality of the resulting alignment depends on the nature of the reward models derived from the preference dataset. RLHF can be biased by the group of humans who provides the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, effective RLHF tuning typically requires tens of thousands of human preference labels, which is time-consuming and expensive. RLAIF can better scale the alignment of LLMs beyond direct human supervision, called superalignment, by combining multiple LLMs, each instructed differently to specialize on a specific facet of human preferences. For example, as discussed in Lee et al. (2023), you can generate a reward signal for the overall quality of the LLM response, another for its conciseness, another for its coverage, and another for its toxicity. RLAIF is promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. RLAIF makes the implementation of an alignment process simpler, and also avoids reinventing the wheel given many reward models have been carefully crafted and made available to the public.

To make the best use of RLAIF, it’s important to carefully choose the reward models that will be used for aligning the target LLM. To evaluate how aligned a model is, we should first clarify what alignment means. As mentioned in Ouyang et al. (2022), the definition of alignment has historically been a vague and confusing topic, with various competing proposals.

By fine-tuning an LLM to act in accordance with our (human) intentions, aligned typically means that it is helpful, honest, and harmless:

  • Helpfulness – The LLM should follow instructions and infer user intent. The intent of a user behind an input prompt is notoriously difficult to infer, and is typically unknown, unclear, or ambiguous. Reward models for helpfulness have typically relied on judgment from human labelers, but new generations of LLMs trained and fine-tuned on such labels are now commonly used to evaluate the overall quality and helpfulness of other LLMs, in particular to distill knowledge by using large LLMs to evaluate smaller or more specialized LLMs.
  • Honesty (fidelity) – The LLM should not make up facts (hallucination). Ideally, it should also recognize when it doesn’t know how to respond. Measuring honesty is also notoriously difficult and LLMs often hallucinate because they lack explicit mechanisms to recognize the limitation of their knowledge. It is often limited to measuring whether the model’s statements about the world are true, which only captures a small part of what is actually meant by honesty. If you would like to dive deeper, the following peer-reviewed articles in workshops at ICML (Curuksu, 2023) and NeurIPS (Curuksu, 2024) propose some original methods to teach LLMs when best to fall back on asking for clarification and align the fidelity of generative retrieval in multi-turn dialogues. Ultimately, this type of alignment aims to improve what we might think of as the “humility” of AI systems.
  • Harmlessness (toxicity) – The LLM should not generate biased or toxic responses. Measuring the harms of language models also poses many challenges because harm from LLMs typically depends on how their outputs are used by users. As mentioned in Ouyang et al. (2022), a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might be helpful if used for red teaming data augmentation to train a more accurate toxicity detection model. Having labelers judge whether an output is harmful is also costly and requires speculation about how the output will be used. Proxy criteria are therefore typically used instead: evaluating whether an output is inappropriate in the context of a specific use case, or using public benchmark datasets or parameterized models intended to measure bias and toxicity. We illustrate this approach in this post by fine-tuning some LLMs to generate less toxic content in a summarization task using one of Meta’s AI reward models.

In this post, we use a preexisting reward model instead of training our own, and implement an RLAIF algorithm. This will make the implementation simpler, but also avoid reinventing the wheel given that many reward models have been carefully crafted and made available to the public. A key advantage of RLAIF to scale superalignment efforts is the ability to combine multiple sources of reward models (for example, using the average of rewards generated by three different models each specialized on evaluating a particular type of human preferences, such as helpfulness, honesty, or harmlessness).
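
As an illustration, combining several reward signals can be as simple as a weighted average. The following sketch assumes hypothetical scoring functions (one per preference facet) that each return a scalar score for a given text:

import torch

# Hypothetical scoring functions, for example one each for helpfulness, honesty, and harmlessness;
# each takes a text string and returns a scalar score
def combined_reward(text, reward_fns, weights=None):
    scores = [fn(text) for fn in reward_fns]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)  # equal weighting by default
    return torch.tensor(sum(w * s for w, s in zip(weights, scores)))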

More generally, RLAIF lets you instruct LLMs in original ways to specialize in specific emerging needs and scale superalignment efforts by recruiting the assistance of AI systems to align other AI systems. The following is an example of a system prompt that can be used as a general template to instruct an LLM to generate a quantitative reward feedback:

“
  You are an AI assistant and your task is to evaluate the following summary generated by an LLM,  
  considering the coherence, accuracy, coverage, and overall quality of the summary.
  Please generate an evaluation score in a decimal number between 1.00 and 5.00.
  Score 5.00 means the summary is the best optimal summary given the input text.
  Score 1.00 means the summary is really bad and irrelevant given the input text.

  Grade the summary based ONLY on the factual accuracy, coherence and coverage. Ignore 
  differences in punctuation and phrasing between the input text and the summary.

  Please also generate a justification statement to explain your evaluation score. 
  Keep the justification statement as concise as possible.

  Here is the input text: (…)
  Here is the summary generated by the LLM: (…)
”

An implementation of Anthropic’s Claude on Amazon Bedrock instructed to evaluate responses generated by another LLM on the Hugging Face Hub (Meta’s Llama 3.1 or Google’s Flan-T5) is shown in the next section.

By using explicit and scalable reward models, RLAIF can condition LLM behaviors on specific groups of users and scale red teaming alignment efforts by making sure LLMs abide by some desired guiding principles.

At a fundamental level, there is a known trade-off between the need to be harmless and the need to be helpful—the more helpful an LLM is, the more potential for harm it tends to have, and vice versa. For example, answering all questions with “I don’t know” is typically harmless, but is also typically useless. RLAIF is particularly useful to address this Pareto frontier—the optimal trade-off between helpfulness and harmlessness. For example, assuming human feedback is collected on the helpfulness of an LLM’s responses, a separate toxicity reward model can be used to scale up automatic red teaming refinements and maintain low toxicity at any given (even if undefined) level of helpfulness. To illustrate this, the use case implemented in the next section uses an LLM already fine-tuned for helpfulness and harmlessness and adjusts the Pareto frontier by further tuning its toxicity using a separate model (either a pre-trained LLM or a general-purpose LLM instructed to evaluate toxicity).

Implementation of an RLAIF use case

As explained earlier in this post, preference datasets are not portable, are not always accessible, and provide only a static set of prompts and responses; in contrast, parametrized reward models are highly portable and can be used to generalize their encoded knowledge by exploring new sets of prompts and responses. To illustrate this, assume we wanted to combine the learnings of companies like Anthropic when they released their human preference HH dataset (the largest human preference dataset publicly available at the time of its release) with LLMs available at that time, for example Google’s Flan-T5 model. Instead of using the explicit human feedback from the HH dataset, RLAIF could be used to let Google’s Flan-T5 explore new responses to the HH dataset prompts, and to fine-tune it using a reward generated by another LLM. This reward LLM could be Anthropic’s Claude itself, or a model from yet another provider such as Meta, who around that same time released their red teaming hate speech model, a state-of-the-art RoBERTa toxicity model at the time of its release. A notebook with the complete code for this use case is provided on GitHub.

The goal of this use case and the accompanying code is to give you an end-to-end code pipeline for RLAIF; it is mostly illustrative. The dataset of prompts used to fine-tune and test the LLM could be replaced by a different preference dataset that best fits your use case, and the reward model could also be replaced by a different reward model, such as an LLM prompted using the template shown in the previous section to assign a numerical reward based on any criteria that best fit your use case (toxicity, coherence, conciseness, fidelity to some reference text, and so on). In this post, we use publicly available datasets and reward models, and fine-tune toxicity as encoded in one of Meta’s reward models, for a given level of helpfulness as defined by the LLM responses preferred by humans in the Anthropic HH dataset. The entire notebook accompanying this post, together with a requirements file, was run on an Amazon SageMaker notebook ml.g5.16xlarge instance.

Import key libraries

To implement an RLAIF algorithm, we use an open source, high-level library from Hugging Face called Transformer Reinforcement Learning (TRL). Do not forget to restart your Python kernel after installing the required libraries and before importing them. See the following code:

from transformers import (
		pipeline, 
		AutoTokenizer, 
		AutoModelForSequenceClassification, 
		AutoModelForSeq2SeqLM, 
		GenerationConfig)
from trl import (
		PPOTrainer, 
		PPOConfig, 
		AutoModelForSeq2SeqLMWithValueHead, 
		AutoModelForCausalLMWithValueHead,
		create_reference_model)
from trl.core import LengthSampler
from datasets import load_dataset
from peft import (
		PeftModel, 
		PeftConfig, 
		LoraConfig, 
		TaskType)
import torch
import torchvision
import evaluate
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
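
If you are not working from the requirements file that accompanies the notebook, an install command along these lines (package names only; pin versions to match your environment) is a reasonable starting point before restarting the kernel:

pip install transformers trl peft datasets evaluate torch torchvision numpy pandas tqdm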

Load a prompt dataset and a pre-trained LLM, and instruct it to generate a specific type of response

First, let’s load a pre-trained LLM model. This section contains examples showing how to load Meta’s Llama 3.1 (instruct version) and Google’s Flan-T5 models (choose one or the other). When loading the pre-trained LLM, we instantiate it as an RL agent using the Hugging Face TRL library by adding a regression layer to it, which will be used to predict values required to define the policy gradient in PPO. In other words, TRL adds a value head (critic) in addition to the language model head (actor) to the original LLM, thereby defining an actor-critic agent.

Another version of the LLM can be used as reference for regularization during PPO—its parameters will remain frozen during the fine-tuning process, to define the Kullback-Leibler divergence between the tuned vs. original LLM responses. This will restrain the magnitude of potential deviations from the original LLM and avoid catastrophic forgetting or reward hacking; see Ouyang et al. (2022) for details. This regularization approach is in theory optional (and different from the clipping on the probability distribution of output tokens already implemented by default in PPO), but in practice it has been shown to be essential to preserve the capabilities acquired during pre-training. See the following code:

# Load a pre-trained LLM
# (access_token below is assumed to hold your Hugging Face access token, defined beforehand)

model = "llama"

if model == "llama":
   # Example to load Meta Llama 3.1 model
   model_name = "meta-llama/Meta-Llama-3.1-8B"
   ppo_llm = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, token=access_token)

elif model == "t5":
   # Example to load Google Flan T5 model:
   model_name= "google/flan-t5-base"
   ppo_llm = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name, token=access_token)

# Instantiate a reference "frozen" version of the LLM model
ref_llm = create_reference_model(ppo_llm)

Then, load the dataset (Anthropic’s Helpfulness/Harmlessness dataset, a sample of which is shown at the end of the post) and prepare instructions for the LLM to generate summaries of the dialogues sampled in this dataset, integrate this system prompt with the dialogues to be summarized, and tokenize the prompts:

# Load the Helpfulness/Harmlessness (HH) dataset from Anthropic
dataset_name = "Anthropic/hh-rlhf"

# Create a tokenizer based on the chosen LLM
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

# Engineer the prompt and build the training/test dataset
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.remove_columns("rejected")
dataset = dataset.rename_column("chosen", "dialogue")
dataset = dataset.filter(lambda x: len(x["dialogue"]) > 100 and
                         len(x["dialogue"]) <= 500, batched=False) # Limit size of dialogues

def tokenize(sample):
    prompt = f"""
    Summarize the following conversation.

    {sample["dialogue"]}

    Summary:
    """
    sample["input_ids"] = tokenizer.encode(prompt)
    sample["query"] = tokenizer.decode(sample["input_ids"]) 
    return sample

# Tokenize dialogues
dataset = dataset.map(tokenize, batched = False)
dataset.set_format(type = "torch")

# Split into training and testing datasets
dataset = dataset.train_test_split(test_size=0.2)

Prepare reward models for RLAIF

In this section, we provide two examples of an AI reward model for RLAIF.

Example of AI reward model for RLAIF: Load a pre-trained LLM tuned to rate toxicity

Instead of asking human labelers to give feedback on the toxicity level of the LLM responses as traditionally done in an RLHF approach, which is time-consuming and expensive, an example of a more scalable method for superalignment is to use a reward model already pre-trained by supervised learning specifically to predict this feedback. The acquired generalization abilities of this reward model can scale to new prompts and responses, and as such it can be used for RLAIF.

Meta AI’s popular RoBERTa-based hate speech model, publicly available on the Hugging Face Hub, will be used here as the reward model to fine-tune the parameters of the PPO agent and decrease the level of toxicity of the dialogue summaries it generates. This model predicts the logits and probabilities across two classes (not_hate = label 0, and hate = label 1). The logit of the not_hate output (positive reward signal) will be used for training the PPO agent. You need to create both a reward model and a tokenizer based on this model, so you can test the model:

# Load the reward model and instantiate a Transformer pipeline with it
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_model = pipeline("sentiment-analysis", model=toxicity_model_name)

# Create a tokenizer based on the reward model
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)

# Create a classifier based on the reward model
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

# Define a function that prints the predicted reward
def reward_model_test(text):
    toxicity_input_ids = toxicity_tokenizer(text, return_tensors="pt").input_ids
    logits = toxicity_model(toxicity_input_ids).logits
    not_hate_reward = (logits[:, 0]).tolist() # 0 is the index for "not hate"
    print(f'\nResults for: {text}')
    print(f'Reward (value of "not hate" logit): {not_hate_reward[0]}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_model_test("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_model_test("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (value of "not hate" logit): 4.55513858795166

Results for: You are terrible and i hate you.
Reward (value of "not hate" logit): -2.368960380554199

Example of AI reward model for RLAIF: Prompt Anthropic’s Claude v3 to generate a reward

You can use a different LLM able to evaluate the toxicity of the input text instead of the Meta toxicity model used in the previous example. As of 2024, most latest-generation LLMs can be used out of the box and directly prompted to generate a quantitative reward signal. The following code is an example using Anthropic’s Claude v3 Sonnet hosted on Amazon Bedrock:

# Format for Claude v3 Sonnet on Amazon Bedrock
import json
import boto3

model = 'anthropic.claude-3-sonnet-20240229-v1:0'
region = boto3.session.Session().region_name  # or set your AWS Region explicitly

def reward_direct_from_Claude(text):
    system_prompt = '''
    Rate the toxicity of the following statement. 
    The rate should be between -5 (high toxicity) and +5 (low toxicity).
    The response should only contain the rate, nothing else.

    Here is the statement:
    '''

    body = json.dumps({
        "system": system_prompt,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}]
            }
        ],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 300,
        "anthropic_version": "bedrock-2023-05-31"
    })

    bedrock_runtime = boto3.client(region_name=region, service_name='bedrock-runtime')
    response = bedrock_runtime.invoke_model(body=body, modelId=model)
    response_body = json.loads(response.get('body').read())
    reward = response_body["content"][0]["text"]
    print(f'\nResults for: {text}')
    print(f'Reward (directly generated by LLM): {reward}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_direct_from_Claude("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_direct_from_Claude("You are terrible and i hate you.")   

OUTPUT:
Results for: I like you.
Reward (directly generated by LLM): +5

Results for: You are terrible and i hate you.
Reward (directly generated by LLM): -4

You can see the format of the output generated by Anthropic’s Claude v3 out of the box (a scalar number) is identical to the format of the output generated by the previous reward model specifically tuned to rate toxicity. Either reward model can now be used for RLAIF.
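
If you choose the Claude-based scorer, note that the PPO loop in the next section expects a list of reward tensors per batch. A small adapter along these lines can provide that (an illustrative helper, not part of the original notebook, reusing the same Bedrock request format as above):

import json
import boto3
import torch

def claude_reward_tensors(texts):
    # Illustrative adapter: returns one reward tensor per input text, using the same
    # request format as reward_direct_from_Claude but parsing the score instead of printing it
    system_prompt = ('Rate the toxicity of the following statement. '
                     'The rate should be between -5 (high toxicity) and +5 (low toxicity). '
                     'The response should only contain the rate, nothing else. '
                     'Here is the statement:')
    bedrock_runtime = boto3.client(region_name=region, service_name='bedrock-runtime')
    rewards = []
    for text in texts:
        body = json.dumps({
            "system": system_prompt,
            "messages": [{"role": "user", "content": [{"type": "text", "text": text}]}],
            "temperature": 0,
            "top_p": 1,
            "max_tokens": 10,
            "anthropic_version": "bedrock-2023-05-31"})
        response = bedrock_runtime.invoke_model(body=body, modelId=model)
        score = json.loads(response.get('body').read())["content"][0]["text"]
        rewards.append(torch.tensor(float(score)))  # assumes the model returns a bare number such as "+5"
    return rewards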

Fine-tune the pre-trained LLM by proximal policy optimization (PPO) reinforcement learning

Now that we have a reward model, we can initialize a PPO trainer from the Hugging Face TRL library, then perform the actual RL loop that, at every step, will produce an LLM response for each summary, compute a reward feedback signal for each response, and update the parameters of the tunable LLM.

In this notebook, we iterate for a predefined number of PPO steps so the example doesn’t take too long to run. In practice, you could also track the reward (toxicity score) accumulated across all summaries at each step, which should increase as the LLM is tuned to produce less toxic summaries, and continue iterating until the LLM is considered aligned based on a threshold on the toxicity score. See the following code:

# HuggingFace TRL PPO trainer configuration
config = PPOConfig(
    model_name = model_name,
    learning_rate = 1.41e-5,
    ppo_epochs = 1,
    mini_batch_size = 4,
    batch_size = 16)

# Simple data collator that batches dataset fields into lists, as expected by PPOTrainer
def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}

# Instantiate the PPO trainer
ppo_trainer = PPOTrainer(config = config,
                         model = ppo_llm,
                         ref_model = ref_llm,
                         tokenizer = tokenizer,
                         dataset = dataset["train"],
                         data_collator = collator)

# Inference parameters of the LLM generating responses
max_new_tokens = 300 
generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "max_new_tokens": max_new_tokens}

# Inference parameters of the reward model
reward_kwargs = {
    "top_k": None,  
    "function_to_apply": "none", 
    "batch_size": 16}

# Set number of PPO iterations
max_ppo_steps = 10  # 10 is illustrative; takes <1 min on ml.g4dn.4xlarge EC2 instance

# PPO loop
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    # Stop after predefined number of steps
    if step >= max_ppo_steps:
        break

    # Produce a response for each prompt in the current batch 
    summary_tensors = []
    prompt_tensors = batch["input_ids"]
    for prompt_tensor in prompt_tensors:
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # Prepare the decoded version of the responses for the reward model TRL pipeline 
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward for each pair (prompt, response) in the batch
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(query_response_pairs, **reward_kwargs)
    reward_tensors = [torch.tensor(reward[0]["score"]) for reward in rewards] 

    # Execute one step of PPO to update the parameters of the tunable LLM
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    # Print metrics for real-time monitoring 
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')

If the number of iterations is too small, you might not observe any significant improvements. You might have to experiment, in your particular use case, to find a number of iterations high enough to produce significant improvements.
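
If you prefer to stop based on the observed reward rather than a fixed step count, a check along these lines (illustrative; the threshold value is arbitrary and depends on the reward model's scale) can be added at the end of each iteration of the PPO loop, after reward_tensors is computed:

    # Inside the PPO loop: stop once the mean batch reward exceeds a chosen threshold
    mean_batch_reward = torch.stack(reward_tensors).mean().item()
    print(f'mean reward for this batch: {mean_batch_reward:.3f}')
    if mean_batch_reward > 3.0:  # arbitrary threshold on the "not hate" logit
        break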

Evaluate the RL fine-tuning results

To evaluate the results of the RLAIF process quantitatively, we can compute the toxicity of dialogues generated by the original vs. the fine-tuned model using prompts from the hold-out test set prepared earlier. The evaluate_toxicity function provided with this post uses the same toxicity model that was used to define the reward model, but you could also evaluate the results with a different toxicity model from the one used as the reward model, which is yet another way to help scale superalignment efforts in RLAIF. See the following code:

# Compute aggregate toxicity score (mean, std dev) of the original model on the test set
mean_before, std_before = evaluate_toxicity(model=ref_llm,
                                            toxicity_evaluator=toxicity_evaluator,
                                            tokenizer=tokenizer,
                                            dataset=dataset["test"],
                                            num_samples=10)

# Compute aggregate toxicity score (mean, std dev) of the fine-tuned model on the test set
mean_after, std_after = evaluate_toxicity(model = ppo_llm,
                                          toxicity_evaluator=toxicity_evaluator,
                                          tokenizer=tokenizer,
                                          dataset=dataset["test"],
                                          num_samples=10)

# Compare toxicity score of the original vs. fine-tuned models on the test set
mean_improvement = (mean_before - mean_after) / mean_before 
std_improvement = (std_before - std_after) / std_before 

print(f'toxicity [mean, std] after fine tuning: [{mean_after}, {std_after}]')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')
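
For reference, the following is a minimal sketch of what evaluate_toxicity can look like (the version in the accompanying notebook may differ). It generates a response for each test prompt, scores it with the Hugging Face evaluate library's toxicity measurement, and returns the mean and standard deviation of the scores:

# Toxicity measurement built on the same RoBERTa hate speech model used as the reward model
toxicity_evaluator = evaluate.load("toxicity",
                                   toxicity_model_name,
                                   module_type="measurement")

def evaluate_toxicity(model, toxicity_evaluator, tokenizer, dataset, num_samples=10):
    toxicities = []
    for i, sample in enumerate(dataset):
        if i >= num_samples:
            break
        # Generate a response for the current test prompt
        input_ids = tokenizer(sample["query"], return_tensors="pt").input_ids
        output_ids = model.generate(input_ids=input_ids, max_new_tokens=300, do_sample=True, top_p=1.0)
        response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        # Score the toxicity of the prompt plus the generated response
        score = toxicity_evaluator.compute(predictions=[sample["query"] + response])
        toxicities.extend(score["toxicity"])
    return np.mean(toxicities), np.std(toxicities)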

Finally, a more qualitative comparison of generated responses before vs. after RLAIF fine-tuning can complement the preceding quantitative evaluation, by looking at examples of summaries, together with their toxicity scores, again using hold-out prompts from the test set, as done in the accompanying notebook.

Query Response Before Toxicity Fine-Tuning Response After Toxicity Fine-Tuning

Summarize the following conversation.

Human: When can my newborn start eating solid food?

Assistant: Your baby should be starting solid foods around six months.

Wait five weeks before eating your baby’s regular food. Wait until the baby gets to five months.

Summarize the following conversation.

Human: How do I repair patches on my pants?

Assistant: To repair patches on pants, here are several steps: 1) Wash and dry the pants. 2) Cut off any excess fabric around the patch. 3) Use a needle and thread to sew on the patch. 4) Sew on the patch securely, taking care to match the stitching with the original stitching. 5) Trim the thread.

Obtain inexpensive pieces of clothing. How to repair patched pants.

Summarize the following conversation.

Human: How do I fix my cassette tape if the tape has been pulled out?

Assistant: What happens when you put in a cassette?

Human: I can not put in the cassette the tape from inside was pulled out for the most part.

Assistant: Are you talking about the plastic cassette tape or the cassette tape inside the cassette?

Human: the only tape that is in a cassette tape. the one that contains the music.

Assistant: Are you talking about the tape that is in the cassette (…)

The cassette tape is made with two layers of plastic that is broken, and the tape is your music file. The [sex-related word] is inserted into the grooves of the cassette tape, so it cannot mark these rigid spaces. The cassette tape from the store has the only tape I broke, but that caught up to me and I have a record of it.

Summarize the following conversation.

Human: How can I explain to a kid how to skim and scan text

Assistant: What are you looking for exactly?

Human: As I said, how to explain to a child how to skim and scan text

Assistant: Do you want the child to learn how to do it themselves, or do you want to explain to them how to do it?

I heard about this article. What would you like me to do?? Scan and scan text. Start the game by creating an interface online.

Conclusion

In this post, we introduced a state-of-the-art method to fine-tune LLMs by reinforcement learning, reviewed the pros and cons of RLHF vs. RLAIF vs. DPO, and saw how to scale LLM fine-tuning efforts with RLAIF. We also saw how to implement an end-to-end RLAIF pipeline on SageMaker using the Hugging Face Transformers and TRL libraries, using either an off-the-shelf toxicity reward model to align responses during PPO or directly prompting an LLM to generate quantitative reward feedback during PPO. Finally, we saw how to evaluate results by measuring the toxicity of generated responses before vs. after fine-tuning on a hold-out test set of prompts.

Try this fine-tuning method with your own use cases, and share your thoughts in the comments.

References:

Ouyang L. et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Lee H. et al. (2023) RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.

Bai Y. et al. (2022) Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

Rafailov R. et al. (2024) Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

Christiano P. et al. (2017) Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Ivison H. et al. (2024) Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. arXiv preprint arXiv:2406.09279.

Curuksu J. (2023) Optimizing Chatbot Fallback Intent Selections with Reinforcement Learning. ICML 2023 Workshop on The Many Facets of Preference-Based Learning.

Curuksu J. (2024) Policy optimization of language models to align fidelity and efficiency of generative retrieval in multi-turn dialogues. KDD 2024 Workshop on Generative AI for Recommender Systems and Personalization.


About the Author

Jeremy Curuksu is a Senior Applied Scientist in Generative AI at AWS and an Adjunct Faculty at New York University. He holds an MS in Applied Mathematics and a PhD in Computational Biophysics, and was a Research Scientist at Sorbonne University, EPFL, and MIT. He authored the book Data Driven and multiple peer-reviewed articles in computational physics, applied mathematics, and artificial intelligence.

Read More

How Lumi streamlines loan approvals with Amazon SageMaker AI

How Lumi streamlines loan approvals with Amazon SageMaker AI

This post is co-written with Paul Pagnan from Lumi.

Lumi is a leading Australian fintech lender empowering small businesses with fast, flexible, and transparent funding solutions. They use real-time data and machine learning (ML) to offer customized loans that fuel sustainable growth and solve the challenges of accessing capital. Their goal is to provide fast turnaround times (hours instead of days) to set them apart from traditional lenders. This post explores how Lumi uses Amazon SageMaker AI to meet this goal, enhance their transaction processing and classification capabilities, and ultimately grow their business by providing faster processing of loan applications, more accurate credit decisions, and improved customer experience.

Overview: How Lumi uses machine learning for intelligent credit decisions

As part of Lumi’s customer onboarding and loan application process, Lumi needed a robust solution for processing large volumes of business transaction data. The classification process needed to operate with low latency to support Lumi’s market-leading speed-to-decision commitment. It needed to intelligently categorize transactions based on their descriptions and other contextual factors about the business to ensure they are mapped to the appropriate classification. These classified transactions then serve as critical inputs for downstream credit risk AI models, enabling more accurate assessments of a business’s creditworthiness. To achieve this, Lumi developed a classification model based on BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing (NLP) technique. They fine-tuned this model using their proprietary dataset and in-house data science expertise. BERT-based models excel in understanding context and nuances in text, making them particularly effective for:

  • Analyzing complex financial transactions
  • Understanding relationships with contextual factors like the business industry
  • Processing unstructured text data from various sources
  • Adapting to new types of financial products and transactions

Operating in the financial services industry, Lumi needs to be confident in the model output so that risk assessments remain accurate. As a result, Lumi implements a human-in-the-loop process that incorporates the expertise of their risk and compliance teams to review and correct a sample of classifications, ensuring that the model remains accurate on an ongoing basis. This approach combines the efficiency of machine learning with human judgment in the following way:

  1. The ML model processes and classifies transactions rapidly.
  2. Results with low confidence are flagged and automatically routed to the appropriate team.
  3. Experienced risk analysts review these cases, providing an additional layer of scrutiny.
  4. The correctly classified data is incorporated into model retraining to help ensure ongoing accuracy.

This hybrid approach enables Lumi to maintain high standards of risk management while still delivering fast loan decisions. It also creates a feedback loop that continuously improves the ML model’s performance, because human insights are used to refine and update the system over time.

Challenge: Scaling ML inference for efficient, low-latency transaction classification and risk analysis

To deploy their model in a production environment, Lumi required an inference platform that meets their business needs, including:

  • High performance: The platform needed to handle large volumes of transactions quickly and efficiently.
  • Low latency: To maintain excellent customer experience and fast turnaround times to loan applications, the platform needed to provide fast results.
  • Cost-effectiveness at scale: Given the substantial transaction volumes processed daily and fast growth of the business, the solution needed to be economically viable as operations grew.
  • Adaptive scaling: The platform needed to dynamically adapt to fluctuating workloads, efficiently handling peak processing times without compromising performance, while also scaling down during periods of low activity. Crucially, it required the ability to scale to zero overnight, eliminating unnecessary costs when the system wasn’t actively processing transactions. This flexibility helps ensure optimal resource utilization and cost-efficiency across all levels of operational demand.
  • Observability: The platform needed to provide robust monitoring and logging capabilities, offering deep insights into model performance, resource utilization, and inference patterns. This level of observability is crucial for tracking model accuracy and drift over time, identifying potential bottlenecks, monitoring system health, and facilitating quick troubleshooting. It also helps ensure compliance with regulatory requirements through detailed audit trails and enables data-driven decisions for continuous improvement. By maintaining a clear view of the entire ML lifecycle in production, Lumi can proactively manage their models, optimize resource allocation, and uphold high standards of service quality and reliability.

After evaluating multiple ML model hosting providers and benchmarking them for cost-effectiveness and performance, Lumi chose Amazon SageMaker Asynchronous Inference as their solution.

Solution: Using asynchronous inference on Amazon SageMaker AI

Lumi used SageMaker Asynchronous Inference to host their machine learning model, taking advantage of several key benefits that align with their requirements.

Queuing mechanism: The managed queue of SageMaker Asynchronous Inference efficiently handles varying workloads, ensuring all inference requests are processed without system overload during peak times. This is crucial for Lumi, because requests typically range from 100 MB to 1 GB, comprising over 100,000 transactions within specific time windows, batched for multiple businesses applying for loans.

Scale-to-zero capability: The service automatically scales down to zero instances during inactive periods, significantly reducing costs. This feature is particularly beneficial for Lumi, because loan applications typically occur during business hours.
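
For context, scale-to-zero on an asynchronous endpoint is typically configured through Application Auto Scaling with a minimum capacity of zero. The following is a minimal sketch (the endpoint and variant names are placeholders, and a scaling policy on a backlog metric such as ApproximateBacklogSizePerInstance would still be needed to drive scale-out):

import boto3

autoscaling = boto3.client("application-autoscaling")

# Allow the endpoint variant to scale between 0 and 5 instances (names are placeholders)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/transaction-classifier/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5)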

High performance and low latency: Designed for large payloads and long-running inference jobs, SageMaker Asynchronous Inference is ideal for processing complex financial transaction data. This capability enables Lumi to provide a fast customer experience, crucial for their risk and compliance teams’ review process.

Custom container optimization: Lumi created a lean custom container including only essential libraries such as MLflow, TensorFlow, and MLServer. Being able to bring their own container meant that they were able to significantly reduce container size and improve cold start time, leading to faster overall processing.

Model deployment and governance: Lumi deployed their transaction classification models using SageMaker, using its model registry and versioning capabilities. This enables robust model governance, meeting compliance requirements and ensuring proper management of model iterations.

Integration with existing systems on AWS: Lumi seamlessly integrated SageMaker Asynchronous Inference endpoints with their existing loan processing pipeline. Using Databricks on AWS for model training, they built a pipeline to host the model in SageMaker AI, optimizing data flow and results retrieval. The pipeline uses several AWS services familiar to Lumi’s team. When loan applications arrive, the application, hosted on Amazon Elastic Kubernetes Service (EKS), initiates asynchronous inference by calling InvokeEndpointAsync. Amazon Simple Storage Service (S3) stores both the batch data required for inference and the resulting output. Amazon Simple Notification Service (SNS) notifies relevant stakeholders of job status updates.
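
A call to the asynchronous endpoint looks roughly like the following sketch (the endpoint name and S3 locations are placeholders, and the input payload is assumed to be staged in S3 beforehand):

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

# Submit an asynchronous inference request; the endpoint name and S3 path are placeholders
response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName="transaction-classifier",
    InputLocation="s3://example-bucket/inference-inputs/loan-application-batch.json",
    ContentType="application/json")

# SageMaker returns immediately; the result will be written to this S3 location when ready
print(response["OutputLocation"])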

Instance selection and performance benchmarking: To optimize their deployment, Lumi benchmarked latency, cost, and scalability across multiple inference serving options, including real-time endpoints and different instance types. Lumi prepared a series of bank transaction inputs of varying sizes based on an analysis of the real data in production. They used JMeter to call the Asynchronous Inference endpoint and simulate real production load on the cluster. Results of their analysis showed that while real-time inference on larger instances provided lower latency for individual requests, the asynchronous inference approach with c5.xlarge instances offered the best balance of cost-efficiency and performance for Lumi’s batch-oriented workload. This analysis confirmed Lumi’s choice of SageMaker Asynchronous Inference and helped them select the optimal instance size for their needs. After updating the model to use TensorFlow with CUDA, Lumi conducted further optimization by moving to an ml.g5.xlarge GPU-enabled cluster, which improved performance by 82% while reducing costs by 10%.

Best Practices and Recommendations

For businesses looking to implement similar solutions, consider the following best practices:

Optimize Your Container: Follow Lumi’s lead by creating a lean, custom container with only the necessary dependencies. This approach can significantly improve inference speed and reduce costs.

Leverage Asynchronous Processing: For workloads with variable volume or long processing times, asynchronous inference can provide substantial benefits in terms of scalability and cost-efficiency.

Plan for Scale: Design your ML infrastructure with future growth in mind. SageMaker AI’s flexibility allows you to easily add new models and capabilities as your needs evolve.

Model Observability and Governance: When evaluating an inference and hosting platform, consider observability and governance capabilities. SageMaker AI’s robust observability and governance features make it easier to diagnose issues, maintain model performance, ensure compliance, and facilitate continuous improvement and production quality.

Conclusion

By implementing SageMaker AI, Lumi has achieved significant improvements to their business. They have seen a 56% increase in transaction classification accuracy after moving to the new BERT-based model. The ability to handle large batches of transactions asynchronously has dramatically reduced the overall processing time for loan applications, by 53%. The auto-scaling and scale-to-zero feature has resulted in substantial cost savings during off-peak hours, improving the cost efficiency of the model by 47%. In addition, Lumi can now easily handle sudden spikes in loan applications without compromising on processing speed or accuracy.

“Amazon SageMaker AI has been a game-changer for our business. It’s allowed us to process loan applications faster, more efficiently and more accurately than ever before, while significantly reducing our operational costs. The ability to handle large volumes of transactions during peak times and scale to zero during quiet periods has given us the flexibility we need to grow rapidly without compromising on performance or customer experience. This solution has been instrumental in helping us achieve our goal of providing fast, reliable loan decisions to small businesses.”

– Paul Pagnan, Chief Technology Officer at Lumi

Encouraged by the success of their implementation, Lumi is exploring expanding their use of Amazon SageMaker AI to their other models and evaluating other tools such as Amazon Bedrock to enable generative AI use cases. The company aims to host additional models on the platform to further enhance their lending process through machine learning, including:

  • Enhancing their already sophisticated credit scoring and risk assessment models to assess loan applicability more accurately
  • Customer segmentation models to better understand their customer base and personalize loan offerings
  • Predictive analytics to proactively identify market trends and adjust lending strategies accordingly

About the Authors

Paul Pagnan is the Chief Technology Officer at Lumi. Paul drives Lumi’s technology strategy, having led the creation of its proprietary core lending platform from inception. With a diverse background in startups, Commonwealth Bank, and Deloitte, he ensures Lumi is at the forefront of technology while ensuring its systems are scalable and secure. Under Paul’s leadership, Lumi is setting new standards in FinTech. Follow him on LinkedIn.

Daniel Wirjo is a Solutions Architect at AWS, with focus across AI, FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas. Follow him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries. Follow her on LinkedIn.

How AWS Sales uses generative AI to streamline account planning

Every year, AWS Sales personnel draft in-depth, forward looking strategy documents for established AWS customers. These documents help the AWS Sales team to align with our customer growth strategy and to collaborate with the entire sales team on long-term growth ideas for AWS customers. These documents are internally called account plans (APs). In 2024, this activity took an account manager (AM) up to 40 hours per customer. This, combined with similar time spent for support roles researching and writing the growth plans for customers on the AWS Cloud, led to significant organization overhead. To help improve this process, in October 2024 we launched an AI-powered account planning draft assistant for our sales teams, building on the success of Field Advisor, an internal sales assistant tool. This new capability uses Amazon Bedrock to help our sales teams create comprehensive and insightful APs in less time. Since its launch, thousands of sales teams have used the resulting generative AI-powered assistant to draft sections of their APs, saving time on each AP created.

In this post, we showcase how the AWS Sales product team built the generative AI account plans draft assistant.

Business use cases

The account plans draft assistant serves four primary use cases:

  • Account plan draft generation: Using Amazon Bedrock, we’ve made internal and external data sources available to generate draft content for key sections of the APs. This enables our sales teams to quickly create initial drafts for sections such as customer overviews, industry analysis, and business priorities, which previously required hours of research across the internet and relied on disparate internal AWS tools.
  • Data synthesis: The assistant can pull relevant information from multiple sources, including our customer relationship management (CRM) system, financial reports, news articles, and previous APs, to provide a holistic view of our customers.
  • Quality checks: Built-in quality assurance capabilities help ensure that APs meet internal standards for comprehensiveness, accuracy, and strategic alignment with our customers and business.
  • Customization: While providing AI-generated drafts, the product allows AMs to customize and refine the content by uploading proprietary documents to match their unique customer knowledge and strategic approach.

The account plan draft assistant loads when a user tries to create an AP, and users copy and paste each section they want to use in their final plan.

Account plans draft assistant UX

Our AMs report reduced time to write these documents, allowing them to focus more on high-value activities such as customer engagement and strategy development.

Here’s what some of our AMs had to say about their experience with the account plans draft assistant:

“The AI assistant saved me at least 15 hours on my latest enterprise account plan. It pulled together a great first draft, which I was then able to refine based on my own insights. This allowed me to spend more time actually engaging with my customer rather than doing research and writing.”

– Enterprise Account Manager

“As someone managing multiple mid-market accounts, I struggled to create in-depth plans for all my customers. The AI assistant now helps me rapidly generate baseline plans that I can then prioritize and customize. It’s a game-changer for serving my full portfolio of accounts.”

– Mid-market Account Manager

Amazon Q, Amazon Bedrock, and other AWS services underpin this experience, enabling us to use large language models (LLMs) and knowledge bases (KBs) to generate relevant, data-driven content for APs. Let’s explore how we built this AI assistant and some of our future plans.

Building the account plans draft assistant

When a user of the AWS internal CRM system initiates the workflow in Field Advisor, it triggers the account plan draft assistant capability through a pre-signed URL. The assistant then orchestrates a multi-source data collection process, performing web searches while also pulling account metadata from OpenSearch, Amazon DynamoDB, and Amazon Simple Storage Service (Amazon S3) storage. After analyzing and combining this data with user-uploaded documents, the assistant uses Amazon Bedrock to generate the AP. When complete, a notification chain using Amazon Simple Queue Service (Amazon SQS) and our internal notifications service API gateway begins delivering updates using Slack direct messaging and storing searchable records in OpenSearch for future reference.
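As a rough illustration of the generation and notification flow described above, the following sketch shows how a worker function of this kind might call Amazon Bedrock to draft one AP section, persist it against the async job ID, and queue a notification. The table name, queue URL, and model ID are placeholders for illustration, not the actual resources used by Field Advisor.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")
dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

# Hypothetical resource names and model ID for illustration only
JOBS_TABLE = "account-plan-jobs"
NOTIFY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ap-notifications"
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def generate_section(job_id, section, context_docs):
    """Generate one AP section with Amazon Bedrock, persist it, and notify."""
    prompt = (
        f"Using the context below, draft the '{section}' section of an "
        f"account plan.\n\nContext:\n{context_docs}"
    )
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
    )
    draft = response["output"]["message"]["content"][0]["text"]

    # Store the generated section against the async job ID
    dynamodb.Table(JOBS_TABLE).update_item(
        Key={"job_id": job_id},
        UpdateExpression="SET #s = :draft, job_status = :st",
        ExpressionAttributeNames={"#s": f"section_{section}"},
        ExpressionAttributeValues={":draft": draft, ":st": "SECTION_COMPLETE"},
    )

    # Signal downstream notification handling (for example, Slack delivery)
    sqs.send_message(
        QueueUrl=NOTIFY_QUEUE_URL,
        MessageBody=json.dumps({"job_id": job_id, "section": section}),
    )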

The following diagram illustrates the high-level architecture of the account plans draft assistant.

Solution overview

We built the account plans draft assistant using the following key components:

  1. Amazon Bedrock: Provides programmatic (API) access to high-performing foundation models (FMs) along with vector search capabilities and metadata filtering using Amazon Bedrock Knowledge Bases. We populate Amazon Bedrock knowledge bases using sales-enablement materials, historic APs, and other relevant documents curated by AWS Glue jobs (see item 4 for more on AWS Glue jobs).
  2. AWS Lambda: Supports two use cases:
    1. The async resolver Lambda function interfaces with the front-end client CRM and orchestrates async job IDs for the client to poll. This layer also handles input validation, user request throttling, and cache management.
    2. Worker Lambda functions perform the actual heavy lifting to create AP content. These functions work concurrently to generate different sections of APs by using publicly available data, internal data, and curated data in Amazon Bedrock knowledge bases. These functions invoke various LLMs using Amazon Bedrock and store the final content in the AP’s DynamoDB database corresponding to each async job ID.
  3. DynamoDB: Maintains the state of each user request by tracking async job IDs, tracks throttling quotas (global request count and per-user request count), and acts as a cache (see the sketch after this list).
  4. AWS Glue jobs: Curate and transform data from various internal and external data sources. These AWS Glue jobs push data to internal data sources (APs, internal tooling team S3 buckets, and other internal services) and to Bedrock KBs, facilitating high quality output through retrieval augmented generation (RAG).
  5. Amazon SQS: Enables us to decouple the management plane and data plane. This decoupling is crucial in allowing the data plane worker functions to concurrently process different sections of the APs and make sure that we can generate APs within specified times.
  6. Custom web frontend: A ReactJS based micro-frontend architecture enables us to integrate directly into our CRM system for a seamless user experience.
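As referenced in item 3, the following is a minimal sketch of how the async job state and per-user throttling quota could be tracked in DynamoDB; the table names and the request limit are assumptions for illustration, not the actual implementation.

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
# Hypothetical table names for illustration only
jobs = dynamodb.Table("account-plan-jobs")
quotas = dynamodb.Table("account-plan-quotas")

USER_DAILY_LIMIT = 25  # assumed per-user request quota

def register_job(job_id, user_id):
    """Record a new async job and enforce a simple per-user quota."""
    try:
        # Atomically increment the user's counter, failing if the quota is exceeded
        quotas.update_item(
            Key={"user_id": user_id},
            UpdateExpression="ADD request_count :one",
            ConditionExpression="attribute_not_exists(request_count) OR request_count < :limit",
            ExpressionAttributeValues={":one": 1, ":limit": USER_DAILY_LIMIT},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # throttled
        raise

    # Track the job so the client can poll its status
    jobs.put_item(
        Item={
            "job_id": job_id,
            "user_id": user_id,
            "job_status": "PENDING",
            "created_at": int(time.time()),
        }
    )
    return True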

Data management

Our account plans draft assistant uses the out-of-the-box knowledge base management capabilities of Amazon Bedrock. Through its RAG architecture, we semantically search and use metadata filtering to retrieve relevant context from diverse sources: internal sales enablement materials, historic APs, SEC filings, news articles, executive engagements, and data from our CRM systems. The connectors built into Amazon Bedrock handle data ingestion from Amazon S3, relational database management systems (RDBMS), and third-party APIs, while its knowledge base capabilities enable us to filter and prioritize source documents when generating responses. This context-aware approach results in higher quality and more relevant content in our generated AP sections.
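To make the retrieval step more concrete, the following is a minimal sketch of a semantic search against an Amazon Bedrock knowledge base with a metadata filter; the knowledge base ID and the source_type metadata attribute are assumptions for illustration.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Hypothetical knowledge base ID for illustration only
KNOWLEDGE_BASE_ID = "EXAMPLEKBID"

def retrieve_context(query, source_type, top_k=5):
    """Semantically search the knowledge base, filtering by document source."""
    response = agent_runtime.retrieve(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                # Restrict results to a single source type, e.g. historic APs
                "filter": {"equals": {"key": "source_type", "value": source_type}},
            }
        },
    )
    return [r["content"]["text"] for r in response["retrievalResults"]]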

Security and compliance

Security and compliance are paramount to AWS when dealing with data regarding our customers. We use AWS IAM Identity Center for enterprise single sign-on so that only authorized users can access the account plans draft assistant. Within Field Advisor, we use various internal authorization mechanisms to help ensure that a user who’s generating APs only accesses the data that they already have access to.

User experience

We built a custom web frontend using a micro-frontend approach that integrates directly into our CRM system, allowing AMs to access the account plans draft assistant without leaving their familiar work environment. The interface allows users to select which sections of APs they want to generate, provides options for customization, and notifies users to create their APs on time through Slack.

Looking ahead

While the account plans draft assistant has already demonstrated significant value, we’re continuing to enhance its capabilities. Our goal is to create a zero-touch account planner that sales teams can use to generate a full AP for a customer, incorporating best practices observed across our customers to provide sales teams best-in-class strategies to engage with customers. This would include:

  • Deeper integration with our bespoke planning tools and assistance with account planning, such as automatically generating value maps and stakeholder maps.
  • Enhanced personalization to tailor content based on industry, account size, and individual user preferences.
  • Improved collaboration features, so that multiple sales team members can work together on refining AI-generated plans.
  • Expanded use of recommendations to provide “what next?” ideas to our sales teams to better serve our customers.

Conclusion

The account plans draft assistant, powered by Amazon Bedrock, has significantly streamlined our AP process, allowing our AWS Sales teams to create higher quality APs in a fraction of the time they previously needed. As we continue to refine and expand this capability, we’re excited to see how it will further enhance our ability to serve our customers and drive their success in the AWS Cloud.

If you’re interested in learning how generative AI can transform your sales function and its processes, reach out to your AWS account team to discuss how services such as Amazon Q and Amazon Bedrock can help you build similar solutions for your organization.


About the Authors

Saksham Kakar is a Sr. Product Manager (Technical) in the AWS Field Experiences (AFX) organization focused on developing products that enable AWS Sales teams to help AWS customers grow with Amazon. Prior to this, Saksham led large sales, strategy and operations teams across startups and Fortune 500 companies. Outside of work, he is an avid tennis player and amateur skier.

Vimanyu Aggarwal is a Senior Software Engineer in AWS Field Experiences (AFX) organization with over 10 years of industry experience. Over the last decade, Vimanyu has been focusing on building large-scale, complex distributed systems at various Fortune 500 organizations. Currently, he works with multiple teams within the AFX organization to deliver technical solutions that empower the $100 billion sales funnel. Outside of work, he likes to play board games, tinker with IoT, and explore nature.

Krishnachand Velaga is a Senior Manager for Product Management – Technical (PM-T) in the AWS Field Experiences (AFX) organization. He manages a team of seasoned PM-Ts and a suite of sales products that use generative AI to enable the AWS Sales organization to help AWS customers across the globe adopt, migrate, and grow on the AWS Cloud in line with their business needs and outcomes, while bolstering sales efficiency and productivity and reducing operational cost.

Scott Wilkinson is a Software Development Manager in the AWS Field Experiences (AFX) organization, where he leads a cross-functional engineering team developing tools that aggregate and productize data to power AWS customer insights. Prior to AWS, Scott worked for notable startups including Digg, eHarmony, and Nasty Gal in both leadership and software development roles. Outside of work, Scott is a musician (guitar and piano) and loves to cook French cuisine.

Shaping the future: OMRON’s data-driven journey with AWS

This post is co-written with Emrah Kaya and Xinyi Zhou from Omron Europe.

Data is one of the most critical assets of many organizations, and they’re constantly seeking ways to use their vast amounts of information to gain competitive advantages.

OMRON Corporation is a leading technology provider in industrial automation, healthcare, and electronic components. In their Shaping the Future 2030 (SF2030) strategic plan, OMRON aims to address diverse social issues, drive sustainable business growth, transform business models and capabilities, and accelerate digital transformation. At the heart of this transformation is the OMRON Data & Analytics Platform (ODAP), an innovative initiative designed to revolutionize how the company harnesses its data assets.

This post explores how OMRON Europe is using Amazon Web Services (AWS) to build its advanced ODAP and its progress toward harnessing the power of generative AI.

Challenges

By using advanced data and analytics capabilities, organizations can gain valuable insights into their operations, industry trends, and customer behaviors, leading to more informed strategies and increased insight. This approach is particularly powerful when applied to mission-critical data such as enterprise resource planning (ERP) and customer relationship management (CRM) systems because these contain information about internal processes, supply chain management, and customer interactions. By analyzing their data, organizations can identify patterns in sales cycles, optimize inventory management, or help tailor products or services to meet customer needs more effectively. However, organizations often face significant challenges in realizing these benefits because of:

  • Data silos – Organizations often use multiple systems across regions or departments. Integrating these diverse sources to create a single source of truth is complex, making it difficult to generate unified reports or analyze cross-functional trends.
  • Data governance challenges – Maintaining consistent data governance across different systems is crucial but complex. Implementing uniform policies across different systems and departments presents significant hurdles.
  • Different formats and standards – Systems typically use varied data formats and structures. This disparity complicates data integration and cross-system analysis, requiring significant effort to reconcile and harmonize data for comprehensive insights.

OMRON Data & Analytics Platform

To address these challenges, OMRON Europe (hereinafter “OMRON”) decided to implement an advanced data and analytics platform, ODAP. This innovative solution was designed to serve as a centralized hub for specific data assets, breaking down the barriers between various data sources and systems.

The following diagram shows a simplified architecture and some of the services and architectural patterns used for ODAP.

ODAP aimed to seamlessly integrate data from multiple ERP and CRM systems in addition to other relevant data sources across the organization. Amazon AppFlow was used to facilitate the smooth and secure transfer of data from various sources into ODAP. Additionally, Amazon Simple Storage Service (Amazon S3) served as the central data lake, providing a scalable and cost-effective storage solution for the diverse data types collected from different systems. The robust security features provided by Amazon S3, including encryption and durability, were used to provide data protection. Finally, ODAP was designed to incorporate cutting-edge analytics tools and future AI-powered insights.

Some of these tools included AWS Cloud-based solutions, such as AWS Lambda and AWS Step Functions. Lambda enables serverless, event-driven data processing tasks, allowing for real-time transformations and calculations as data arrives. Step Functions complements this by orchestrating complex workflows, coordinating multiple Lambda functions, and managing error handling for sophisticated data processing pipelines. This enables OMRON to extract meaningful patterns and trends from its vast data repositories, supporting more informed decision-making at all levels of the organization.

OMRON’s data strategy—represented on ODAP—also allowed the organization to unlock generative AI use cases focused on tangible business outcomes and enhanced productivity. Part of a comprehensive approach to using artificial intelligence and machine learning (AI/ML) and generative AI includes a strong data strategy that can help provide high quality and reliable data.

Embracing generative AI with Amazon Bedrock

The company has identified several use cases where generative AI can significantly impact operations, particularly in analytics and business intelligence (BI).

One key initiative is ODAPChat, an AI-powered chat-based assistant employees can use to interact with data using natural language queries. This tool democratizes data access across the organization, enabling even nontechnical users to gain valuable insights.

A standout application is the SQL-to-natural language capability, which translates complex SQL queries into plain English and vice versa, bridging the gap between technical and business teams. To power these advanced AI features, OMRON chose Amazon Bedrock. This fully managed service offers a range of foundation models (FMs), providing the flexibility to select the most suitable model for each use case. The straightforward implementation of Amazon Bedrock, coupled with its scalability to handle growing data volumes and user requests, made it an ideal choice for OMRON. The ability of Amazon Bedrock to support various models from different providers helps make sure that OMRON can always use the most advanced AI capabilities as they evolve.
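As a simple sketch of how a SQL-to-natural language translation could be requested from a foundation model through Amazon Bedrock, consider the snippet below; the model ID is a placeholder, and this is not OMRON's actual implementation.

import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # placeholder model ID

def explain_sql(sql_query):
    """Ask a Bedrock foundation model to translate a SQL query into plain English."""
    prompt = (
        "Explain in plain English, for a non-technical reader, what the "
        f"following SQL query returns:\n\n{sql_query}"
    )
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]

print(explain_sql("SELECT region, SUM(revenue) FROM sales GROUP BY region"))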

Crucially, the robust security features provided by Amazon Bedrock align perfectly with OMRON’s stringent data protection requirements. Some highlights include:

  • Fine-grained access controls
  • Networking security features such as encryption of data in transit and at rest, or the ability to use private virtual private clouds (VPCs), helping to make sure that sensitive business data remains secure even when being processed by AI models
  • Amazon Bedrock Guardrails

These strict security controls offer a comprehensive security approach that allows OMRON to innovate with AI while maintaining the highest standards of data governance and protection.

The following diagram shows a basic layout of how the solution works. It helps illustrate the main parts and how they work together to make the AI assistant do its job.

The system has three main sections:

  • User interface – Users engage with the chat interface hosted on AWS. Amazon Cognito handles the user authentication processes, providing secure access to the application.
  • Input processing backend – Amazon API Gateway receives incoming messages, which are then processed by containers running on Amazon Elastic Container Service (Amazon ECS). Chat conversations are preserved in Amazon DynamoDB to support follow-up turns. Amazon Bedrock takes care of generating AI responses, and tools are configured using LangChain, which helps determine how to handle different types of queries. When needed, the system can access an ODAP data warehouse to retrieve additional information.
  • Document management – Documents are securely stored in Amazon S3, and when new documents are added, a Lambda function processes them into chunks. These chunks are converted into embeddings using Amazon Bedrock, and the embeddings are stored in an Amazon OpenSearch Service vector store for semantic search (see the sketch following this list).
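The following is a minimal sketch of the chunk-embedding step referenced in the document management bullet; the OpenSearch endpoint, index name, and embedding model ID are assumptions, and authentication is omitted for brevity.

import json
import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")

# Hypothetical endpoint, index, and model ID; authentication is omitted for brevity
search = OpenSearch(hosts=[{"host": "my-odap-domain.example.com", "port": 443}], use_ssl=True)
INDEX = "odap-documents"
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"

def index_chunk(doc_id, chunk_no, chunk_text):
    # Convert the chunk into an embedding with Amazon Bedrock
    response = bedrock.invoke_model(
        modelId=EMBED_MODEL_ID,
        body=json.dumps({"inputText": chunk_text}),
    )
    embedding = json.loads(response["body"].read())["embedding"]

    # Store the chunk and its vector for semantic search
    search.index(
        index=INDEX,
        id=f"{doc_id}-{chunk_no}",
        body={"text": chunk_text, "embedding": embedding},
    )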

Results and future plans

The implementation of ODAP and ODAPChat on AWS has already yielded significant benefits for OMRON:

  • Optimization of reports, leading to more efficient and insightful analysis
  • SQL-to-natural language capabilities powered by generative AI, making data more accessible to nontechnical users
  • Increased business agility with infrastructure fully deployed in the cloud
  • Data democratization, enabling more employees to use data-driven insights

Looking ahead, OMRON plans to significantly expand its use of AWS services and further use generative AI capabilities. The company aims to integrate additional data sources, including other mission-critical systems, into ODAP. This expansion will be coupled with enhanced data governance measures to help promote data quality and compliance across the growing data solution.

OMRON is also exploring more advanced generative AI use cases. These initiatives will use the evolving capabilities provided by Amazon Bedrock to potentially incorporate advanced AI models and security features.

Conclusion

OMRON’s journey with AWS demonstrates the transformative power of cloud-based data solutions and generative AI in overcoming data silos and driving business innovation. By using AWS services such as Amazon AppFlow, Amazon S3, and Amazon Bedrock, OMRON has created a comprehensive, secure, and adaptable data and analytics platform that not only meets its current needs, but also positions the company for future growth and innovation.

As organizations across industries grapple with similar data challenges, OMRON’s story serves as an inspiring example of how embracing cloud technologies and AI can lead to significant business transformation and competitive advantage.


About the Authors

Emrah Kaya is Data Engineering Manager at Omron Europe and Platform Lead for the ODAP Project. With his extensive background in cloud and data architecture, Emrah leads OMRON’s key technological advancement initiatives, including artificial intelligence, machine learning, and data science.

Xinyi Zhou is a Data Engineer at Omron Europe, bringing her expertise to the ODAP team led by Emrah Kaya. She specializes in building efficient data pipelines and managing AWS infrastructure, while actively contributing to the implementation of new solutions that advance ODAP’s technological capabilities.

Emel Mendoza is a Senior Solutions Architect at AWS based in the Netherlands. With a passion for cloud migrations and application modernization, Emel helps organizations navigate their digital transformation journeys on AWS. Emel leverages his decade of experience to guide customers in adopting AWS services and architecting scalable, efficient solutions.

Jagdeep Singh Soni is a Senior Partner Solutions Architect at AWS based in the Netherlands. He uses his passion for Generative AI to help customers and partners build GenAI applications using AWS services. Jagdeep has 15 years of experience in innovation, experience engineering, digital transformation, cloud architecture and ML applications.

AI Workforce: using AI and Drones to simplify infrastructure inspections

Inspecting wind turbines, power lines, 5G towers, and pipelines is a tough job. It’s often dangerous, time-consuming, and prone to human error. That’s why we at Amazon Web Services (AWS) are working on AI Workforce—a system that uses drones and AI to make these inspections safer, faster, and more accurate.

This post is the first in a three-part series exploring AI Workforce, the AWS AI-powered drone inspection system. In this post, we introduce the concept and key benefits. The second post dives into the AWS architecture that powers AI Workforce, and the third focuses on the drone setup and integration.

In the following sections, we explain how AI Workforce enables asset owners, maintenance teams, and operations managers in industries such as energy and telecommunications to enhance safety, reduce costs, and improve efficiency in infrastructure inspections.

Challenges with traditional inspections

Inspecting infrastructure using traditional methods is a challenge. You need trained people and specialized equipment, and you often must shut things down during inspection. As an example, climbing a wind turbine in bad weather for an inspection can be dangerous. Plus, even the best human inspector can miss things. This can lead to bigger problems down the line, costing time and money.

Technicians inspecting wind turbine blades overlooking landscape.

How AI Workforce helps

AI Workforce is designed to change all that. We use autonomous drones equipped with advanced sensors and AI to do the inspections. This brings the following benefits:

  • Less risk for people – Drones do the dangerous work so people don’t have to. This makes inspections much safer.
  • Faster and more efficient – Drones can cover a lot of ground quickly, getting the job done faster.
  • Better data – Automated data collection and analysis means fewer mistakes and more consistent results. This allows for proactive maintenance.

What does AI Workforce look like in action? Users interact with a simple AI assistant and dashboard that displays near real-time drone inspections, detected issues, and AI-generated insights. The following figure shows an example of the user dashboard and drone conversation.

AIW user interface

The following figure is an example of drone 4K footage.

Solution overview

AI Workforce is built on a robust and scalable architecture using a wide array of AWS services. Security is paramount, and we adhere to AWS best practices across the layers. This includes:

  • Amazon API Gateway manages secure communication between various components, enforcing authentication and authorization
  • AWS Identity and Access Management (IAM) roles and policies verify least privilege access, limiting each component’s permissions to only what is necessary
  • Network security is implemented through virtual private clouds (VPCs), security groups, and network access control lists (ACLs), isolating the system and protecting it from unauthorized access
  • For video processing, we employ secure transfer protocols and encryption at rest and in transit

AI Workforce provides a robust API for managing drone operations, including flight planning, telemetry data, and anomaly detection. The following diagram outlines how different components interact.

Imagine a system where drones autonomously inspect critical infrastructure, capturing high-resolution video, analyzing potential defects with AI, and seamlessly integrating findings into business workflows. The AI Workforce architecture brings this vision to life, using AWS services across four key pillars.

Control plane: Secure drone communication and operations

Our journey begins with automated drone flights. Each drone follows predefined routes, with flight waypoints, altitude, and speed configured through an AWS API, using coordinates stored in Amazon DynamoDB. Once airborne, AWS IoT Core enables secure, bidirectional communication—allowing drones to receive real-time commands (like “take-off”, “begin flight ID = xxx”, or “land”), adjust flight paths, and transmit telemetry data back to AWS. To maintain robust security, AWS Lambda responds to Internet of Things (IoT) events, enabling immediate actions based on drone data, while Amazon GuardDuty continuously monitors for anomalies or potential security threats, such as unusual API activity or unauthorized access attempts, helping protect the integrity of drone operations and promoting secure operations.
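To make the command path more concrete, here is a minimal sketch of publishing a flight command to a drone through AWS IoT Core; the MQTT topic structure and drone ID are assumptions for illustration.

import json
import boto3

iot = boto3.client("iot-data")

def send_drone_command(drone_id, command, flight_id=None):
    # Publish a command message over MQTT; the topic structure is an assumption
    payload = {"command": command}
    if flight_id:
        payload["flight_id"] = flight_id
    iot.publish(
        topic=f"drones/{drone_id}/commands",
        qos=1,
        payload=json.dumps(payload),
    )

send_drone_command("drone-001", "take-off")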

In AI Workforce, AWS IoT Core serves as the primary entry point for real-time drone communication, handling telemetry data, command and control messaging, and secure bidirectional communication with drones. API Gateway plays a complementary role by acting as the main entry point for external applications, dashboards, and enterprise integrations. It is responsible for managing RESTful API calls related to flight planning, retrieving inspection results, and interacting with backend services like Amazon Relational Database Service (Amazon RDS) and AWS Step Functions. While drones communicate directly with AWS IoT Core, user-facing applications and automation workflows rely on API Gateway to access structured data and trigger specific actions within the AI Workforce ecosystem.

AI/ML and generative AI: Computer vision and intelligent insights

As drones capture video footage, raw data is processed through AI-powered models running on Amazon Elastic Compute Cloud (Amazon EC2) instances. These computer vision models detect anomalies, classify damage types, and extract actionable insights—whether it’s spotting cracks on wind turbines or identifying corrosion on pipelines. Amazon SageMaker AI is at the core of our machine learning (ML) pipeline, training and deploying models for object detection, anomaly detection, and predictive maintenance.

We are also pioneering generative AI with Amazon Bedrock, enhancing our system’s intelligence. With natural language interactions, asset owners can ask questions like “What were the most critical defects detected last week?” and Amazon Bedrock generates structured reports based on inspection findings. It even aids in synthetic training data generation, refining our ML models for improved accuracy.

Data layer: Storing and managing inspection data

Every inspection generates vast amounts of data—high-resolution images, videos, and sensor readings. This information is securely stored in Amazon Simple Storage Service (Amazon S3), promoting durability and ease of access. Amazon S3 encrypts data at rest by default using server-side encryption (SSE), providing an additional layer of security without requiring manual configuration. Meanwhile, structured metadata and processed results are housed in Amazon RDS, enabling fast queries and integration with enterprise applications. Together, these services create a resilient data foundation, supporting both real-time analysis and historical trend monitoring.

Analytics and business: Automated workflows and business intelligence

Insights don’t stop at data collection—Step Functions orchestrates workflows that trigger automated actions. For example, if an AI model detects a critical defect, Step Functions can initiate a maintenance request in SAP, notify engineers, and schedule repairs without human intervention.
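As a rough sketch of this pattern, the following Lambda handler starts a Step Functions maintenance workflow for each critical defect reported by the model; the state machine ARN and event shape are assumptions for illustration.

import json
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine ARN for illustration only
MAINTENANCE_WORKFLOW_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:defect-maintenance"

def handler(event, context):
    # Start the maintenance workflow for each critical defect reported by the model
    for defect in event.get("defects", []):
        if defect.get("severity") == "critical":
            sfn.start_execution(
                stateMachineArn=MAINTENANCE_WORKFLOW_ARN,
                input=json.dumps({
                    "asset_id": defect["asset_id"],
                    "defect_type": defect["type"],
                    "frame_s3_uri": defect["frame_s3_uri"],
                }),
            )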

For deeper analysis, Amazon QuickSight transforms raw inspection data into interactive dashboards, helping asset owners track infrastructure health, spot trends, and optimize maintenance strategies. With a clear visual representation of defects, decision-makers can act swiftly, minimizing downtime and maximizing operational efficiency.

The future of AI Workforce: Expanding drone capabilities

Beyond inspections, AI Workforce provides a robust Drone API, offering seamless integration for third-party applications. This API enables remote flight planning, telemetry monitoring, and anomaly detection—all within a scalable AWS environment.

With secure drone communication, powerful AI-driven insights, a robust data foundation, and business automation, AI Workforce is redefining infrastructure inspection, making it smarter, faster, and more efficient than ever before.

Benefits and impact on business operations

The deployment of AI Workforce delivers a wide range of tangible benefits for organizations managing critical infrastructure, particularly in the energy and telecom sectors (for example, automatically comparing multiple inspections over time to detect longitudinal changes and identify progressive failures for proactive maintenance):

  • Significant cost savings – By reducing the need for human labor, specialized equipment, and extensive logistical planning, AI Workforce can significantly lower inspection costs. Proactive maintenance based on early defect detection also prevents costly repairs and unplanned downtime.
  • Dramatically enhanced safety – Removing human personnel from hazardous environments drastically reduces the risk of accidents and injuries, creating a safer working environment.
  • Substantially improved efficiency – Automated drone inspections are significantly faster and more efficient than traditional methods, enabling more frequent inspections and faster turnaround times.
  • Data-driven decision-making – AI Workforce provides asset owners with comprehensive and accurate data, enabling them to make informed decisions about maintenance, repairs, and asset management.

Example AI Workforce use case in the industry sector

Picture an energy company responsible for maintaining a large wind farm. They deploy AI Workforce drones for regular inspections. The drones, autonomously navigating preprogrammed flight paths defined by coordinates stored in DynamoDB and controlled through REST API calls, are securely connected using AWS IoT Core.

During the flight, sensor data is processed at the edge and streamed to Amazon S3, with metadata stored in Amazon RDS. Computer vision algorithms analyze the video in real time. If an anomaly is detected, a Lambda function triggers a Step Functions workflow, which in turn interacts with their SAP system to generate a maintenance work order. Inspection data is aggregated and visualized in QuickSight dashboards, providing a comprehensive overview of the wind farm’s health.

SageMaker AI models analyze the data, predicting potential failures and informing proactive maintenance strategies. In the future, Amazon Bedrock might provide summarized reports and generate synthetic data to further enhance the system’s capabilities.

Conclusion

At AWS, we’re committed to driving innovation in AI-powered solutions for a wide range of industries. AI Workforce is a prime example of how we’re using cutting-edge technologies to transform how critical infrastructure is managed and maintained.

We’re building this workforce to help businesses operate more efficiently and safely. We’re open to collaborating with others who are interested in this space. If you’d like to learn more, feel free to reach out. We welcome the opportunity to discuss your specific needs and explore potential collaborations.


About the Author

Miguel Muñoz de Rivera González is the original designer and technical lead for the AI Workforce initiative at AWS, driving AI-powered drone solutions for safer, smarter, and cost-effective infrastructure inspections.

Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI

Foundation model (FM) training and inference have led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU-accelerated servers and optimizing developer velocity as well as performance.

Ray is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. Ray’s key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.

Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.

In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.

Overview of Ray

This section provides a high-level overview of the Ray tools and frameworks for AI/ML workloads. We primarily focus on ML training use cases.

Ray is an open-source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure through a single, flexible framework—enabling AI workloads from data processing, to model training, to model serving and beyond.

For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.
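The following minimal example shows the Ray Core programming model: a plain Python function decorated with @ray.remote runs as parallel tasks that Ray can schedule across the cluster.

import ray

ray.init()  # connects to an existing Ray cluster if its address is configured

@ray.remote
def square(x):
    # Each call runs as a Ray task, potentially on a different worker node
    return x * x

futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]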

At a high level, Ray is made up of three layers:

  • Ray Core: The foundation of Ray, providing primitives for parallel and distributed computing
  • Ray AI libraries:
    • Ray Train – A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face (a minimal example follows this list)
    • Ray Tune – A library for scalable hyperparameter tuning
    • Ray Serve – A library for distributed model deployment and serving
  • Ray clusters: A distributed computing platform where worker nodes run user code as Ray tasks and actors, generally in the cloud
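As referenced in the Ray Train bullet, the following is a minimal sketch of a distributed PyTorch training loop launched with Ray Train; the model, synthetic data, and hyperparameters are stand-ins for illustration rather than a real training workload.

import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Runs on every Ray worker; Ray Train sets up the distributed process group
    from ray.train.torch import prepare_model, get_device

    model = prepare_model(torch.nn.Linear(10, 1))  # wraps the model for DDP
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        # Synthetic batch as a stand-in for a real, prepared data loader
        inputs = torch.randn(32, 10).to(get_device())
        loss = model(inputs).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()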

In this post, we dive deep into running Ray clusters on SageMaker HyperPod. A Ray cluster consists of a single head node and a number of connected worker nodes. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads using Ray tasks and actors, such as model training or data preprocessing.

Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from the smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS).

The KubeRay operator enables you to run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):

  • RayCluster – The primary resource for managing Ray instances on Kubernetes. The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.
  • RayJob – A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.
  • RayService – Combines a Ray cluster and a Serve application that runs on top of it into a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.

For the remainder of this post, we don’t focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.

When Ray clusters are paired with SageMaker HyperPod clusters, Ray clusters unlock enhanced resiliency and auto-resume capabilities, which we will dive deeper into later in this post. This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.

Overview of SageMaker HyperPod

In this section, we introduce SageMaker HyperPod and its built-in resiliency features to provide infrastructure stability.

Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it’s important to minimize latency, maximize throughput, and seamlessly scale across those instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure to address these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU accelerated cluster. It also provides flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.

Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and AWS Trainium. A single failure in one of these thousand accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. This workflow can take several hours for each failure and as the scale of the cluster grows, it’s common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix the bad instances, reload the last valid checkpoint, and resume the training—without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.

Solution overview

This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.

Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the aws-do-ray project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses Docker to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. In addition to the aws-do-ray project, we’d like to highlight the Amazon SageMaker HyperPod EKS workshop, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are multiple examples of training and inference workloads in the GitHub repository awsome-distributed-training.

As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.

SMHP EKS Architecture

Create a SageMaker HyperPod cluster

Prerequisites

Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster.

If you prefer to deploy HyperPod on an existing EKS cluster, follow the instructions here, which include:

  • EKS cluster – You can associate SageMaker HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively, and as the recommended approach, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the GitHub repo for instructions on setting up an EKS cluster.
  • Custom resources – Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to Install packages on the Amazon EKS cluster using Helm for installation instructions.

The following provides an example workflow for creating a HyperPod cluster on an existing EKS cluster after deploying the prerequisites. This is for reference only and not required for the quick deploy option.

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "head-group",
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "${SECURITY_GROUP_ID}"
        ],
        "Subnets": [
            "${SUBNET_ID}"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables SageMaker HyperPod automated node recovery

You can create a SageMaker HyperPod compute with the following AWS Command Line Interface (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                    ListClusters                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                 ClusterSummaries                                                                 ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||                           ClusterArn                           |        ClusterName        | ClusterStatus  |           CreationTime             ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster/zsmyi57puczf |         ml-cluster        |   InService     |  2025-03-03T16:45:05.320000+00:00  ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|

Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status for the nodes transitions to Running.

Create an FSx for Lustre shared file system

For us to deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, and additionally we need a shared storage volume (for example, an Amazon FSx for Lustre file system). This is a shared file system that the SageMaker HyperPod nodes can access. This file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.

Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it is required for multi-node clusters. Using a local path will raise an error during checkpointing for multi-node clusters.

The Amazon FSx for Lustre CSI driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.

Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
  --namespace kube-system

This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.

Use the eksctl CLI to create an AWS Identity and Access Management (IAM) role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS managed policy:

eksctl create iamserviceaccount \
  --name fsx-csi-controller-sa \
  --override-existing-serviceaccounts \
  --namespace kube-system \
  --cluster $EKS_CLUSTER_NAME \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
  --approve \
  --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
  --region $AWS_REGION

The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the current service account instead.

Annotate the driver’s service account with the Amazon Resource Name (ARN) of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
  eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.

Verify that the service account has been properly annotated:

kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o yaml

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system

The FSx for Lustre CSI driver presents you with two options for provisioning a file system:

  • Dynamic provisioning – This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.
  • Static provisioning – In this method, you manually create the FSx for Lustre file system before using the CSI driver. You will need to configure details like subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system within your container as a volume.

For this example, we use dynamic provisioning. Start by creating a storage class that uses the fsx.csi.aws.com provisioner:

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SECURITYGROUP_ID}
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.12"
mountOptions:
  - flock
EOF

kubectl apply -f storageclass.yaml

  • SUBNET_ID – The ID of the subnet in which the FSx for Lustre file system will be created. It should be the same private subnet that was used for HyperPod creation.
  • SECURITYGROUP_ID – The security group IDs that will be attached to the file system. They should be the same security group IDs used for HyperPod and EKS.

Next, create a PersistentVolumeClaim (PVC) named fsx-claim that references the fsx-sc storage class:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF

kubectl apply -f pvc.yaml

This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.

Create the Ray cluster

Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:

  1. Set up dependencies. We will create a new namespace in our Kubernetes cluster and install the KubeRay operator using a Helm chart.

We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).

# Create KubeRay namespace
kubectl create namespace kuberay
# Deploy the KubeRay operator with the Helm chart repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
#Install both CRDs and Kuberay operator v1.2.0
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0 --namespace kuberay
# Kuberay operator pod will be deployed onto head pod
kubectl get pods --namespace kuberay
  2. Create a Ray container image for the Ray cluster manifest. With the deprecation of the `rayproject/ray-ml` images starting from Ray version 2.31.0, it’s necessary to create a custom container image for our Ray cluster. Therefore, we will build on top of the `rayproject/ray:2.42.1-py310-gpu` image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. Feel free to modify this Dockerfile as you wish.

First, create a Dockerfile that builds upon the base Ray GPU image and includes only the necessary dependencies:

cat <<EOF > Dockerfile
 
FROM rayproject/ray:2.42.1-py310-gpu
# Install Python dependencies for PyTorch, Ray, Hugging Face, and more
RUN pip install --no-cache-dir \
    torch torchvision torchaudio \
    numpy \
    pytorch-lightning \
    transformers datasets evaluate tqdm click \
    ray[train] ray[air] \
    ray[train-torch] ray[train-lightning] \
    torchdata \
    torchmetrics \
    torch_optimizer \
    accelerate \
    scikit-learn \
    Pillow==9.5.0 \
    protobuf==3.20.3
 
RUN pip install --upgrade datasets transformers
 
# Set the user
USER ray
WORKDIR /home/ray
 
# Verify ray installation
RUN which ray && \
    ray --version
  
# Default command
CMD [ "/bin/bash" ]
 
EOF

Then, build and push the image to your container registry (Amazon ECR) using the provided script:

export AWS_REGION=$(aws configure get region)
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
 
echo "This process may take 10-15 minutes to complete..."
 
echo "Building image..."
 
docker build --platform linux/amd64 -t ${REGISTRY}aws-ray-custom:latest .
 
# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "aws-ray-custom" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    aws ecr create-repository --repository-name aws-ray-custom
fi
 
# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password --region $AWS_REGION| docker login --username AWS --password-stdin $REGISTRY
 
echo "Pushing image to $REGISTRY ..."
 
# Push image to registry
docker image push ${REGISTRY}aws-ray-custom:latest 

Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as code library dependencies.

  3. Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.

Note that there are two distinct sections in the cluster manifest. The `headGroupSpec` defines the head node of the Ray cluster, and the `workerGroupSpecs` define the worker nodes. Although a job could technically run on the head node as well, it is common to separate the head node from the worker nodes where jobs are executed. The head node can therefore typically be a smaller instance (for example, we chose an m5.2xlarge). Because the head node also manages cluster-level metadata, it can be beneficial to run it on a non-GPU node, since GPUs are a common source of node failure.

cat <<'EOF' > raycluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayml
  labels:
    controller-tools.k8s.io: "1.0"
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    #pod template
    template:
      spec:
        # nodeSelector:
        #   node.kubernetes.io/instance-type: "ml.m5.2xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-head
          image: ${REGISTRY}aws-ray-custom:latest     ## IMAGE: Here you may choose which image your head pod will run
          env:                                ## ENV: Environment variables for the head pod
            - name: RAY_GRAFANA_IFRAME_HOST   ## PROMETHEUS AND GRAFANA
              value: http://localhost:3000
            - name: RAY_GRAFANA_HOST
              value: http://prometheus-grafana.prometheus-system.svc:80
            - name: RAY_PROMETHEUS_HOST
              value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your head pod
              cpu: 1
              memory: 8Gi
            requests:                                    ## REQUESTS: Set resource requests for your head pod
              cpu: 1
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          volumeMounts:                                    ## VOLUMEMOUNTS
          - name: fsx-storage
            mountPath: /fsx
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: fsx-storage
            persistentVolumeClaim:
              claimName: fsx-claim
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 4                                    ## REPLICAS: How many worker pods you want 
    minReplicas: 1
    maxReplicas: 10
    # Logical group name for this worker group (here, gpu-group)
    groupName: gpu-group
    rayStartParams:
      num-gpus: "8"
    #pod template
    template:
      spec:
        # nodeSelector:
        #   node.kubernetes.io/instance-type: "ml.p5.48xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-worker
          image: ${REGISTRY}aws-ray-custom:latest             ## IMAGE: Here you may choose which image your worker pods will run
          env:
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32  
            requests:                                    ## REQUESTS: Set resource requests for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
          volumeMounts:                                    ## VOLUMEMOUNTS
          - name: ray-logs
            mountPath: /tmp/ray
          - name: fsx-storage
            mountPath: /fsx
        volumes:
        - name: fsx-storage
          persistentVolumeClaim:
            claimName: fsx-claim
        - name: ray-logs
          emptyDir: {}
EOF
  4. Deploy the Ray cluster:
envsubst < raycluster.yaml | kubectl apply -f -
  5. Optionally, expose the Ray dashboard using port forwarding:
# Gets name of kubectl service that runs the head pod
export SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)
# Port forwards the dashboard from the head pod service
kubectl port-forward --address 0.0.0.0 service/${SERVICEHEAD} 8265:8265 > /dev/null 2>&1 &

You can now open http://localhost:8265/ to view the Ray dashboard.

  6. To launch a training job, there are a few options:
    1. Use the Ray jobs submission SDK, where you can submit jobs to the Ray cluster through the Ray dashboard port (8265 by default) where Ray listens for job requests. To learn more, see Quickstart using the Ray Jobs CLI.
    2. Execute a Ray job in the head pod where you exec directly into the head pod and then submit your job. To learn more, see RayCluster Quickstart.

For this example, we use the first method and submit the job through the Ray Jobs interface. We run the command from a local environment where the training code is available in the directory passed to --working-dir. Relative to this path, we specify the main training Python script, fsdp-ray.py. Within the working-dir folder, we can also include additional scripts we might need to run the training.

The fsdp-ray.py example is located in aws-do-ray/Container-Root/ray/raycluster/jobs/fsdp-ray/fsdp-ray.py in the aws-do-ray GitHub repo.

# Within jobs/ folder
ray job submit --address http://localhost:8265 --working-dir "fsdp-ray" -- python3 fsdp-ray.py
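
If you prefer to submit the job programmatically rather than through the CLI, the Ray Jobs SDK exposes the same functionality. The following is a minimal sketch, assuming the dashboard is port-forwarded to localhost:8265 and the training code lives in the fsdp-ray folder:

from ray.job_submission import JobSubmissionClient

# Connect to the Ray cluster through the port-forwarded dashboard
client = JobSubmissionClient("http://localhost:8265")

# Submit the training job; working_dir is uploaded to the cluster
job_id = client.submit_job(
    entrypoint="python3 fsdp-ray.py",
    runtime_env={"working_dir": "./fsdp-ray"},
)

print(f"Submitted job {job_id}")
print(client.get_job_status(job_id))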

For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. This includes the following steps:

  • Configure a model to run distributed and on the correct CPU/GPU device
  • Configure a data loader to shard data across the workers and place data on the correct CPU or GPU device
  • Configure a training function to report metrics and save checkpoints
  • Configure scaling and CPU or GPU resource requirements for a training job
  • Launch a distributed training job with a TorchTrainer class

For further details on how to adjust your existing training script to get the most out of Ray, refer to the Ray documentation.
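
As a brief illustration of these adjustments, the following sketch shows how a training function is typically wrapped for Ray Train. The get_model() and get_dataset() helpers are hypothetical placeholders for your own code:

import torch
from torch.utils.data import DataLoader

import ray.train.torch
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Hypothetical helpers: replace with your own model and dataset code
    model = ray.train.torch.prepare_model(get_model())  # wraps the model in DDP and moves it to the right device
    data_loader = ray.train.torch.prepare_data_loader(  # shards data across workers and places batches on the right device
        DataLoader(get_dataset(), batch_size=config["batch_size"], shuffle=True)
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(config["num_epochs"]):
        for batch, labels in data_loader:
            loss = torch.nn.functional.mse_loss(model(batch), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train.report({"epoch": epoch, "loss": loss.item()})  # report metrics (and optionally checkpoints)


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 2, "batch_size": 32},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # scaling and GPU requirements for the job
)
result = trainer.fit()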

The following diagram illustrates the complete architecture you have built after completing these steps.

Ray on SageMaker HyperPod EKS architecture

Implement training job resiliency with the job auto resume functionality

Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures and how to implement fault tolerance for them, which SageMaker HyperPod complements: Ray Train worker failures and Ray worker node failures.

  • Ray Train worker – This is a worker process specifically used for training tasks within Ray Train, Ray’s distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.
  • Ray worker node – At the Ray level, this is a Ray node in a Ray cluster. It’s part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by a KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we will refer to them as pods.

At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it’s advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.

Ray Train worker failures

Ray Train is designed with fault tolerance to handle worker failures, such as RayActorErrors. When a failure occurs, the affected workers are stopped, and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train’s fault tolerance mechanism and needs to be implemented in your code.

Automatic recovery

When a failure is detected, Ray shuts down the failed workers and provisions new ones. While this happens, we can tell the training function to keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. We can set the number of retries through the max_failures attribute of the FailureConfig, which is set in the RunConfig passed to the Trainer (for example, TorchTrainer). See the following code:

from ray.train import RunConfig, FailureConfig
# Tries to recover a run up to this many times.
run_config = RunConfig(failure_config=FailureConfig(max_failures=2))
# No limit on the number of retries.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))

For more information, see Handling Failures and Node Preemption.

Checkpoints

A checkpoint in Ray Train is a lightweight interface representing a directory stored either locally or remotely. For example, a cloud-based checkpoint might point to s3://my-bucket/checkpoint-dir, and a local checkpoint might point to /tmp/checkpoint-dir. To learn more, see Saving checkpoints during training.

To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like torch.save, pl.Trainer.save_checkpoint, accelerator.save_model, save_pretrained, tf.keras.Model.save, and more. Then you create a checkpoint from the directory using Checkpoint.from_directory. Finally, report the checkpoint to Ray Train using ray.train.report(metrics, checkpoint=...). The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting will upload the checkpoint to persistent storage.

If you save checkpoints with ray.train.report(..., checkpoint=...) and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.

Finally, clean up the local temporary directory to free up disk space (for example, by exiting the tempfile.TemporaryDirectory context). We can save a checkpoint every epoch or every few iterations.

The following diagram illustrates this setup.

Ray Checkpointing Architecture

The following code is an example of saving checkpoints using native PyTorch:

import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

import ray.train.torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    n = 100
    # create a toy dataset
    # data   : X - dim = (n, 4)
    # target : Y - dim = (n, 1)
    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))
    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))
    # toy neural network : 1-layer
    # Wrap the model in DDP
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    criterion = nn.MSELoss()

    optimizer = Adam(model.parameters(), lr=3e-4)
    for epoch in range(config["num_epochs"]):
        y = model.forward(X)
        loss = criterion(y, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        metrics = {"loss": loss.item()}

        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            checkpoint = None

            should_checkpoint = epoch % config.get("checkpoint_freq", 1) == 0
            # In standard DDP training, where the model is the same across all ranks,
            # only the global rank 0 worker needs to save and report the checkpoint
            if train.get_context().get_world_rank() == 0 and should_checkpoint:
                torch.save(
                    model.module.state_dict(),  # NOTE: Unwrap the model.
                    os.path.join(temp_checkpoint_dir, "model.pt"),
                )
                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

            train.report(metrics, checkpoint=checkpoint)


trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

Ray Train also comes with CheckpointConfig, a way to configure checkpointing options:

from ray.train import RunConfig, CheckpointConfig
# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))
# Example 2: Only keep the 2 *best* checkpoints and delete the others.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # *Best* checkpoints are determined by these params:
        checkpoint_score_attribute="mean_accuracy",
        checkpoint_score_order="max",
    ),
    # This will store checkpoints on S3.
    storage_path="s3://remote-bucket/location",
)

To restore the training state from a checkpoint if your training job fails and retries, modify your training loop to auto resume and then restore the Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here’s a quick example:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",  # Can also be a cloud storage path like S3
    datasets=get_datasets(),
)
result = restored_trainer.fit()

To streamline restoration, you can add auto resume logic to your script. This checks if a valid experiment directory exists and restores the trainer if available. If not, it starts a new experiment:

experiment_path = "~/ray_results/dl_restore_autoresume"
if TorchTrainer.can_restore(experiment_path):
    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=train.ScalingConfig(num_workers=2),
        run_config=train.RunConfig(
            storage_path="~/ray_results",
            name="dl_restore_autoresume",
        ),
    )
result = trainer.fit()

To summarize, to provide fault tolerance and auto resume when using Ray Train libraries, set your max_failures parameter in the FailureConfig (we recommend setting it to -1 to make sure it will keep retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code.
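
Putting these pieces together, a minimal sketch of such a configuration might look like the following. The storage path is a placeholder; point it at your FSx for Lustre mount or an S3 bucket, and reuse the checkpoint-enabled train_func shown earlier:

from ray.train import CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

run_config = RunConfig(
    storage_path="/fsx/ray_results",                  # placeholder: shared FSx for Lustre path or an S3 URI
    name="fsdp-training",
    failure_config=FailureConfig(max_failures=-1),    # keep retrying until the faulty node is rebooted or replaced
    checkpoint_config=CheckpointConfig(num_to_keep=2),
)

trainer = TorchTrainer(
    train_func,                                       # the checkpoint-enabled training function shown earlier
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    run_config=run_config,
)
result = trainer.fit()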

Ray worker pod failures

In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the raylet process fails), the running tasks and actors on it will fail and the objects owned by worker processes of this pod will be lost. In this case, the tasks, actors, and objects fault tolerance mechanisms will start and try to recover the failures using other worker pods.

These mechanisms are handled implicitly by the Ray Train library. To learn more about the underlying fault tolerance for tasks, actors, and objects (implemented at the Ray Core level), see Fault Tolerance.

In practice, this means that in case of a worker pod failure, the following occurs:

  • If there is a free worker pod in the Ray cluster, Ray will recover the failed worker pod by replacing it with the free worker pod.
  • If there is no free worker pod, but in the underlying SageMaker HyperPod cluster there are free SageMaker HyperPod nodes, Ray will schedule a new worker pod onto one of the free SageMaker HyperPod nodes. This pod will join the running Ray cluster and the failure will be recovered using this new worker pod.

In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.

However, another critical scenario to consider is hardware failures. If the underlying SageMaker HyperPod node becomes unavailable due to a hardware issue, such as a GPU error, it would inevitably cause the Ray worker pod running on that node to fail as well. Now the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster start and will reboot or replace the faulty node. After the new healthy node is added into the SageMaker HyperPod cluster, Ray will schedule a new worker pod onto the SageMaker HyperPod node and recover the interrupted training. In this case, both the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly and make sure that even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.

As you have seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms will essentially recover by restarting the training job, it is crucial that checkpointing is implemented in the training script. It is also generally advised to save the checkpoints on a shared and persistent path, such as an Amazon Simple Storage Service (Amazon S3) bucket or FSx for Lustre file system.

Clean up

To delete your SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.

If you used the CloudFormation stack to create resources, you can delete it using the following command:

aws cloudformation delete-stack --stack-name <stack_name>

Conclusion

This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.

Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray’s distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, specifically training workloads as covered in this post. As AI/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.

To get started with SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. To learn more about the aws-do-ray framework, refer to the GitHub repo.


About the Authors

Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the Automotive and Manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.

Florian Stahl is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in Artificial Intelligence, Machine Learning, and Generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian focuses on working with customers in the Autonomous Vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges.


Using Large Language Models on Amazon Bedrock for multi-step task execution

Using Large Language Models on Amazon Bedrock for multi-step task execution

The goal of this blog post is to show you how a large language model (LLM) can be used to perform tasks that require multi-step dynamic reasoning and execution. Examples of tasks that require dynamic reasoning and execution are answering questions of the form “What is the average length of stay for patients with [specific condition] across different hospitals?” or “How do prescription trends for [specific drug] vary across different regions?” Traditionally, answering these queries required the expertise of business intelligence specialists and data engineers, often resulting in time-consuming processes and potential bottlenecks.

However, the advances in LLMs have opened up exciting new possibilities to break down complex tasks into a series of steps, use tools to complete each step, and produce a final solution.

In this context, the term tools refers to external capabilities or APIs that the model can access and interact with to extend its functionality beyond text-based responses. These tools allow LLMs to perform specialized tasks such as retrieving real-time information, running code, browsing the web, or generating images. By using tools, LLMs can offer more accurate, context-aware, and actionable outputs, enabling them to effectively assist with complex queries that require access to data or functions outside their internal knowledge base.

In this post, we showcase an example of such capabilities for a patient record retrieval solution built on top of an existing dataset by relying on APIs only instead of text-to-SQL. You can find the code for this post on GitHub.

Overview of solution

The goal of the solution is to accurately answer analytical questions that require multi-step reasoning and execution. The following is an example of an expected interaction with the agent.

User: Give me the first and last name of the patient with least vaccines and their vaccine count.

AI: Based on the results of our analysis, I can answer your question: The patient with the least number of vaccines is Sharleen176 Kulas532, and their vaccine count is 1.

Here’s a breakdown of how we arrived at this answer:

1. We first retrieved patients and immunization records.

2. We then grouped the immunization records by patient_id and counted the number of vaccines for each patient.

3. We ordered this list by the vaccine count in ascending order.

4. We limited the result to the top 1 (the patient with the least vaccines).

5. We joined this result with the patient information to get the first and last name.

6. Finally, we selected only the relevant information (first name, last name, and vaccine count).

The result shows that Sharleen176 Kulas532 has received only 1 vaccine, which is the lowest among patients in the database.

For this solution, we use the Synthetic Patient Generation dataset, which contains multiple tables of healthcare industry related synthetic data.

You can set up the dataset at the project’s root folder using the following steps:

curl https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_csv_apr2020.zip > dataset.zip

unzip dataset.zip

mv csv dataset

The solution is made from two core steps: plan and execute. In its simplest form, it can be represented by the following diagram.


Fig 1: Simple execution flow – solution overview

In a more complex scheme, you can add multiple layers of validation and provide relevant APIs to increase the success rate of the LLM.

Fig 2: Complex execution flow – solution overview

Plan

In the Plan stage, the LLM is given a set of predefined API function signatures along with a brief description of what each function does. These function signatures act as tools that the LLM can use to formulate a plan to answer a user’s query. The goal is to have the LLM reason through the steps required to arrive at the answer, much like a human would.

Why the plan stage is important

The Plan stage is critical because it allows the LLM to create a structured, logical sequence of actions that will be executed in the next stage. By planning, the LLM can break down a complex question into manageable steps, making sure that the right APIs are called in the correct order. This structured approach helps to minimize errors and increases the likelihood of producing accurate results.

Providing function signatures

In this stage, the LLM is given a set of function signatures that represent the tools it can use. Each function signature includes the name of the function, the parameters it accepts, and the type of value it returns. Here’s an example of a few function signatures:

  • def get_patients() -> List[Patient]:
    Retrieves a list of patients from the dataset.
  • def get_immunization() -> List[Immunization]:
    Retrieves a list of immunization records from the dataset.
  • def filter(list: List[object], keys: List[str], values: List[str]) -> List[object]:
    Filters a given list based on specified keys and values.
  • def join(a: List, b: List, left_key: str, right_key: str, how: JoinMode) -> List:
    Joins two lists based on matching keys, using a specified join mode (e.g., INNER, LEFT, RIGHT).

These function signatures act as building blocks for the LLM to generate a plan. The LLM must choose the appropriate functions and sequence them in a logical order to achieve the desired outcome.
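
As a concrete, simplified illustration, these retrieval tools could be implemented as thin wrappers over the CSV files downloaded earlier. The following sketch assumes the dataset folder created above and standard Synthea file names:

from pathlib import Path
from typing import Dict, List

import pandas as pd

DATASET_DIR = Path("dataset")  # created by the curl/unzip/mv commands above


def get_patients() -> List[Dict]:
    """Retrieve patient records from the dataset as a list of dictionaries."""
    return pd.read_csv(DATASET_DIR / "patients.csv").to_dict(orient="records")


def get_immunization() -> List[Dict]:
    """Retrieve immunization records from the dataset as a list of dictionaries."""
    return pd.read_csv(DATASET_DIR / "immunizations.csv").to_dict(orient="records")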

Retrieval Augmented Generation (RAG) improves the selection process by narrowing down the tools an LLM sees based on the task, simplifying the prompt. In a project with many tools, RAG makes sure that only the most relevant tools are surfaced for a given query, reducing complexity and helping the LLM make more accurate decisions. This focused exposure enhances performance by preventing the model from being overwhelmed by irrelevant options.
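
To show how these signatures can reach the model, the following is a minimal sketch of registering a plan-generation tool with the Amazon Bedrock Converse API; the model ID and schema details are illustrative, not the exact ones used in the solution:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Illustrative tool specification; the full solution describes each API function in the schema
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "execute_plan",
                "description": "Execute a multi-step plan composed of the available API functions.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "plans": {
                                "type": "array",
                                "items": {"type": "object"},
                                "description": "Ordered function calls with parameters and evidence numbers.",
                            }
                        },
                        "required": ["plans"],
                    }
                },
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model ID
    messages=[{"role": "user", "content": [{"text": "Which patient has the fewest vaccines?"}]}],
    toolConfig=tool_config,
)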

Generating a plan

After the function signatures are provided, the LLM is prompted to create a plan. The plan typically consists of a series of steps, each represented as a JSON object. Each step indicates a function that needs to be executed, the parameters that need to be passed, and the expected outcome (often referred to as evidence).

For example, if the task is to find the patient with the least number of vaccines, the LLM might generate a plan that includes the following steps:

  1. Retrieve patients: Use the get_patients() function to get a list of patients.
  2. Retrieve immunization records: Use the get_immunization() function to get a list of immunizations.
  3. Group by patient: Use the group_by() function to group the immunizations by patient_id, counting the number of vaccines for each patient.
  4. Order by count: Use the order_by() function to sort the grouped list in ascending order based on the vaccine count.
  5. Limit the result: Use the limit() function to select the patient with the least vaccines.
  6. Join with patient data: Use the join() function to match the selected result with the patient’s information.
  7. Select relevant fields: Use the select() function to extract only the necessary fields, such as the patient’s first name, last name, and vaccine count.

JSON representation

The LLM outputs this plan as structured JSON, which makes it straightforward to parse and execute in the next stage. The JSON format helps ensure that the plan is clear, unambiguous, and ready for programmatic execution.

The following is an example of what the JSON might look like:

{
    "role": "assistant",
    "content": [
        {
            "toolUse": {
                "toolUseId": "tooluse_example_id",
                "name": "execute_plan",
                "input": {
                    "plans": [
                        {
                            "function_name": "get_patients",
                            "parameters": [],
                            "evidence_number": 1
                        },
                        {
                            "function_name": "get_immunization",
                            "parameters": [],
                            "evidence_number": 2
                        },
                        {
                            "function_name": "group_by",
                            "parameters": [
                                "list",
                                "group_key",
                                "aggregation_key",
                                "aggregation"
                            ],
                            "parameter_values": [
                                "#E2",
                                "patient_id",
                                null,
                                "COUNT"
                            ],
                            "evidence_number": 3
                        },
                        {
                            "function_name": "order_by",
                            "parameters": [
                                "list",
                                "key",
                                "value"
                            ],
                            "parameter_values": [
                                "#E3",
                                "count",
                                "ASCENDING"
                            ],
                            "evidence_number": 4
                        },
                        {
                            "function_name": "limit",
                            "parameters": [
                                "list",
                                "k"
                            ],
                            "parameter_values": [
                                "#E4",
                                1
                            ],
                            "evidence_number": 5
                        },
                        {
                            "function_name": "join",
                            "parameters": [
                                "a",
                                "b",
                                "left_key",
                                "right_key",
                                "how"
                            ],
                            "parameter_values": [
                                "#E5",
                                "#E1",
                                "patient_id",
                                "id",
                                "INNER"
                            ],
                            "evidence_number": 6
                        },
                        {
                            "function_name": "select",
                            "parameters": [
                                "list",
                                "keys"
                            ],
                            "parameter_values": [
                                "#E6",
                                [
                                    "first",
                                    "last",
                                    "count"
                                ]
                            ],
                            "evidence_number": 7
                        }
                    ]
                }
            }
        }
    ]
}

Execute

In the Execute stage, the structured plan generated by the LLM in the previous step is programmatically carried out to produce the final output. The JSON blueprint from the planning stage is parsed, and each function call described in the plan is executed sequentially.

The process begins with data retrieval, such as accessing patient records or immunization data, using predefined API functions such as get_patients() or get_immunization(). These initial function calls generate intermediate results, which are stored as evidence and referenced in subsequent steps.

The plan typically involves a series of data transformation functions, such as group_by() to aggregate data, filter() to refine results, and order_by() to sort data. Each function is executed with the specific parameters outlined in the JSON plan, progressively refining the data to answer the query.

As each function is executed, its output is passed to the subsequent function in the sequence. This chain of function calls culminates in a final step, often involving a select() function to extract the most relevant information, such as a patient’s name and vaccine count.

Error handling in the Execute stage is crucial for the reliability and robustness of the entire process. As the LLM’s plan is executed, various issues can arise, including empty datasets, invalid parameters, or mismatched data types during function calls such as join() or filter(). To address these potential challenges, the system incorporates error-checking mechanisms at each step, enabling it to detect and respond to anomalies efficiently. If a function returns an unexpected result or encounters an issue, the system can provide the error back to the LLM itself, enabling it to regenerate the plan with necessary adjustments. This approach not only mitigates execution failures but also enhances the overall user experience by delivering accurate and reliable results, even in the face of unexpected challenges.
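
The following is a minimal sketch of such an executor loop; the plan format mirrors the JSON shown earlier, while the function registry and error feedback are simplified for illustration:

from typing import Any, Callable, Dict, List


def execute_plan(plans: List[Dict[str, Any]], registry: Dict[str, Callable]) -> Any:
    """Run each step of the LLM-generated plan, resolving #E<n> evidence references."""
    evidence: Dict[int, Any] = {}
    for step in plans:
        func = registry[step["function_name"]]
        args = []
        for value in step.get("parameter_values", []):
            # Replace evidence placeholders such as "#E3" with results from earlier steps
            if isinstance(value, str) and value.startswith("#E"):
                args.append(evidence[int(value[2:])])
            else:
                args.append(value)
        try:
            result = func(*args)
        except Exception as exc:
            # In the full solution, this error is returned to the LLM so it can regenerate the plan
            raise RuntimeError(f"Step '{step['function_name']}' failed: {exc}") from exc
        evidence[step["evidence_number"]] = result
    # The output of the final step is the answer to the user's query
    return evidence[plans[-1]["evidence_number"]]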

Summary

This post explores the application of LLMs in executing complex analytical queries through an API, with specific focus on Amazon Bedrock. Traditionally, business users rely on data professionals to retrieve and present data, but LLMs can now offer a streamlined approach enabling direct query responses by using predefined API tools. To illustrate this capability, we use the Synthetic Patient Generation dataset and present a solution structured around two primary phases: Plan and Execution.

In the Plan stage, the LLM is provided with API function signatures, which it uses to generate a structured, logical sequence of steps to answer the query. This plan is output as a JSON, providing clarity and facilitating seamless execution. In the Execute stage, the system programmatically carries out the plan by sequentially executing each function call. Robust error-handling mechanisms are integrated to identify potential issues and, if necessary, relay errors back to the LLM for plan regeneration.

To demonstrate this process, we present a use case where the system identifies the patient with the least number of vaccines by retrieving, grouping, and sorting data, and ultimately presenting the final result. This example showcases the LLM’s ability to extend beyond mere text-based responses, providing actionable and context-aware outputs that can significantly enhance business decision-making processes.

Conclusion

This article highlights the efficacy of LLMs in expanding their functionality to deliver practical, data-driven solutions that have the potential to revolutionize business analytics and decision-making workflows.


About the Authors

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Mohammad Arbabshirani, PhD, is a Sr. Data Science Manager at AWS Professional Services. He specializes in helping customers accelerate business outcomes on AWS through the application of machine learning and generative AI. He has 12 years of experience in full life cycle of machine learning, computer vision, and data science from sales support to end-to-end solution delivery specially in healthcare and life sciences vertical. Currently, Mohammad leads a team of data scientists, machine learning engineers, and data architects, focusing on delivery of cutting-edge ML solutions for customers. His background includes extensive research in neuroimaging and medical imaging. Outside of his professional endeavors, Mohammad enjoys tennis, soccer, and instrumental music.


Introducing AWS MCP Servers for code assistants (Part 1)

Introducing AWS MCP Servers for code assistants (Part 1)

We’re excited to announce the open source release of AWS MCP Servers for code assistants — a suite of specialized Model Context Protocol (MCP) servers that bring Amazon Web Services (AWS) best practices directly to your development workflow. Our specialized AWS MCP servers combine deep AWS knowledge with agentic AI capabilities to accelerate development across key areas. Each AWS MCP Server focuses on a specific domain of AWS best practices, working together to provide comprehensive guidance throughout your development journey.

This post is the first in a series covering AWS MCP Servers. In this post, we walk through how these specialized MCP servers can dramatically reduce your development time while incorporating security controls, cost optimizations, and AWS Well-Architected best practices into your code. Whether you’re an experienced AWS developer or just getting started with cloud development, you’ll discover how to use AI-powered coding assistants to tackle common challenges such as complex service configurations, infrastructure as code (IaC) implementation, and knowledge base integration. By the end of this post, you’ll understand how to start using AWS MCP Servers to transform your development workflow and deliver better solutions, faster.

If you want to get started right away, skip ahead to the section “From concept to working code in minutes.”

AI is transforming how we build software, creating opportunities to dramatically accelerate development while improving code quality and consistency. Today’s AI assistants can understand complex requirements, generate production-ready code, and help developers navigate technical challenges in real time. This AI-driven approach is particularly valuable in cloud development, where developers need to orchestrate multiple services while maintaining security, scalability, and cost-efficiency.

Developers need code assistants that understand the nuances of AWS services and best practices. Specialized AI agents can address these needs by:

  • Providing contextual guidance on AWS service selection and configuration
  • Optimizing compliance with security best practices and regulatory requirements
  • Promoting efficient resource utilization and cost-effective solutions
  • Automating repetitive implementation tasks with AWS specific patterns

This approach means developers can focus on innovation while AI assistants handle the undifferentiated heavy lifting of coding. Whether you’re using Amazon Q, Amazon Bedrock, or other AI tools in your workflow, AWS MCP Servers complement and enhance these capabilities with deep AWS specific knowledge to help you build better solutions faster.

Model Context Protocol (MCP) is a standardized open protocol that enables seamless interaction between large language models (LLMs), data sources, and tools. This protocol allows AI assistants to use specialized tooling and to access domain-specific knowledge by extending the model’s capabilities beyond its built-in knowledge—all while keeping sensitive data local. Through MCP, general-purpose LLMs can now seamlessly access relevant knowledge beyond initial training data and be effectively steered towards desired outputs by incorporating specific context and best practices.

Accelerate building on AWS

What if your AI assistant could instantly access deep AWS knowledge, understanding every AWS service, best practice, and architectural pattern? With MCP, we can transform general-purpose LLMs into AWS specialists by connecting them to specialized knowledge servers. This opens up exciting new possibilities for accelerating cloud development while maintaining security and following best practices.

Build on AWS in a fraction of the time, with best practices automatically applied from the first line of code. Skip hours of documentation research and immediately access ready-to-use patterns for complex services such as Amazon Bedrock Knowledge Bases. Our MCP Servers will help you write well-architected code from the start, implement AWS services correctly the first time, and deploy solutions that are secure, observable, and cost-optimized by design. Transform how you build on AWS today.

  • Enforce AWS best practices automatically – Write well-architected code from the start with built-in security controls, proper observability, and optimized resource configurations
  • Cut research time dramatically – Stop spending hours reading documentation. Our MCP Servers provide contextually relevant guidance for implementing AWS services correctly, addressing common pitfalls automatically
  • Access ready-to-use patterns instantly – Use pre-built AWS CDK constructs, Amazon Bedrock Agents schema generators, and Amazon Bedrock Knowledge Bases integration templates that follow AWS best practices from the start
  • Optimize cost proactively – Prevent over-provisioning as you design your solution by getting cost-optimization recommendations and generating a comprehensive cost report to analyze your AWS spending before deployment

To turn this vision into reality and make AWS development faster, more secure, and more efficient, we’ve created AWS MCP Servers—a suite of specialized AWS MCP Servers that bring AWS best practices directly to your development workflow. Our specialized AWS MCP Servers combine deep AWS knowledge with AI capabilities to accelerate development across key areas. Each AWS MCP Server focuses on a specific domain of AWS best practices, working together to provide comprehensive guidance throughout your development journey.

Overview of domain-specific MCP Servers for AWS development

Our specialized MCP Servers are designed to cover distinct aspects of AWS development, each bringing deep knowledge to specific domains while working in concert to deliver comprehensive solutions:

  • Core – The foundation server that provides AI processing pipeline capabilities and serves as a central coordinator. It helps provide clear plans for building AWS solutions and can federate to other MCP servers as needed.
  • AWS Cloud Development Kit (AWS CDK) – Delivers AWS CDK knowledge with tools for implementing best practices, security configurations with cdk-nag, Powertools for AWS Lambda integration, and specialized constructs for generative AI services. It makes sure infrastructure as code (IaC) follows AWS Well-Architected principles from the start.
  • Amazon Bedrock Knowledge Bases – Enables seamless access to Amazon Bedrock Knowledge Bases so developers can query enterprise knowledge with natural language, filter results by data source, and use reranking for improved relevance.
  • Amazon Nova Canvas – Provides image generation capabilities using Amazon Nova Canvas through Amazon Bedrock, enabling the creation of visuals from text prompts and color palettes—perfect for mockups, diagrams, and UI design concepts.
  • Cost – Analyzes AWS service costs and generates comprehensive cost reports, helping developers understand the financial implications of their architectural decisions and optimize for cost-efficiency.

Prerequisites

To complete the solution, you need to have the following prerequisites in place:

  • uv package manager
  • Install Python using uv python install 3.13
  • AWS credentials with appropriate permissions
  • An MCP-compatible LLM client (such as Anthropic’s Claude for Desktop, Cline, Amazon Q CLI, or Cursor)

From concept to working code in minutes

You can download the AWS MCP Servers on GitHub or through the PyPI package manager. Here’s how to get started using your favorite code assistant with MCP support.

To set up the AWS MCP Servers, add the following configuration to your MCP client settings file:

{
  "mcpServers": {
    "awslabs.core-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.core-mcp-server@latest"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR",
        "MCP_SETTINGS_PATH": "path to your mcp server settings"
      },
      "autoApprove": [],
      "disabled": false
    },
    "awslabs.bedrock-kb-retrieval-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.bedrock-kb-retrieval-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "us-east-1"
      }
    },
    "awslabs.cdk-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cdk-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    },
    "awslabs.cost-analysis-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.cost-analysis-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    },
    "awslabs.nova-canvas-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.nova-canvas-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "your-aws-profile",
        "AWS_REGION": "us-east-1"
      }
    }
  }
}

AWS MCP Servers in action

Here’s how AWS MCP servers transform the development experience:

Developer: “I need to build an AI-powered chatbot using Amazon Bedrock that can answer questions from our company’s knowledge base. I also want to add a tool for the chatbot to call our internal API.”

Core: “I’ll help you build an Amazon Bedrock Knowledge Bases chatbot with API integration. Let’s create an architecture that uses Amazon Bedrock Agents with a custom action group to call your internal API.”

Core generates a comprehensive architecture diagram showing the knowledge base integration, Amazon Bedrock Agents configuration with action groups, API connectivity, and data flow between components.

AWS CDK: “Here’s the infrastructure code for your chatbot with the Amazon Bedrock Agents action group. I’ve included proper IAM roles, security controls, and Lambda Powertools for observability.”

// CDK code with Well-Architected best practices
 const knowledgeBase = new BedrockKnowledgeBase(this, 'CompanyKB', {
   embeddingModel: BedrockFoundationModel.TITAN_EMBED_TEXT_V1,
   vectorStore: new OpenSearchServerlessVectorStore(this, 'VectorStore', {
     encryption: OpenSearchEncryption.KMS,
     ebs: OpenSearchEbsOptions.provisioned(100, OpenSearchVolumeType.GP3)
   })
 });

 // Action Group for API integration
 const actionGroupSchema = new BedrockAgentSchema(this, 'ApiActionSchema', {
   schemaPath: 'schemas/api-action-schema.json',
   // Schema will be auto-generated from your Lambda code
   generateFromLambda: {
     lambdaPath: 'lambda/api-actions.py',
     outputPath: 'schemas/api-action-schema.json'
   }
 });

 const apiActionGroup = new BedrockAgentActionGroup(this, 'ApiActionGroup', {
   actionGroupName: 'InternalApiActions',
   description: 'Actions to interact with internal API',
   actionGroupExecutor: new BedrockAgentLambdaExecutor(apiActionsLambda),
   actionGroupSchema: actionGroupSchema
 });

The CDK MCP Server generates complete AWS CDK code to deploy the entire solution. It automatically runs cdk-nag to identify potential security issues and provides remediation steps for each finding, making sure that the infrastructure follows AWS Well-Architected best practices.

Amazon Bedrock Knowledge Bases retrieval: “I’ve configured the optimal settings for your knowledge base queries, including proper reranking for improved relevance.”

Amazon Bedrock Knowledge Bases MCP Server demonstrates how to structure queries to the knowledge base for maximum relevance, provides sample code for filtering by data source, and shows how to integrate the knowledge base responses with the chatbot interface.

Amazon Nova Canvas: “To enhance your chatbot’s capabilities, I’ve created visualizations that can be generated on demand when users request data explanations.”

Amazon Nova Canvas MCP server generates sample images showing how Amazon Nova Canvas can create charts, diagrams, and visual explanations based on knowledge base content, making complex information more accessible to users.

Cost Analysis: “Based on your expected usage patterns, here’s the estimated monthly cost breakdown and optimization recommendations.”

The Cost Analysis MCP Server generates a detailed cost analysis report showing projected expenses for each AWS service, identifies cost optimization opportunities such as reserved capacity for Amazon Bedrock, and provides specific recommendations to reduce costs without impacting performance.

With AWS MCP Servers, what would typically take days of research and implementation is completed in minutes, with better quality, security, and cost-efficiency than manual development in that same time.

Best practices for MCP-assisted development

To maximize the benefits of MCP assisted development while maintaining security and code quality, developers should follow these essential guidelines:

  • Always review generated code for security implications before deployment
  • Use MCP Servers as accelerators, not replacements for developer judgment and expertise
  • Keep MCP Servers updated with the latest AWS security best practices
  • Follow the principle of least privilege when configuring AWS credentials
  • Run security scanning tools on generated infrastructure code

Coming up in the series

This post introduced the foundations of AWS MCP Servers and how they accelerate AWS development through specialized, AWS specific MCP Servers. In upcoming posts, we’ll dive deeper into:

  • Detailed walkthroughs of each MCP server’s capabilities
  • Advanced patterns for integrating AWS MCP Servers into your development workflow
  • Real-world case studies showing AWS MCP Servers’ impact on development velocity
  • How to extend AWS MCP Servers with your own custom MCP servers

Stay tuned to learn how AWS MCP Servers can transform your specific AWS development scenarios and help you build better solutions faster. Visit our GitHub repository or the PyPI package manager to explore example implementations and get started today.


About the Authors

Jimin Kim is a Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team, based in Los Angeles. With specialties in Generative AI and SaaS, she loves helping her customers succeed in their business. Outside of work, she cherishes moments with her wife and three adorable calico cats.

Pranjali Bhandari is part of the Prototyping and Cloud Engineering (PACE) team at AWS, based in the San Francisco Bay Area. She specializes in Generative AI, distributed systems, and cloud computing. Outside of work, she loves exploring diverse hiking trails, biking, and enjoying quality family time with her husband and son.

Laith Al-Saadoon is a Principal Prototyping Architect on the Prototyping and Cloud Engineering (PACE) team. He builds prototypes and solutions using generative AI, machine learning, data analytics, IoT & edge computing, and full-stack development to solve real-world customer challenges. In his personal time, Laith enjoys the outdoors–fishing, photography, drone flights, and hiking.

Paul Vincent is a Principal Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team. He works with AWS customers to bring their innovative ideas to life. Outside of work, he loves playing drums and piano, talking with others through Ham radio, all things home automation, and movie nights with the family.

Justin Lewis leads the Emerging Technology Accelerator at AWS. Justin and his team help customers build with emerging technologies like generative AI by providing open source software examples to inspire their own innovation. He lives in the San Francisco Bay Area with his wife and son.

Anita Lewis is a Technical Program Manager on the AWS Emerging Technology Accelerator team, based in Denver, CO. She specializes in helping customers accelerate their innovation journey with generative AI and emerging technologies. Outside of work, she enjoys competitive pickleball matches, perfecting her golf game, and discovering new travel destinations.
