Responsible AI in action: How Data Reply red teaming supports generative AI safety on AWS

Responsible AI in action: How Data Reply red teaming supports generative AI safety on AWS

Generative AI is rapidly reshaping industries worldwide, empowering businesses to deliver exceptional customer experiences, streamline processes, and push innovation at an unprecedented scale. However, amidst the excitement, critical questions around the responsible use and implementation of such powerful technology have started to emerge.

Although responsible AI has been a key focus for the industry over the past decade, the increasing complexity of generative AI models brings unique challenges. Risks such as hallucinations, controllability, intellectual property breaches, and unintended harmful behaviors are real concerns that must be addressed proactively.

To harness the full potential of generative AI while reducing these risks, it’s essential to adopt mitigation techniques and controls as an integral part of the build process. Red teaming, an adversarial exploit simulation of a system used to identify vulnerabilities that might be exploited by a bad actor, is a crucial component of this effort.

At Data Reply and AWS, we are committed to helping organizations embrace the transformative opportunities generative AI presents, while fostering the safe, responsible, and trustworthy development of AI systems.

In this post, we explore how AWS services can be seamlessly integrated with open source tools to help establish a robust red teaming mechanism within your organization. Specifically, we discuss Data Reply’s red teaming solution, a comprehensive blueprint to enhance AI safety and responsible AI practices.

Understanding generative AI’s security challenges

Generative AI systems, though transformative, introduce unique security challenges that require specialized approaches to address them. These challenges manifest in two key ways: through inherent model vulnerabilities and adversarial threats.

The inherent vulnerabilities of these models include their potential of producing hallucinated responses (generating plausible but false information), their risk of generating inappropriate or harmful content, and their potential for unintended disclosure of sensitive training data.

These potential vulnerabilities could be exploited by adversaries through various threat vectors. Bad actors might employ techniques such as prompt injection to trick models into bypassing safety controls, intentionally altering training data to compromise model behavior, or systematically probing models to extract sensitive information embedded in their training data. For both types of vulnerabilities, red teaming is a useful mechanism to mitigate those challenges because it can help identify and measure inherent vulnerabilities through systematic testing, while also simulating real-world adversarial exploits to uncover potential exploitation paths.

What is red teaming?

Red teaming is a methodology used to test and evaluate systems by simulating real-world adversarial conditions. In the context of generative AI, it involves rigorously stress-testing models to identify weaknesses, evaluate resilience, and mitigate risks. This practice helps develop AI systems that are functional, safe, and trustworthy. By adopting red teaming as part of the AI development lifecycle, organizations can anticipate threats, implement robust safeguards, and promote trust in their AI solutions.

Red teaming is critical for uncovering vulnerabilities before they are exploited. Data Reply has partnered with AWS to offer support and best practices to help integrate responsible AI and red teaming into your workflows, helping you build secure AI models. This unlocks the following benefits:

  • Mitigating unexpected risks – Generative AI systems can inadvertently produce harmful outputs, such as biased content or factually inaccurate information. With red teaming, Data Reply helps organizations test models for these weaknesses and identify vulnerabilities to adversarial exploitation, such as prompt injections or data poisoning.
  • Compliance with AI regulation – As global regulations around AI continue to evolve, red teaming can help organizations by setting up mechanisms to systematically test their applications and make them more resilient, or serve as a tool to adhere to transparency and accountability requirements. Additionally, it maintains detailed audit trails and documentation of testing activities, which are critical artifacts that can be used as evidence for demonstrating compliance with standards and responding to regulatory inquiries.
  • Reducing data leakage and malicious use – Although generative AI has the potential to be a force for good, models might also be exploited by adversaries looking to extract sensitive information or perform harmful actions. For instance, adversaries might craft prompts to extract private data from training sets or generate phishing emails and malicious code. Red teaming simulates such adversarial scenarios to identify vulnerabilities, enabling safeguards like prompt filtering, access controls, and output moderation.

The following chart outlines some of the common challenges in generative AI systems where red teaming can serve as a mitigation strategy.

Risk Categories for Generative AI

Before diving into specific threats, it’s important to acknowledge the value of having a systematic approach to AI security risk assessment for organizations deploying AI solutions. As an example, the OWASP Top 10 for LLMs can serve as a comprehensive framework for identifying and addressing critical AI vulnerabilities. This industry-standard framework categorizes key threats, including prompt injection, where malicious inputs manipulate model outputs; training data poisoning, which can compromise model integrity; and unauthorized disclosure of sensitive information embedded in model responses. It also addresses emerging risks such as insecure output handling and denial of service (DOS) that could disrupt AI operations. By using such frameworks alongside practical security testing approaches like red teaming exercises, organizations can implement targeted controls and monitoring to make sure their AI models remain secure, resilient, and align with regulatory requirements and responsible AI principles.

How Data Reply uses AWS services for responsible AI

Fairness is an essential component of responsible AI and, as such, part of the AWS core dimensions of responsible AI. To address potential fairness concerns, it can be helpful to evaluate disparities and imbalances in training data or outcomes. Amazon SageMaker Clarify helps identify potential biases during data preparation without requiring code. For example, you can specify input features such as gender or age, and SageMaker Clarify will run an analysis job to detect imbalances in those features. It generates a detailed visual report with metrics and measurements of potential bias, helping organizations understand and address imbalances.

During red teaming, SageMaker Clarify plays a key role by analyzing whether the model’s predictions and outputs treat all demographic groups equitably. If imbalances are identified, tools like Amazon SageMaker Data Wrangler can rebalance datasets using methods such as random undersampling, random oversampling, or Synthetic Minority Oversampling Technique (SMOTE). This supports the model’s fair and inclusive operation, even under adversarial testing conditions.

Veracity and robustness represent another critical dimension for responsible AI deployments. Tools like Amazon Bedrock provide comprehensive evaluation capabilities that enable organizations to assess model security and robustness through automated evaluation. These include specialized tasks such as question-answering assessments with adversarial inputs designed to probe model limitations. For instance, Amazon Bedrock can help you test model behavior across edge case scenarios by analyzing responses to carefully crafted inputs—from ambiguous queries to potentially misleading prompts—to evaluate if the models maintain reliability and accuracy even under challenging conditions.

Privacy and security go hand in hand when implementing responsible AI. Security at Amazon is “job zero” for all employees. Our strong security culture is reinforced from the top down with deep executive engagement and commitment, and from the bottom up with training, mentoring, and strong “see something, say something” as well as “when in doubt, escalate” and “no blame” principles. As an example of this commitment, Amazon Bedrock Guardrails provide organizations with a tool to incorporate robust content filtering mechanisms and protective measures against sensitive information disclosure.

Transparency is another best practice prescribed by industry standards, frameworks, and regulations, and is essential for building user trust in making informed decisions. LangFuse, an open source tool, plays a key role in providing transparency by keeping an audit trail of model decisions. This audit trail offers a way to trace model actions, helping organizations demonstrate accountability and adhere to evolving regulations.

Solution overview

To achieve the goals mentioned in the previous section, Data Reply has developed the Red Teaming Playground, a testing environment that combines several open source tools—like Giskard, LangFuse, and AWS FMEval—to assess the vulnerabilities of AI models. This playground allows AI builders to explore scenarios, perform white hat hacking, and evaluate how models react under adversarial conditions. The following diagram illustrates the solution architecture.

Red Teaming Solution Architecture Diagram

This playground is designed to help you responsibly develop and evaluate your generative AI systems, combining a robust multi-layered approach for authentication, user interaction, model management, and evaluation.

At the outset, the Identity Management Layer handles secure authentication, using Amazon Cognito and integration with external identity providers to help secure authorized access. Post-authentication, users access the UI Layer, a gateway to the Red Teaming Playground built on AWS Amplify and React. This UI directs traffic through an Application Load Balancer (ALB), facilitating seamless user interactions and allowing red team members to explore, interact, and stress-test models in real time. For knowledge retrieval, we use Amazon Bedrock Knowledge Bases, which integrates with Amazon Simple Storage Service (Amazon S3) for document storage, and Amazon OpenSearch Serverless for rapid and scalable search capabilities.

Central to this solution is the Foundation Model Management Layer, responsible for defining model policies and managing their deployment, using Amazon Bedrock Guardrails for safety, Amazon SageMaker services for model evaluation, and a vendor model registry comprising a range of foundation model (FM) options, including other vendor models, supporting model flexibility.

After the models are deployed, they go through online and offline evaluations to validate robustness.

Online evaluation uses AWS AppSync for WebSocket streaming to assess models in real time under adversarial conditions. A dedicated red teaming squad (authorized white hat testers) conducts evaluations focused on OWASP Top 10 for LLMs vulnerabilities, such as prompt injection, model theft, and attempts to alter model behavior. Online evaluation provides an interactive environment where human testers can pivot and respond dynamically to model answers, increasing the chances of identifying vulnerabilities or successfully jailbreaking the model.

Offline evaluation conducts a deeper analysis through services like SageMaker Clarify to check for biases and Amazon Comprehend to detect harmful content. The memory database captures interaction data, such as historical user prompts and model responses. LangFuse plays a vital role in maintaining an audit trail of model activities, allowing each model decision to be tracked for observability, accountability, and compliance. The offline evaluation pipeline uses tools like Giskard to detect performance, bias, and security issues in AI systems. It employs LLM-as-a-judge, where a large language model (LLM) evaluates AI responses for correctness, relevance, and adherence to responsible AI guidelines. Models are tested through offline evaluations first; if successful, they progress through online evaluation and ultimately move into the model registry.

The Red Teaming Playground is a dynamic environment designed to simulate scenarios and rigorously test models for vulnerabilities. Through a dedicated UI, the red team interacts with the model using a Q&A AI assistant (for instance, a Streamlit application), enabling real-time stress testing and evaluation. Team members can provide detailed feedback on model performance and log any issues or vulnerabilities encountered. This feedback is systematically integrated into the red teaming process, fostering continuous improvements and enhancing the model’s robustness and security.

Use case example: Mental health triage AI assistant

Imagine deploying a mental health triage AI assistant—an application that demands extra caution around sensitive topics like dosage information, health records, or judgement call questions. By defining a clear use case and establishing quality expectations, you can guide the model on when to answer, deflect, or provide a safe response:

  • Answer – When the bot is confident that the question is within its domain and is able to retrieve a relevant response, it can provide a direct answer. For example, if asked “What are some common symptoms of anxiety?”, the bot can respond: “Common symptoms of anxiety include restlessness, fatigue, difficulty concentrating, and excessive worry. If you’re experiencing these, consider speaking to a healthcare professional.”
  • Deflect – For questions outside the bot’s scope or purpose, the bot should deflect responsibility and guide the user toward appropriate human support. For instance, if asked “Why does life feel meaningless?”, the bot might reply: “It sounds like you’re going through a tough time. Would you like me to connect you to someone who can help?” This makes sure sensitive topics are handled carefully and responsibly.
  • Safe response – When the question requires human validation or advice that the bot can’t provide, it should offer generalized, neutral suggestions to minimize risks. For example, in response to “How can I stop feeling anxious all the time?”, the bot might say: “Some people find practices like meditation, exercise, or journaling helpful, but I recommend consulting a healthcare provider for advice tailored to your needs.”

Red teaming results help refine model outputs by identifying risks and vulnerabilities. For example, consider a medical AI assistant developed by the fictional company AnyComp. By subjecting this assistant to a red teaming exercise, AnyComp can detect potential risks, such as the assistant generating unsolicited medical advice before deployment. With this insight, AnyComp can refine the assistant to either deflect such queries or provide a safe, appropriate response.

This structured approach—answer, deflect, and safe response—provides a comprehensive strategy for managing various types of questions and scenarios effectively. By clearly defining how to handle each category, you can make sure the AI assistant fulfills its purpose while maintaining safety and reliability. Red teaming further validates these strategies by rigorously testing interactions, making sure that the assistant remains useful and trustworthy in different situations.

Conclusion

Implementing responsible AI policies involves continuous improvement. Scaling solutions, like integrating SageMaker for model lifecycle monitoring or AWS CloudFormation for controlled deployments, helps organizations maintain robust AI governance as they grow.

Integrating responsible AI through red teaming is a crucial step to assess that generative AI systems operate responsibly, securely, and remain compliant. Data Reply collaborates with AWS to industrialize these efforts, from fairness checks to security stress tests, helping organizations stay ahead of emerging threats and evolving standards.

Data Reply has extensive expertise in helping customers adopt generative AI, especially with their GenAI Factory framework, which simplifies the transition from proof of concept to production, benefiting industries such as maintenance and customer service FAQs. The GenAI Factory initiative by Data Reply France is designed to overcome integration challenges and scale generative AI applications effectively, using AWS managed services like Amazon Bedrock and OpenSearch Serverless.

To learn more about Data Reply’s work, check out their specialized offerings for red teaming in generative AI and LLMOps.


About the authors

Cassandre Vandeputte is a Solutions Architect for AWS Public Sector based in Brussels. Since her first steps into the digital world, she has been passionate about harnessing technology to drive positive societal change. Beyond her work with intergovernmental organizations, she drives responsible AI practices across AWS EMEA customers.

Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Amine Aitelharraj is a seasoned cloud leader and ex-AWS Senior Consultant with over a decade of experience driving large-scale cloud, data, and AI transformations. Currently a Principal AWS Consultant and AWS Ambassador, he combines deep technical expertise with strategic leadership to deliver scalable, secure, and cost-efficient cloud solutions across sectors. Amine is passionate about GenAI, serverless architectures, and helping organizations unlock business value through modern data platforms.

Read More

InterVision accelerates AI development using AWS LLM League and Amazon SageMaker AI

InterVision accelerates AI development using AWS LLM League and Amazon SageMaker AI

Cities and local governments are continuously seeking ways to enhance their non-emergency services, recognizing that intelligent, scalable contact center solutions play a crucial role in improving citizen experiences. InterVision Systems, LLC (InterVision), an AWS Premier Tier Services Partner and Amazon Connect Service Delivery Partner, has been at the forefront of this transformation, with their contact center solution designed specifically for city and county services called ConnectIV CX for Community Engagement. Though their solution already streamlines municipal service delivery through AI-powered automation and omnichannel engagement, InterVision recognized an opportunity for further enhancement with advanced generative AI capabilities.

InterVision used the AWS LLM League program to accelerate their generative AI development for non-emergency (311) contact centers. As AWS LLM League events began rolling out in North America, this initiative represented a strategic milestone in democratizing machine learning (ML) and enabling partners to build practical generative AI solutions for their customers.

Through this initiative, InterVision’s solutions architects, engineers, and sales teams participated in fine-tuning large language models (LLMs) using Amazon SageMaker AI specifically for municipal service scenarios. InterVision used this experience to enhance their ConnectIV CX solution and demonstrated how AWS Partners can rapidly develop and deploy domain-specific AI solutions.

This post demonstrates how AWS LLM League’s gamified enablement accelerates partners’ practical AI development capabilities, while showcasing how fine-tuning smaller language models can deliver cost-effective, specialized solutions for specific industry needs.

Understanding the AWS LLM League

The AWS LLM League represents an innovative approach to democratizing ML through gamified enablement. The program proves that with the right tools and guidance, almost any role—from solutions architects and developers to sales teams and business analysts—can successfully fine-tune and deploy generative AI models without requiring deep data science expertise. Though initially run as larger multi-organization events such as at AWS re:Invent, the program has evolved to offer focused single-partner engagements that align directly with specific business objectives. This targeted approach allows for customization of the entire experience around real-world use cases that matter most to the participating organization.

The program follows a three-stage format designed to build practical generative AI capabilities. It begins with an immersive hands-on workshop where participants learn the fundamentals of fine-tuning LLMs using Amazon SageMaker JumpStart. SageMaker JumpStart is an ML hub that can help you accelerate your ML journey.

The competition then moves into an intensive model development phase. During this phase, participants iterate through multiple fine-tuning approaches, which can include dataset preparation, data augmentation, and other techniques. Participants submit their models to a dynamic leaderboard, where each submission is evaluated by an AI system that measures the model’s performance against specific benchmarks. This creates a competitive environment that drives rapid experimentation and learning, because participants can observe how their fine-tuned models perform against larger foundation models (FMs), encouraging optimization and innovation.

The program culminates in an interactive finale structured like a live game show as seen in the following figure, where top-performing participants showcase their models’ capabilities through real-time challenges. Model responses are evaluated through a triple-judging system: an expert panel assessing technical merit, an AI benchmark measuring performance metrics, and audience participation providing real-world perspective. This multi-faceted evaluation verifies that models are assessed not just on technical performance, but also on practical applicability.

AWS LLM League finale event where top-performing participants showcase their models' capabilities through real-time challenges

The power of fine-tuning for business solutions

Fine-tuning an LLM is a type of transfer learning, a process that trains a pre-trained model on a new dataset without training from scratch. This process can produce accurate models with smaller datasets and less training time. Although FMs offer impressive general capabilities, fine-tuning smaller models for specific domains often delivers exceptional results at lower cost. For example, a fine-tuned 3B parameter model can outperform larger 70B parameter models in specialized tasks, while requiring significantly less computational resources. A 3B parameter model can run on an ml.g5.4xlarge instance, whereas a 70B parameter model would require the much more powerful and costly ml.g5.48xlarge instance. This approach aligns with recent industry developments, such as DeepSeek’s success in creating more efficient models through knowledge distillation techniques. Distillation is often implemented through a form of fine-tuning, where a smaller student model learns by mimicking the outputs of a larger, more complex teacher model.

In InterVision’s case, the AWS LLM League program was specifically tailored around their ConnectIV CX solution for community engagement services. For this use case, fine-tuning enables precise handling of municipality-specific procedures and responses aligned with local government protocols. Furthermore, the customized model provides reduced operational cost compared to using larger FMs, and faster inference times for better customer experience.

Fine-tuning with SageMaker Studio and SageMaker Jumpstart

The solution centers on SageMaker JumpStart in Amazon SageMaker Studio, which is a web-based integrated development environment (IDE) for ML that lets you build, train, debug, deploy, and monitor your ML models. With SageMaker JumpStart in SageMaker Studio, ML practitioners use a low-code/no-code (LCNC) environment to streamline the fine-tuning process and deploy their customized models into production.

Fine-tuning FMs with SageMaker Jumpstart involves a few steps in SageMaker Studio:

  • Select a model – SageMaker JumpStart provides pre-trained, publicly available FMs for a wide range of problem types. You can browse and access FMs from popular model providers for text and image generation models that are fully customizable.
  • Provide a training dataset – You select your training dataset that is saved in Amazon Simple Storage Service (Amazon S3), allowing you to use the virtually limitless storage capacity.
  • Perform fine-tuning – You can customize hyperparameters prior to the fine-tuning job, such as epochs, learning rate, and batch size. After choosing Start, SageMaker Jumpstart will handle the entire fine-tuning process.
  • Deploy the model – When the fine-tuning job is complete, you can access the model in SageMaker Studio and choose Deploy to start inferencing it. In addition, you can import the customized models to Amazon Bedrock, a managed service that enables you to deploy and scale models for production.
  • Evaluate the model and iterate – You can evaluate a model in SageMaker Studio using Amazon SageMaker Clarify, an LCNC solution to assess the model’s accuracy, explain model predictions, and review other relevant metrics. This allows you to identify areas where the model can be improved and iterate on the process.

This streamlined approach significantly reduces the complexity of developing and deploying specialized AI models while maintaining high performance standards and cost-efficiency. For the AWS LLM League model development phase, the workflow is depicted in the following figure.

The AWS LLM League Workflow

During the model development phase, you start with a default base model and initial dataset uploaded into an S3 bucket. You then use SageMaker JumpStart to fine-tune your model. You then submit the customized model to the AWS LLM League leaderboard, where it will be evaluated against a larger pre-trained model. This allows you to benchmark your model’s performance and identify areas for further improvement.

The leaderboard, as shown in the following figure, provides a ranking of how you stack up against your peers. This will motivate you to refine your dataset, adjust the training hyperparameters, and resubmit an updated version of your model. This gamified experience fosters a spirit of friendly competition and continuous learning. The top-ranked models from the leaderboard will ultimately be selected to compete in the AWS LLM League’s finale game show event.

AWS LLM League Leaderboard

Empowering InterVision’s AI capabilities

The AWS LLM League engagement provided InterVision with a practical pathway to enhance their AI capabilities while addressing specific customer needs. InterVision participants could immediately apply their learning to solve real business challenges by aligning the competition with their ConnectIV CX solution use cases.

The program’s intensive format proved highly effective, enabling InterVision to compress their AI development cycle significantly. The team successfully integrated fine-tuned models into their environment, enhancing the intelligence and context-awareness of customer interactions. This hands-on experience with SageMaker JumpStart and model fine-tuning created immediate practical value.

“This experience was a true acceleration point for us. We didn’t just experiment with AI—we compressed months of R&D into real-world impact. Now, our customers aren’t asking ‘what if?’ anymore, they’re asking ‘what’s next?’”

– Brent Lazarenko, Head of Technology and Innovation at InterVision.

Using the knowledge gained through the program, InterVision has been able to enhance their technical discussions with customers about generative AI implementation. Their ability to demonstrate practical applications of fine-tuned models has helped facilitate more detailed conversations about AI adoption in customer service scenarios. Building on this foundation, InterVision developed an internal virtual assistant using Amazon Bedrock, incorporating custom models, multi-agent collaboration, and retrieval architectures connected to their knowledge systems. This implementation serves as a proof of concept for similar customer solutions while demonstrating practical applications of the skills gained through the AWS LLM League.

As InterVision progresses toward AWS Generative AI Competency, these achievements showcase how partners can use AWS services to develop and implement sophisticated AI solutions that address specific business needs.

Conclusion

The AWS LLM League program demonstrates how gamified enablement can accelerate partners’ AI capabilities while driving tangible business outcomes. Through this focused engagement, InterVision not only enhanced their technical capabilities in fine-tuning language models, but also accelerated the development of practical AI solutions for their ConnectIV CX environment. The success of this partner-specific approach highlights the value of combining hands-on learning with real-world business objectives.

As organizations continue to explore generative AI implementations, the ability to efficiently develop and deploy specialized models becomes increasingly critical. The AWS LLM League provides a structured pathway for partners and customers to build these capabilities, whether they’re enhancing existing solutions or developing new AI-powered services.

Learn more about implementing generative AI solutions:

You can also visit the AWS Machine Learning blog for more stories about partners and customers implementing generative AI solutions across various industries.


About the Authors

Vu Le is a Senior Solutions Architect at AWS with more than 20 years of experience. He works closely with AWS Partners to expand their cloud business and increase adoption of AWS services. Vu has deep expertise in storage, data modernization, and building resilient architectures on AWS, and has helped numerous organizations migrate mission-critical systems to the cloud. Vu enjoys photography, his family, and his beloved corgi.

Jaya Padma Mutta is a Manager Solutions Architects at AWS based out of Seattle. She is focused on helping AWS Partners build their cloud strategy. She enables and mentors a team of technical Solution Architects aligned to multiple global strategic partners. Prior to joining this team, Jaya spent over 5 years in AWS Premium Support Engineering leading global teams, building processes and tools to improve customer experience. Outside of work, she loves traveling, nature, and is an ardent dog-lover.

Mohan CV is a Principal Solutions Architect at AWS, based in Northern Virginia. He has an extensive background in large-scale enterprise migrations and modernization, with a specialty in data analytics. Mohan is passionate about working with new technologies and enjoys assisting customers in adapting them to meet their business needs.

Rajesh Babu Nuvvula is a Solutions Architect in the Worldwide Public Sector team at AWS. He collaborates with public sector partners and customers to design and scale well-architected solutions. Additionally, he supports their cloud migrations and application modernization initiatives. His areas of expertise include designing distributed enterprise applications and databases.

Brent Lazarenko is the Head of Technology & AI at InterVision Systems, where he’s shaping the future of AI, cloud, and data modernization for over 1,700 clients. A founder, builder, and innovator, he scaled Virtuosity into a global powerhouse before a successful private equity exit. Armed with an MBA, MIT AI & leadership creds, and PMP/PfMP certifications, he thrives at the intersection of tech and business. When he’s not driving digital transformation, he’s pushing the limits of what’s next in AI, Web3, and the cloud.

Read More

Improve Amazon Nova migration performance with data-aware prompt optimization

Improve Amazon Nova migration performance with data-aware prompt optimization

In the era of generative AI, new large language models (LLMs) are continually emerging, each with unique capabilities, architectures, and optimizations. Among these, Amazon Nova foundation models (FMs) deliver frontier intelligence and industry-leading cost-performance, available exclusively on Amazon Bedrock. Since its launch in 2024, generative AI practitioners, including the teams in Amazon, have started transitioning their workloads from existing FMs and adopting Amazon Nova models.

However, when transitioning between different foundation models, the prompts created for your original model might not be as performant for Amazon Nova models without prompt engineering and optimization. Amazon Bedrock prompt optimization offers a tool to automatically optimize prompts for your specified target models (in this case, Amazon Nova models). It can convert your original prompts to Amazon Nova-style prompts. Additionally, during the migration to Amazon Nova, a key challenge is making sure that performance after migration is at least as good as or better than prior to the migration. To mitigate this challenge, thorough model evaluation, benchmarking, and data-aware optimization are essential, to compare the Amazon Nova model’s performance against the model used before the migration, and optimize the prompts on Amazon Nova to align performance with that of the previous workload or improve upon them.

In this post, we present an LLM migration paradigm and architecture, including a continuous process of model evaluation, prompt generation using Amazon Bedrock, and data-aware optimization. The solution evaluates the model performance before migration and iteratively optimizes the Amazon Nova model prompts using user-provided dataset and objective metrics. We demonstrate successful migration to Amazon Nova for three LLM tasks: text summarization, multi-class text classification, and question-answering implemented by Retrieval Augmented Generation (RAG). We also discuss the lessons learned and best practices for you to implement the solution for your real-world use cases.

Migrating your generative AI workloads to Amazon Nova

Migrating the model from your generative AI workload to Amazon Nova requires a structured approach to achieve performance consistency and improvement. It includes evaluating and benchmarking the old and new models, optimizing prompts on the new model, and testing and deploying the new models in your production. In this section, we present a four-step workflow and a solution architecture, as shown in the following architecture diagram.

model migration process

The workflow includes the following steps:

  1. Evaluate the source model and collect key performance metrics based on your business use case, such as response accuracy, response format correctness, latency, and cost, to set a performance baseline as the model migration target.
  2. Automatically update the structure, instruction, and language of your prompts to adapt to the Amazon Nova model for accurate, relevant, and faithful outputs. We will discuss this more in the next section.
  3. Evaluate the optimized prompts on the migrated Amazon Nova model to meet the performance target defined in Step 1. You can conduct the optimization in Step 2 as an iterative process until the optimized prompts meet your business criteria.
  4. Conduct A/B testing to validate the Amazon Nova model performance in your testing and production environment. When you’re satisfied, you can deploy the Amazon Nova model, settings, and prompts in production. 

This four-step workflow needs to run continuously, to adapt to variations in both the model and the data, driven by the changes in business use cases. The continuous adaptation provides ongoing optimization and helps maximize overall model performance.

Data-aware prompt optimization on Amazon Nova

In this section, we present a comprehensive optimization methodology, taking two steps. The first step is to use Amazon Bedrock prompt optimization to refine your prompt structure, and then use an innovative data-aware prompt optimization approach to further optimize the prompt to improve the Amazon Nova model performance.

Amazon Bedrock prompt optimization

Amazon Bedrock provides a prompt optimization feature that rewrites prompts to improve performance for your use cases. Prompt optimization streamlines the way that AWS developers interact with FMs on Amazon Bedrock, automatically adapts the prompts to the selected models, and generates for better performance.

As the first step, you can use prompt optimization to adapt your prompt to Amazon Nova. By analyzing the prompt you provide, the feature interprets the task, system prompt, and instruction within the prompt, and automatically crafts the prompt with Amazon Nova specific format and appropriate words, phrases, and sentences. The following example shows how prompt optimization converts a typical prompt for a summarization task on Anthropic’s Claude Haiku into a well-structured prompt for an Amazon Nova model, with sections that begin with special markdown tags such as ## Task, ### Summarization Instructions, and ### Document to Summarize.

Model Prompt
Anthropic’s Claude 3 Haiku Human:
Act like you are an intelligent AI assistant. You are required to provide a summarization based on given document. Please use below instructions when generating the response.
The document is provided in <doc></doc> XML tags.
Please be brief and concise in your answer. Do not add any information that is not mentioned in the document.
Do not provide any preamble and directly start with the summarization.
Do not make up the answer, If you don’t know the answer, just say that I don’t know.<doc>{document}</doc>Assistant:
Amazon Nova Lite with Amazon
Bedrock prompt optimization
### Task
Your task is to summarize the given document enclosed in <doc></doc> tags in a brief and concise manner. Follow these guidelines:### Summarization Instructions
– Read the document carefully to understand its main points and key information.
– Identify the core ideas, arguments, and supporting details presented in the document.
– Synthesize the essential information into a clear and succinct summary.
– Use your own words to paraphrase the key points – do not copy verbatim from the original text.
– Omit any extraneous or redundant information not central to the main ideas.
– Do not introduce new information or make up content not present in the original document.
– If you cannot summarize the document due to lack of understanding, simply respond “I don’t know.”### Document to Summarize
<doc>{document}</doc>Please provide your concise summary immediately without any preamble.

We applied the preceding prompts to the Anthropic Claude 3 Haiku and Amazon Nova Lite models, respectively, using the public xsum dataset. To evaluate the model performance, because the summarization task doesn’t have a predefined ground truth, we designed an LLM judge as shown in the following prompt to validate the summarization quality:

You are an AI assistant, your task is to compare the following LLM-generated summary with the original document, rate how well it captures the key points and conveys the most critical information, on a scale of 1-5.
    
    The score should be based on the following performance criteria:
    - Consistency: characterizes the summary’s factual and logical correctness. It should stay true to the original text, not introduce additional information, and use the same terminology.
    - Relevance: captures whether the summary is limited to the most pertinent information in the original text. A relevant summary focuses on the essential facts and key messages, omitting unnecessary details or trivial information.
    - Fluency: describes the readability of the summary. A fluent summary is well-written and uses proper syntax, vocabulary, and grammar.
    - Coherence: measures the logical flow and connectivity of ideas. A coherent summary presents the information in a structured, logical, and easily understandable manner.
    
    Score 5 means the LLM-generated summary is the best summary fully aligned with the original document,
    Score 1 means the LLM-generated summary is the worst summary completely irrelevant to the original document.  

    Please also provide an explanation on why you provide the score. Keep the explanation as concise as possible.

    The LLM-generated summary is provided within the <summary> XML tag,
    The original document is provided within the <document> XML tag,

    In your response, present the score within the <score> XML tag, and the explanation within the <thinking> XML tag.

    DO NOT nest <score> and <thinking> element.
    DO NOT put any extra attribute in the <score> and <thinking> tag.
    
    <document>
    {document}
    </document>

    LLM generated summary:
    <summary>
    {summary}
    </summary>

The experiment, using 80 data samples, shows that the accuracy is improved on the Amazon Nova Lite model from 77.75% to 83.25% using prompt optimization.

Data-aware optimization

Although Amazon Bedrock prompt optimization supports the basic needs of prompt engineering, other prompt optimization techniques are available to maximize LLM performance, such as Multi-Aspect Critique, Self-Reflection, Gradient Descent and Beam Search, and Meta Prompting. Specifically, we observed requirements from users that they need to fine-tune their prompts against their optimization objective metrics they define, such as ROUGE, BERT-F1, or an LLM judge score, by using a dataset they provide. To meet these needs, we designed a data-aware optimization architecture as shown in the following diagram.

data-aware-optimization

The data-aware optimization takes inputs. The first input is the user-defined optimization objective metrics; for the summarization task discussed in the previous section, you can use the BERT-F1 score or create your own LLM judge. The second input is a training dataset (DevSet) provided by the user to validate the response quality, for example, a summarization data sample with the following format.

Source Document Summarization
Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. Detectives said three firearms, ammunition and a five-figure sum of money were recovered. A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday. A man has appeared in court after firearms, ammunition and cash were seized by police in Edinburgh.
<another document ...> <another summarization ...>

The data-aware optimization uses these two inputs to improve the prompt for better Amazon Nova response quality. In this work, we use the DSPy (Declarative Self-improving Python) optimizer for the data-aware optimization. DSPy is a widely used framework for programming language models. It offers algorithms for optimizing the prompts for multiple LLM tasks, from simple classifiers and summarizers to sophisticated RAG pipelines. The dspy.MIPROv2 optimizer intelligently explores better natural language instructions for every prompt using the DevSet, to maximize the metrics you define.

We applied the MIPROv2 optimizer on top of the results optimized by Amazon Bedrock in the previous section for better Amazon Nova performance. In the optimizer, we specify the number of the instruction candidates in the generation space, use Bayesian optimization to effectively search over the space, and run it iteratively to generate instructions and few-shot examples for the prompt in each step:

# Initialize optimizer
teleprompter = MIPROv2(
    metric=metric,
    num_candidates=5,
    auto="light", 
    verbose=False,
)

With the setting of num_candidates=5, the optimizer generates five candidate instructions:

0: Given the fields `question`, produce the fields `answer`.

1: Given a complex question that requires a detailed reasoning process, produce a structured response that includes a step-by-step reasoning and a final answer. Ensure the reasoning clearly outlines each logical step taken to arrive at the answer, maintaining clarity and neutrality throughout.

2: Given the fields `question` and `document`, produce the fields `answer`. Read the document carefully to understand its main points and key information. Identify the core ideas, arguments, and supporting details presented in the document. Synthesize the essential information into a clear and succinct summary. Use your own words to paraphrase the key points without copying verbatim from the original text. Omit any extraneous or redundant information not central to the main ideas. Do not introduce new information or make up content not present in the original document. If you cannot summarize the document due to lack of understanding, simply respond "I don't know.

3: In a high-stakes scenario where you must summarize critical documents for an international legal case, use the Chain of Thought approach to process the question. Carefully read and understand the document enclosed in <doc></doc> tags, identify the core ideas and key information, and synthesize this into a clear and concise summary. Ensure that the summary is neutral, precise, and omits any extraneous details. If the document is too complex or unclear, respond with "I don't know.

4: Given the fields `question` and `document`, produce the fields `answer`. The `document` field contains the text to be summarized. The `answer` field should include a concise summary of the document, following the guidelines provided. Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

We set other parameters for the optimization iteration, including the number of trials, the number of few-shot examples, and the batch size for the optimization process:

# Optimize program
optimized_program = teleprompter.compile(
        program.deepcopy(),
        trainset=trainset,
        num_trials=7,
        minibatch_size=20,
        minibatch_full_eval_steps=7,
        max_bootstrapped_demos=2,
        max_labeled_demos=2,
        requires_permission_to_run=False,
)

When the optimization starts, MIPROv2 uses each instruction candidate along with the mini-batch of the testing dataset we provided to infer the LLM and calculate the metrics we defined. After the loop is complete, the optimizer evaluates the best instruction by using the full testing dataset and calculates the full evaluation score. Based on the iterations, the optimizer provides the improved instruction for the prompt:

Given the fields `question` and `document`, produce the fields `answer`.
The `document` field contains the text to be summarized.
The `answer` field should include a concise summary of the document, following the guidelines provided.
Ensure the summary is clear, accurate, and captures the core ideas without introducing new information.

Applying the optimized prompt, the summarization accuracy generated by the LLM judge on Amazon Nova Lite model is further improved from 83.25% to 87.75%.

We also applied the optimization process on other LLM tasks, including a multi-class text classification task, and a question-answering task using RAG. In all the tasks, our approach optimized the migrated Amazon Nova model to out-perform the Anthropic Claude Haiku and Meta Llama models before migration. The following table and chart illustrate the optimization results.

Task DevSet Evaluation Before Migration After Migration (Amazon Bedrock Prompt Optimization) After Migration (DSPy with Amazon Bedrock Prompt Optimization)
Summarization (Anthropic Claude 3 Haiku to Amazon Nova Lite) 80 samples LLM Judge 77.75 83.25 87.75
Classification (Meta Llama 3.2 3B to Amazon Nova Micro) 80 samples Accuracy 81.25 81.25 87.5
QA-RAG (Anthropic Claude 3 Haiku to Amazon Nova Lite) 50 samples Semantic Similarity 52.71 51.6 57.15

Migration results

For the text classification use case, we optimized the Amazon Nova Micro model using 80 samples, using the accuracy metrics to evaluate the optimization performance in each step. After seven iterations, the optimized prompt provides 87.5% accuracy, improved from the accuracy of 81.25% running on the Meta Llama 3.2 3B model.

For the question-answering use case, we used 50 samples to optimize the prompt for an Amazon Nova Lite model in the RAG pipeline, and evaluated the performance using a semantic similarity score. The score compares the cosine distance between the model’s answer and the ground truth answer. Comparing to the testing data running on Anthropic’s Claude 3 Haiku, the optimizer improved the score from 52.71 to 57.15 after migrating to the Amazon Nova Lite model and prompt optimization.

You can find more details of these examples in the GitHub repository.

Lessons learned and best practices

Through the solution design, we have identified best practices that can help you properly configure your prompt optimization to maximize the metrics you specify for your use case:

  • Your dataset for optimizer should be of high quality and relevancy, and well-balanced to cover the data patterns and edge cases of your use case, and nuances to minimize biases.
  • The metrics you defined as the target of optimization should be use case specific. For example, if your dataset has ground truth, then you can use statistical and programmatical machine learning (ML) metrics such as accuracy and semantic similarity If your dataset doesn’t include ground truth, a well-designed and human-aligned LLM judge can provide a reliable evaluation score for the optimizer.
  • The optimizer runs with a number of prompt candidates (parameter dspy.num_candidates) and uses the evaluation metric you defined to select the optimal prompt as the output. Avoid setting too few candidates that might miss opportunity for improvement. In the previous summarization example, we set five prompt candidates for optimizing through 80 training samples, and received good optimization performance.
  • The prompt candidates include a combination of prompt instructions and few-shot examples. You can specify the number of examples (parameter dspy.max_labeled_demos for examples from labeled samples, and parameter dspy.max_bootstrapped_demos for examples from unlabeled samples); we recommend the example number be no less than 2.
  • The optimization runs in iteration (parameter dspy.num_trials); you should set enough iterations that allow you to refine prompts based on different scenarios and performance metrics, and gradually enhance clarity, relevance, and adaptability. If you optimize both the instructions and the few-shot examples in the prompt, we recommend you set the iteration number to no less than 2, preferably between 5–10.

In your use case, if your prompt structure is complex with chain-of-thoughts or tree-of-thoughts, long instructions in the system prompt, and multiple inputs in the user prompt, you can use a task-specific class to abstract the DSPy optimizer. The class helps encapsulate the optimization logic, standardize the prompt structure and optimization parameters, and allow straightforward implementation of different optimization strategies. The following is an example of the class created for text classification task:

class Classification(dspy.Signature):

""" You are a product search expert evaluating the quality of specific search results and deciding will that lead to a buying decision or not. You will be given a search query and the resulting product information and will classify the result against a provided classification class. Follow the given instructions to classify the search query using the classification scheme

   Class Categories:

   Class Label:

   Category Label: Positive Search

   The class is chosen when the search query and the product are a full match and hence the customer experience is positive

   Category Label: Negative Search

   The class is chosen when the search query and the product are fully misaligned, meaning you searched for something but the output is completely different

   Category Label: Moderate Search

   The class is chosen when the search query and the product may not be fully same, but still are complementing each other and maybe of similar category

"""

   search_query = dspy.InputField(desc="Search Query consisting of keywords")

   result_product_title = dspy.InputField(desc="This is part of Product Description and indicates the Title of the product")

   result_product_description = dspy.InputField(desc="This is part of Product Description and indicates the description of the product")

   …

   thinking = dspy.OutputField(desc="justification in the scratchpad, explaining the reasoning behind the classification choice and highlighting key factors that led to the decision")

   answer = dspy.OutputField(desc="final classification label for the product result: positive_search/negative_search/moderate_search. ")

""" Instructions:

Begin by creating a scratchpad where you can jot down your initial thoughts, observations, and any pertinent information related to the search query and product. This section is for your personal use and doesn't require a formal structure.
Proceed to examine and dissect the search query. Pinpoint essential terms, brand names, model numbers, and specifications. Assess the user's probable objective based on the query.
Subsequently, juxtapose the query with the product. Seek out precise correspondences in brand, model, and specifications. Recognize commonalities in functionality, purpose, or features. Reflect on how the product connects to or augments the item being queried.
Afterwards, employ a methodical classification approach, contemplating each step carefully
Conclude by verifying the classification. Scrutinize the selected category in relation to its description to confirm its precision. Take into account any exceptional circumstances or possible uncertainties.

"""

Conclusion

In this post, we introduced the workflow and architecture for you to migrate your current generative AI workload into Amazon Nova models, and presented a comprehensive prompt optimization approach using Amazon Bedrock prompt optimization and a data-aware prompt optimization methodology with DSPy. The results on three LLM tasks demonstrated the optimized performance of Amazon Nova in its intelligence classes and the model performance improved by Amazon Bedrock prompt optimization post-model migration, which is further enhanced by the data-aware prompt optimization methodology presented in this post.

The Python library and code examples are publicly available on GitHub. You can use this LLM migration method and the prompt optimization solution to migrate your workloads into Amazon Nova, or in other model migration processes.


About the Authors

YunfeiYunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Anupam Dewan is a Senior Solutions Architect with a passion for generative AI and its applications in real life. He and his team enable Amazon Builders who build customer facing application using generative AI. He lives in Seattle area, and outside of work loves to go on hiking and enjoy nature.

Shuai Wang is a Senior Applied Scientist and Manager at Amazon Bedrock, specializing in natural language processing, machine learning, large language modeling, and other related AI areas. Outside work, he enjoys sports, particularly basketball, and family activities.

Kashif Imran is a seasoned engineering and product leader with deep expertise in AI/ML, cloud architecture, and large-scale distributed systems. Currently a Senior Manager at AWS, Kashif leads teams driving innovation in generative AI and Cloud, partnering with strategic cloud customers to transform their businesses. Kashif holds dual master’s degrees in Computer Science and Telecommunications, and specializes in translating complex technical capabilities into measurable business value for enterprises.

Read More

Customize Amazon Nova models to improve tool usage

Customize Amazon Nova models to improve tool usage

Modern large language models (LLMs) excel in language processing but are limited by their static training data. However, as industries require more adaptive, decision-making AI, integrating tools and external APIs has become essential. This has led to the evolution and rapid rise of agentic workflows, where AI systems autonomously plan, execute, and refine tasks. Accurate tool use is foundational for enhancing the decision-making and operational efficiency of these autonomous agents and building successful and complex agentic workflows.

In this post, we dissect the technical mechanisms of tool calling using Amazon Nova models through Amazon Bedrock, alongside methods for model customization to refine tool calling precision.

Expanding LLM capabilities with tool use

LLMs excel at natural language tasks but become significantly more powerful with tool integration, such as APIs and computational frameworks. Tools enable LLMs to access real-time data, perform domain-specific computations, and retrieve precise information, enhancing their reliability and versatility. For example, integrating a weather API allows for accurate, real-time forecasts, or a Wikipedia API provides up-to-date information for complex queries. In scientific contexts, tools like calculators or symbolic engines address numerical inaccuracies in LLMs. These integrations transform LLMs into robust, domain-aware systems capable of handling dynamic, specialized tasks with real-world utility.

Amazon Nova models and Amazon Bedrock

Amazon Nova models, unveiled at AWS re:Invent in December 2024, are optimized to deliver exceptional price-performance value, offering state-of-the-art performance on key text-understanding benchmarks at low cost. The series comprises three variants: Micro (text-only, ultra-efficient for edge use), Lite (multimodal, balanced for versatility), and Pro (multimodal, high-performance for complex tasks).

Amazon Nova models can be used for variety of tasks, from generation to developing agentic workflows. As such, these models have the capability to interface with external tools or services and use them through tool calling. This can be achieved through the Amazon Bedrock console (see Getting started with Amazon Nova in the Amazon Bedrock console) and APIs such as Converse and Invoke.

In addition to using the pre-trained models, developers have the option to fine-tune these models with multimodal data (Pro and Lite) or text data (Pro, Lite, and Micro), providing the flexibility to achieve desired accuracy, latency, and cost. Developers can also run self-service custom fine-tuning and distillation of larger models to smaller ones using the Amazon Bedrock console and APIs.

Solution overview

The following diagram illustrates the solution architecture.

Our solution consists data preparation for tool use, finetuning with prepared dataset, hosting the finetuned model and evaluation of the finetuned model

For this post, we first prepared a custom dataset for tool usage. We used the test set to evaluate Amazon Nova models through Amazon Bedrock using the Converse and Invoke APIs. We then fine-tuned Amazon Nova Micro and Amazon Nova Lite models through Amazon Bedrock with our fine-tuning dataset. After the fine-tuning process was complete, we evaluated these customized models through provisioned throughput. In the following sections, we go through these steps in more detail.

Tools

Tool usage in LLMs involves two critical operations: tool selection and argument extraction or generation. For instance, consider a tool designed to retrieve weather information for a specific location. When presented with a query such as “What’s the weather in Alexandria, VA?”, the LLM evaluates its repertoire of tools to determine whether an appropriate tool is available. Upon identifying a suitable tool, the model selects it and extracts the required arguments—here, “Alexandria” and “VA” as structured data types (for example, strings)—to construct the tool call.

Each tool is rigorously defined with a formal specification that outlines its intended functionality, the mandatory or optional arguments, and the associated data types. Such precise definitions, known as tool config, make sure that tool calls are executed correctly and that argument parsing aligns with the tool’s operational requirements. Following this requirement, the dataset used for this example defines eight tools with their arguments and configures them in a structured JSON format. We define the following eight tools (we use seven of them for fine-tuning and hold out the weather_api_call tool during testing in order to evaluate the accuracy on unseen tool use):

  • weather_api_call – Custom tool for getting weather information
  • stat_pull – Custom tool for identifying stats
  • text_to_sql – Custom text-to-SQL tool
  • terminal – Tool for executing scripts in a terminal
  • wikipidea – Wikipedia API tool to search through Wikipedia pages
  • duckduckgo_results_json – Internet search tool that executes a DuckDuckGo search
  • youtube_search – YouTube API search tool that searches video listings
  • pubmed_search – PubMed search tool that searches PubMed abstracts

The following code is an example of what a tool configuration for terminal might look like:

{'toolSpec': {'name': 'terminal',
'description': 'Run shell commands on this MacOS machine ',
'inputSchema': {'json': {'type': 'object',
'properties': {'commands': {'type': 'string',
'description': 'List of shell commands to run. Deserialized using json.loads'}},
'required': ['commands']}}}},

Dataset

The dataset is a synthetic tool calling dataset created with assistance from a foundation model (FM) from Amazon Bedrock and manually validated and adjusted. This dataset was created for our set of eight tools as discussed in the previous section, with the goal of creating a diverse set of questions and tool invocations that allow another model to learn from these examples and generalize to unseen tool invocations.

Each entry in the dataset is structured as a JSON object with key-value pairs that define the question (a natural language user query for the model), the ground truth tool required to answer the user query, its arguments (dictionary containing the parameters required to execute the tool), and additional constraints like order_matters: boolean, indicating if argument order is critical, and arg_pattern: optional, a regular expression (regex) for argument validation or formatting. Later in this post, we use these ground truth labels to supervise the training of pre-trained Amazon Nova models, adapting them for tool use. This process, known as supervised fine-tuning, will be explored in detail in the following sections.

The size of the training set is 560 questions and the test set is 120 questions. The test set consists of 15 questions per tool category, totaling 120 questions. The following are some examples from the dataset:

{
"question": "Explain the process of photosynthesis",
"answer": "wikipedia",
"args": {'query': 'process of photosynthesis'
},
"order_matters":False,
"arg_pattern":None
}
{
"question": "Display system date and time",
"answer": "terminal",
"args": {'commands': ['date'
]
},
"order_matters":True,
"arg_pattern":None
}
{
"question": "Upgrade the requests library using pip",
"answer": "terminal",
"args": {'commands': ['pip install --upgrade requests'
]
},
"order_matters":True,
"arg_pattern": [r'pip(3?) install --upgrade requests'
]
}

Prepare the dataset for Amazon Nova

To use this dataset with Amazon Nova models, we need to additionally format the data based on a particular chat template. Native tool calling has a translation layer that formats the inputs to the appropriate format before passing the model. Here, we employ a DIY tool use approach with a custom prompt template. Specifically, we need to add the system prompt, the user message embedded with the tool config, and the ground truth labels as the assistant message. The following is a training example formatted for Amazon Nova. Due to space constraints, we only show the toolspec for one tool.

{"system": [{"text": "You are a bot that can handle different requests
with tools."}],
"messages": [{"role": "user",
"content": [{"text": "Given the following functions within <tools>,
please respond with a JSON for a function call with its proper arguments
that best answers the given prompt.

Respond in the format
{"name": function name,"parameters": dictionary of argument name and
its value}.
Do not use variables. Donot give any explanations.

ONLY output the resulting
JSON structure and nothing else.

Donot use the word 'json' anywhere in the
result.
<tools>
    {"tools": [{"toolSpec":{"name":"youtube_search",
    "description": " search for youtube videos associated with a person.
    the input to this tool should be a comma separated list, the first part
    contains a person name and the second a number that is the maximum number
    of video results to return aka num_results. the second part is optional", 
    "inputSchema":
    {"json":{"type":"object","properties": {"query":
    {"type": "string",
     "description": "youtube search query to look up"}},
    "required": ["query"]}}}},]}
</tools>
Generate answer for the following question.
<question>
List any products that have received consistently negative reviews
</question>"}]},
{"role": "assistant", "content": [{"text": "{'name':text_to_sql,'parameters':
{'table': 'product_reviews','condition':
'GROUP BY product_id HAVING AVG(rating) < 2'}}"}]}],
"schemaVersion": "tooluse-dataset-2024"}

Upload dataset to Amazon S3

This step is needed later for the fine-tuning for Amazon Bedrock to access the training data. You can upload your dataset either through the Amazon Simple Storage Service (Amazon S3) console or through code.

Tool calling with base models through the Amazon Bedrock API

Now that we have created the tool use dataset and formatted it as required, let’s use it to test out the Amazon Nova models. As mentioned previously, we can use both the Converse and Invoke APIs for tool use in Amazon Bedrock. The Converse API enables dynamic, context-aware conversations, allowing models to engage in multi-turn dialogues, and the Invoke API allows the user to call and interact with the underlying models within Amazon Bedrock.

To use the Converse API, you simply send the messages, system prompt (if any), and the tool config directly in the Converse API. See the following example code:

response = bedrock_runtime.converse(
modelId=model_id,
messages=messages,
system=system_prompt,
toolConfig=tool_config,
)

To parse the tool and arguments from the LLM response, you can use the following example code:

for content_block in response['output']['message']["content"]:

if "toolUse" in content_block:
out_tool_name=content_block['toolUse']['name']
out_tool_inputs_dict=content_block['toolUse']['input']
print(out_tool_name,out_tool_inputs_dict.keys())

For the question: “Hey, what's the temperature in Paris right now?”, you get the following output:

weather_api_call dict_keys(['country', 'city'])

To execute tool use through the Invoke API, first you need to prepare the request body with the user question as well as the tool config that was prepared before. The following code snippet shows how to convert the tool config JSON to string format, which can be used in the message body:

# Convert tools configuration to JSON string
formatted_tool_config = json.dumps(tool_config, indent=2)
prompt = prompt_template.replace("{question}", question)
prompt = prompt.replace("{tool_config}", formatted_tool_config)
# message template
messages = [{"role": "user", "content": [{"text": prompt}]}]
# Prepare request body
model_kwargs = {"system":system_prompt, "messages": messages, "inferenceConfig": inferenceConfig,} body = json.dumps(model_kwargs)
response = bedrock_runtime.invoke_model(
body=body,
modelId=model_id,
accept=accept,
contentType=contentType
)

Using either of the two APIs, you can test and benchmark the base Amazon Nova models with the tool use dataset. In the next sections, we show how you can customize these base models specifically for the tool use domain.

Supervised fine-tuning using the Amazon Bedrock console

Amazon Bedrock offers three different customization techniques: supervised fine-tuning, model distillation, and continued pre-training. At the time of writing, the first two methods are available for customizing Amazon Nova models. Supervised fine-tuning is a popular method in transfer learning, where a pre-trained model is adapted to a specific task or domain by training it further on a smaller, task-specific dataset. The process uses the representations learned during pre-training on large datasets to improve performance in the new domain. During fine-tuning, the model’s parameters (either all or selected layers) are updated using backpropagation to minimize the loss.

In this post, we use the labeled datasets that we created and formatted previously to run supervised fine-tuning to adapt Amazon Nova models for the tool use domain.

Create a fine-tuning job

Complete the following steps to create a fine-tuning job:

  1. Open the Amazon Bedrock console.
  2. Choose us-east-1 as the AWS Region.
  3. Under Foundation models in the navigation pane, choose Custom models.
  4. Choose Create Fine-tuning job under Customization methods. 

At the time of writing, Amazon Nova model fine-tuning is exclusively available in the us-east-1 Region.

create finetuning job from console

  1. Choose Select model and choose Amazon as the model provider.
  2. Choose your model (for this post, Amazon Nova Micro) and choose Apply.

choose model for finetuning

  1. For Fine-tuned model name, enter a unique name.
  2. For Job name¸ enter a name for the fine-tuning job.
  3. In the Input data section, enter following details:
    • For S3 location, enter the source S3 bucket containing the training data.
    • For Validation dataset location, optionally enter the S3 bucket containing a validation dataset.

Choosing data location in console for finetuning

  1. In the Hyperparameters section, you can customize the following hyperparameters:
    • For Epochs¸ enter a value between 1–5.
    • For Batch size, the value is fixed at 1.
    • For Learning rate multiplier, enter a value between 0.000001–0.0001
    • For Learning rate warmup steps, enter a value between 0–100.

We recommend starting with the default parameter values and then changing the settings iteratively. It’s a good practice to change only one or a couple of parameters at a time, in order to isolate the parameter effects. Remember, hyperparameter tuning is model and use case specific.

  1. In the Output data section, enter the target S3 bucket for model outputs and training metrics.
  2. Choose Create fine-tuning job.

Run the fine-tuning job

After you start the fine-tuning job, you will be able to see your job under Jobs and the status as Training. When it finishes, the status changes to Complete.

tracking finetuning job progress

You can now go to the training job and optionally access the training-related artifacts that are saved in the output folder.

Check training artifacts

You can find both training and validation (we highly recommend using a validation set) artifacts here.

training and validation artifacts

You can use the training and validation artifacts to assess your fine-tuning job through loss curves (as shown in the following figure), which track training loss (orange) and validation loss (blue) over time. A steady decline in both indicates effective learning and good generalization. A small gap between them suggests minimal overfitting, whereas a rising validation loss with decreasing training loss signals overfitting. If both losses remain high, it indicates underfitting. Monitoring these curves helps you quickly diagnose model performance and adjust training strategies for optimal results.

training and validation loss curves gives insight on how the training progresses

Host the fine-tuned model and run inference

Now that you have completed the fine-tuning, you can host the model and use it for inference. Follow these steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Custom models
  2. On the Models tab, choose the model you fine-tuned.

starting provisioned throughtput through console to host FT model

  1. Choose Purchase provisioned throughput.

start provisioned throughput

  1. Specify a commitment term (no commitment, 1 month, 6 months) and review the associated cost for hosting the fine-tuned models.

After the customized model is hosted through provisioned throughput, a model ID will be assigned, which will be used for inference. For inference with models hosted with provisioned throughput, we have to use the Invoke API in the same way we described previously in this post—simply replace the model ID with the customized model ID.

The aforementioned fine-tuning and inference steps can also be done programmatically. Refer to the following GitHub repo for more detail.

Evaluation framework

Evaluating fine-tuned tool calling LLMs requires a comprehensive approach to assess their performance across various dimensions. The primary metric to evaluate tool calling is accuracy, including both tool selection and argument generation accuracy. This measures how effectively the model selects the correct tool and generates valid arguments. Latency and token usage (input and output tokens) are two other important metrics.

Tool call accuracy evaluates if the tool predicted by the LLM matches the ground truth tool for each question; a score of 1 is given if they match and 0 when they don’t. After processing the questions, we can use the following equation: Tool Call Accuracy=∑(Correct Tool Calls)/(Total number of test questions).

Argument call accuracy assesses whether the arguments provided to the tools are correct, based on either exact matches or regex pattern matching. For each tool call, the model’s predicted arguments are extracted. It uses the following argument matching methods:

  • Regex matching – If the ground truth includes regex patterns, the predicted arguments are matched against these patterns. A successful match increases the score.
  • Inclusive string matching – If no regex pattern is provided, the predicted argument is compared to the ground truth argument. Credit is given if the predicted argument contains the ground truth argument. This is to allow for arguments, like search terms, to not be penalized for adding additional specificity.

The score for each argument is normalized based on the number of arguments, allowing partial credit when multiple arguments are required. The cumulative correct argument scores are averaged across all questions: Argument Call Accuracy = ∑Correct Arguments/(Total Number of Questions).

Below we show some example questions and accuracy scores:

Example 1:

User question: Execute this run.py script with an argparse arg adding two gpus
GT tool: terminal   LLM output tool: terminal
Pred args:  ['python run.py —gpus 2']
Ground truth pattern: python(3?) run.py —gpus 2
Arg matching method: regex match
Arg matching score: 1.0

Example 2:

User question: Who had the most rushing touchdowns for the bengals in 2017 season?
GT tool: stat_pull   LLM output tool: stat_pull
Pred args:  ['NFL']
straight match
arg score 0.3333333333333333
Pred args:  ['2017']
Straight match
Arg score 0.6666666666666666
Pred args:  ['Cincinnati Bengals']
Straight match
Arg score 1.0

Results

We are now ready to visualize the results and compare the performance of base Amazon Nova models to their fine-tuned counterparts.

Base models

The following figures illustrate the performance comparison of the base Amazon Nova models.

Improvement of finetuned models over base models in tool use

The comparison reveals a clear trade-off between accuracy and latency, shaped by model size. Amazon Nova Pro, the largest model, delivers the highest accuracy in both tool call and argument call tasks, reflecting its advanced computational capabilities. However, this comes with increased latency.

In contrast, Amazon Nova Micro, the smallest model, achieves the lowest latency, which ideal for fast, resource-constrained environments, though it sacrifices some accuracy compared to its larger counterparts.

Fine-tuned models vs. base models

The following figure visualizes accuracy improvement after fine-tuning.

Improvement of finetuned models over base models in tool use

The comparative analysis of the Amazon Nova model variants reveals substantial performance improvements through fine-tuning, with the most significant gains observed in the smaller Amazon Nova Micro model. The fine-tuned Amazon Nova model showed remarkable growth in tool call accuracy, increasing from 75.8% to 95%, which is a 25.38% improvement. Similarly, its argument call accuracy rose from 77.8% to 87.7%, reflecting a 12.74% increase.

In contrast, the fine-tuned Amazon Nova Lite model exhibited more modest gains, with tool call accuracy improving from 90.8% to 96.66%—a 6.46% increase—and argument call accuracy rising from 85% to 89.9%, marking a 5.76% improvement. Both fine-tuned models surpassed the accuracy achieved by the Amazon Nova Pro base model.

These results highlight that fine-tuning can significantly enhance the performance of lightweight models, making them strong contenders for applications where both accuracy and latency are critical.

Conclusion

In this post, we demonstrated model customization (fine-tuning) for tool use with Amazon Nova. We first introduced a tool usage use case, and gave details about the dataset. We walked through the details of Amazon Nova specific data formatting and showed how to do tool calling through the Converse and Invoke APIs in Amazon Bedrock. After getting the baseline results from Amazon Nova models, we explained in detail the fine-tuning process, hosting fine-tuned models with provisioned throughput, and using the fine-tuned Amazon Nova models for inference. In addition, we touched upon getting insights from training and validation artifacts from a fine-tuning job in Amazon Bedrock.

Check out the detailed notebook for tool usage to learn more. For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide. The Generative AI Innovation Center has a group of AWS science and strategy experts with comprehensive expertise spanning the generative AI journey, helping customers prioritize use cases, build roadmaps, and move solutions into production. See Generative AI Innovation Center for our latest work and customer success stories.


About the Authors

Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing Generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from University of South Florida and PostDoc from Moffitt Cancer Centre.

Isaac Privitera is a Principal Data Scientist with the AWS Generative AI Innovation Center, where he develops bespoke generative AI-based solutions to address customers’ business problems. His primary focus lies in building responsible AI systems, using techniques such as RAG, multi-agent systems, and model fine-tuning. When not immersed in the world of AI, Isaac can be found on the golf course, enjoying a football game, or hiking trails with his loyal canine companion, Barry.

Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalableGenerative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.

Read More

Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-judge

Evaluate Amazon Bedrock Agents with Ragas and LLM-as-a-judge

AI agents are quickly becoming an integral part of customer workflows across industries by automating complex tasks, enhancing decision-making, and streamlining operations. However, the adoption of AI agents in production systems requires scalable evaluation pipelines. Robust agent evaluation enables you to gauge how well an agent is performing certain actions and gain key insights into them, enhancing AI agent safety, control, trust, transparency, and performance optimization.

Amazon Bedrock Agents uses the reasoning of foundation models (FMs) available on Amazon Bedrock, APIs, and data to break down user requests, gather relevant information, and efficiently complete tasks—freeing teams to focus on high-value work. You can enable generative AI applications to automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.

Ragas is an open source library for testing and evaluating large language model (LLM) applications across various LLM use cases, including Retrieval Augmented Generation (RAG). The framework enables quantitative measurement of the effectiveness of the RAG implementation. In this post, we use the Ragas library to evaluate the RAG capability of Amazon Bedrock Agents.

LLM-as-a-judge is an evaluation approach that uses LLMs to assess the quality of AI-generated outputs. This method employs an LLM to act as an impartial evaluator, to analyze and score outputs. In this post, we employ the LLM-as-a-judge technique to evaluate the text-to-SQL and chain-of-thought capabilities of Amazon Bedrock Agents.

Langfuse is an open source LLM engineering platform, which provides features such as traces, evals, prompt management, and metrics to debug and improve your LLM application.

In the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents, we showcased research agents for cancer biomarker discovery for pharmaceutical companies. In this post, we extend the prior work and showcase Open Source Bedrock Agent Evaluation with the following capabilities:

  • Evaluating Amazon Bedrock Agents on its capabilities (RAG, text-to-SQL, custom tool use) and overall chain-of-thought
  • Comprehensive evaluation results and trace data sent to Langfuse with built-in visual dashboards
  • Trace parsing and evaluations for various Amazon Bedrock Agents configuration options

First, we conduct evaluations on a variety of different Amazon Bedrock Agents. These include a sample RAG agent, a sample text-to-SQL agent, and pharmaceutical research agents that use multi-agent collaboration for cancer biomarker discovery. Then, for each agent, we showcase navigating the Langfuse dashboard to view traces and evaluation results.

Technical challenges

Today, AI agent developers generally face the following technical challenges:

  • End-to-end agent evaluation – Although Amazon Bedrock provides built-in evaluation capabilities for LLM models and RAG retrieval, it lacks metrics specifically designed for Amazon Bedrock Agents. There is a need for evaluating the holistic agent goal, as well as individual agent trace steps for specific tasks and tool invocations. Support is also needed for both single and multi-agents, and both single and multi-turn datasets.
  • Challenging experiment management – Amazon Bedrock Agents offers numerous configuration options, including LLM model selection, agent instructions, tool configurations, and multi-agent setups. However, conducting rapid experimentation with these parameters is technically challenging due to the lack of systematic ways to track, compare, and measure the impact of configuration changes across different agent versions. This makes it difficult to effectively optimize agent performance through iterative testing.

Solution overview

The following figure illustrates how Open Source Bedrock Agent Evaluation works on a high level. The framework runs an evaluation job that will invoke your own agent in Amazon Bedrock and evaluate its response.

Evaluation Workflow

The workflow consists of the following steps:

  1. The user specifies the agent ID, alias, evaluation model, and dataset containing question and ground truth pairs.
  2. The user executes the evaluation job, which will invoke the specified Amazon Bedrock agent.
  3. The retrieved agent invocation traces are run through a custom parsing logic in the framework.
  4. The framework conducts an evaluation based on the agent invocation results and the question type:
    1. Chain-of-thought – LLM-as-a-judge with Amazon Bedrock LLM calls (conducted for every evaluation run for different types of questions)
    2. RAG – Ragas evaluation library
    3. Text-to-SQL – LLM-as-a-judge with Amazon Bedrock LLM calls
  5. Evaluation results and parsed traces are gathered and sent to Langfuse for evaluation insights.

Prerequisites

To deploy the sample RAG and text-to-SQL agents and follow along with evaluating them using Open Source Bedrock Agent Evaluation, follow the instructions in Deploying Sample Agents for Evaluation.

To bring your own agent to evaluate with this framework, refer to the following README and follow the detailed instructions to deploy the Open Source Bedrock Agent Evaluation framework.

Overview of evaluation metrics and input data

First, we create sample Amazon Bedrock agents to demonstrate the capabilities of Open Source Bedrock Agent Evaluation. The text-to-SQL agent uses the BirdSQL Mini-Dev dataset, and the RAG agent uses the Hugging Face rag-mini-wikpedia dataset.

Evaluation metrics

The Open Source Bedrock Agent Evaluation framework conducts evaluations on two broad types of metrics:

  • Agent goal – Chain-of-thought (run on every question)
  • Task accuracy – RAG, text-to-SQL (run only when the specific tool is used to answer question)

Agent goal metrics measure how well an agent identifies and achieves the goals of the user. There are two main types: reference-based evaluation and no reference evaluation. Examples can be found in Agent Goal accuracy as defined by Ragas:

  • Reference-based evaluation – The user provides a reference that will be used as the ideal outcome. The metric is computed by comparing the reference with the goal achieved by the end of the workflow.
  • Evaluation without reference – The metric evaluates the performance of the LLM in identifying and achieving the goals of the user without reference.

We will showcase evaluation without reference using chain-of-thought evaluation. We conduct evaluations by comparing the agent’s reasoning and the agent’s instruction. For this evaluation, we use some metrics from the evaluator prompts for Amazon Bedrock LLM-as-a-judge. In this framework, the chain-of-thought evaluations are run on every question that the agent is evaluated against.

Task accuracy metrics measure how well an agent calls the required tools to complete a given task. For the two task accuracy metrics, RAG and text-to-SQL, evaluations are conducted based on comparing the actual agent answer against the ground truth dataset that must be provided in the input dataset. The task accuracy metrics are only evaluated when the corresponding tool is used to answer the question.

The following is a breakdown of the key metrics used in each evaluation type included in the framework:

  • RAG:
    • Faithfulness – How factually consistent a response is with the retrieved context
    • Answer relevancy – How directly and appropriately the original question is addressed
    • Context recall – How many of the relevant pieces of information were successfully retrieved
    • Semantic similarity – The assessment of the semantic resemblance between the generated answer and the ground truth
  • Text-to-SQL:
  • Chain-of-thought:
    • Helpfulness – How well the agent satisfies explicit and implicit expectations
    • Faithfulness – How well the agent sticks to available information and context
    • Instruction following – How well the agent respects all explicit directions

User-agent trajectories

The input dataset is in the form of trajectories, where each trajectory consists of one or more questions to be answered by the agent. The trajectories are meant to simulate how a user might interact with the agent. Each trajectory consists of a unique question_id, question_type, question, and ground_truth information. The following are examples of actual trajectories used to evaluate each type of agent in this post.

For more simple agent setups like the RAG and text-to-SQL sample agent, we created trajectories consisting of a single question, as shown in the following examples.

The following is an example of a RAG sample agent trajectory:

{
	"Trajectory0": [
		{
			"question_id": 0,
			"question_type": "RAG",
			"question": "Was Abraham Lincoln the sixteenth President of the United States?",
			"ground_truth": "yes"
		}
	]
}

The following is an example of a text-to-SQL sample agent trajectory:

{
	"Trajectory1": [
		{
			"question_id": 1,
			"question": "What is the highest eligible free rate for K-12 students in the schools in Alameda County?",
			"question_type": "TEXT2SQL",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT `Free Meal Count (K-12)` / `Enrollment (K-12)` FROM frpm WHERE `County Name` = 'Alameda' ORDER BY (CAST(`Free Meal Count (K-12)` AS REAL) / `Enrollment (K-12)`) DESC LIMIT 1",
				"ground_truth_sql_context": "[{'table_name': 'frpm', 'columns': [('cdscode', 'varchar'), ('academic year', 'varchar'), ...",
				"ground_truth_query_result": "1.0",
				"ground_truth_answer": "The highest eligible free rate for K-12 students in schools in Alameda County is 1.0."
		}
	]
}

Pharmaceutical research agent use case example

In this section, we demonstrate how you can use the Open Source Bedrock Agent Evaluation framework to evaluate pharmaceutical research agents discussed in the post Accelerate analysis and discovery of cancer biomarkers with Amazon Bedrock Agents . It showcases a variety of specialized agents, including a biomarker database analyst, statistician, clinical evidence researcher, and medical imaging expert in collaboration with a supervisor agent.

The pharmaceutical research agent was built using the multi-agent collaboration feature of Amazon Bedrock. The following diagram shows the multi-agent setup that was evaluated using this framework.

HCLS Agents Architecture

As shown in the diagram, the RAG evaluations will be conducted on the clinical evidence researcher sub-agent. Similarly, text-to-SQL evaluations will be run on the biomarker database analyst sub-agent. The chain-of-thought evaluation evaluates the final answer of the supervisor agent to check if it properly orchestrated the sub-agents and answered the user’s question.

Research agent trajectories

For a more complex setup like the pharmaceutical research agents, we used a set of industry relevant pregenerated test questions. By creating groups of questions based on their topic regardless of the sub-agents that might be invoked to answer the question, we created trajectories that include multiple questions spanning multiple types of tool use. With relevant questions already generated, integrating with the evaluation framework simply required properly formatting the ground truth data into trajectories.

We walk through evaluating this agent against a trajectory containing a RAG question and a text-to-SQL question:

{
	"Trajectory1": [
		{
			"question_id": 3,
			"question_type": "RAG",
			"question": "According to the knowledge base, how did the EGF pathway associate with CT imaging features?",
			"ground_truth": "The EGF pathway was significantly correlated with the presence of ground-glass opacity and irregular nodules or nodules with poorly defined margins."
		},
		{
			"question_id": 4,
			"question_type": "TEXT2SQL",
			"question": "According to the database, What percentage of patients have EGFR mutations?",
			"ground_truth": {
				"ground_truth_sql_query": "SELECT (COUNT(CASE WHEN EGFR_mutation_status = 'Mutant' THEN 1 END) * 100.0 / COUNT(*)) AS percentage FROM clinical_genomic;",
				"ground_truth_sql_context": "Table clinical_genomic: - Case_ID: VARCHAR(50) - EGFR_mutation_status: VARCHAR(50)",
				"ground_truth_query_result": "14.285714",
				"ground_truth_answer": "According to the query results, approximately 14.29% of patients in the clinical_genomic table have EGFR mutations."
			}
		}
	]
}

Chain-of-thought evaluations are conducted for every question, regardless of tool use. This will be illustrated through a set of images of agent trace and evaluations on the Langfuse dashboard.

After running the agent against the trajectory, the results are sent to Langfuse to view the metrics. The following screenshot shows the trace of the RAG question (question ID 3) evaluation on Langfuse.

Langfuse RAG Trace

The screenshot displays the following information:

  • Trace information (input and output of agent invocation)
  • Trace steps (agent generation and the corresponding sub-steps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (RAG and chain-of-thought metrics)

The following screenshot shows the trace of the text-to-SQL question (question ID 4) evaluation on Langfuse, which evaluated the biomarker database analyst agent that generates SQL queries to run against an Amazon Redshift database containing biomarker information.

Langfuse text-to-SQL Trace

The screenshot shows the following information:

  • Trace information (input and output of agent invocation)
  • Trace steps (agent generation and the corresponding sub-steps)
  • Trace metadata (input and output tokens, cost, model, agent type)
  • Evaluation metrics (text-to-SQL and chain-of-thought metrics)

The chain-of-thought evaluation is included in part of both questions’ evaluation traces. For both traces, LLM-as-a-judge is used to generate scores and explanation around an Amazon Bedrock agent’s reasoning on a given question.

Overall, we ran 56 questions grouped into 21 trajectories against the agent. The traces, model costs, and scores are shown in the following screenshot.

Langfuse Dashboard

The following table contains the average evaluation scores across 56 evaluation traces.

Metric Category Metric Type Metric Name Number of Traces Metric Avg. Value
Agent Goal COT Helpfulness 50 0.77
Agent Goal COT Faithfulness 50 0.87
Agent Goal COT Instruction following 50 0.69
Agent Goal COT Overall (average of all metrics) 50 0.77
Task Accuracy TEXT2SQL Answer correctness 26 0.83
Task Accuracy TEXT2SQL SQL semantic equivalence 26 0.81
Task Accuracy RAG Semantic similarity 20 0.66
Task Accuracy RAG Faithfulness 20 0.5
Task Accuracy RAG Answer relevancy 20 0.68
Task Accuracy RAG Context recall 20 0.53

Security considerations

Consider the following security measures:

  • Enable Amazon Bedrock agent logging – For security best practices of using Amazon Bedrock Agents, enable Amazon Bedrock model invocation logging to capture prompts and responses securely in your account.
  • Check for compliance requirements – Before implementing Amazon Bedrock Agents in your production environment, make sure that the Amazon Bedrock compliance certifications and standards align with your regulatory requirements. Refer to Compliance validation for Amazon Bedrock for more information and resources on meeting compliance requirements.

Clean up

If you deployed the sample agents, run the following notebooks to delete the resources created.

If you chose the self-hosted Langfuse option, follow these steps to clean up your AWS self-hosted Langfuse setup.

Conclusion

In this post, we introduced the Open Source Bedrock Agent Evaluation framework, a Langfuse-integrated solution that streamlines the agent development process. The framework comes equipped with built-in evaluation logic for RAG, text-to-SQL, chain-of-thought reasoning, and integration with Langfuse for viewing evaluation metrics. With the Open Source Bedrock Agent Evaluation agent, developers can quickly evaluate their agents and rapidly experiment with different configurations, accelerating the development cycle and improving agent performance.

We demonstrated how this evaluation framework can be integrated with pharmaceutical research agents. We used it to evaluate agent performance against biomarker questions and sent traces to Langfuse to view evaluation metrics across question types.

The Open Source Bedrock Agent Evaluation framework enables you to accelerate your generative AI application building process using Amazon Bedrock Agents. To self-host Langfuse in your AWS account, see Hosting Langfuse on Amazon ECS with Fargate using CDK Python. To explore how you can streamline your Amazon Bedrock Agents evaluation process, get started with Open Source Bedrock Agent Evaluation.

Refer to Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications from the Amazon Bedrock team to learn more about multi-agent collaboration and end-to-end agent evaluation.


About the authors

Hasan PoonawalaHasan Poonawala is a Senior AI/ML Solutions Architect at AWS, working with healthcare and life sciences customers. Hasan helps design, deploy, and scale generative AI and machine learning applications on AWS. He has over 15 years of combined work experience in machine learning, software development, and data science on the cloud. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Blake ShinBlake Shin is an Associate Specialist Solutions Architect at AWS who enjoys learning about and working with new AI/ML technologies. In his free time, Blake enjoys exploring the city and playing music.

Rishiraj ChandraRishiraj Chandra is an Associate Specialist Solutions Architect at AWS, passionate about building innovative artificial intelligence and machine learning solutions. He is committed to continuously learning and implementing emerging AI/ML technologies. Outside of work, Rishiraj enjoys running, reading, and playing tennis.

Read More

Enterprise-grade natural language to SQL generation using LLMs: Balancing accuracy, latency, and scale

Enterprise-grade natural language to SQL generation using LLMs: Balancing accuracy, latency, and scale

This blog post is co-written with Renuka Kumar and Thomas Matthew from Cisco.

Enterprise data by its very nature spans diverse data domains, such as security, finance, product, and HR. Data across these domains is often maintained across disparate data environments (such as Amazon Aurora, Oracle, and Teradata), with each managing hundreds or perhaps thousands of tables to represent and persist business data. These tables house complex domain-specific schemas, with instances of nested tables and multi-dimensional data that require complex database queries and domain-specific knowledge for data retrieval.

Recent advances in generative AI have led to the rapid evolution of natural language to SQL (NL2SQL) technology, which uses pre-trained large language models (LLMs) and natural language to generate database queries in the moment. Although this technology promises simplicity and ease of use for data access, converting natural language queries to complex database queries with accuracy and at enterprise scale has remained a significant challenge. For enterprise data, a major difficulty stems from the common case of database tables having embedded structures that require specific knowledge or highly nuanced processing (for example, an embedded XML formatted string). As a result, NL2SQL solutions for enterprise data are often incomplete or inaccurate.

This post describes a pattern that AWS and Cisco teams have developed and deployed that is viable at scale and addresses a broad set of challenging enterprise use cases. The methodology allows for the use of simpler, and therefore more cost-effective and lower latency, generative models by reducing the processing required for SQL generation.

Specific challenges for enterprise-scale NL2SQL

Generative accuracy is paramount for NL2SQL use cases; inaccurate SQL queries might result in a sensitive enterprise data leak, or lead to inaccurate results impacting critical business decisions. Enterprise-scale data presents specific challenges for NL2SQL, including the following:

  • Complex schemas optimized for storage (and not retrieval) – Enterprise databases are often distributed in nature and optimized for storage and not for retrieval. As a result, the table schemas are complex, involving nested tables and multi-dimensional data structures (for example, a cell containing an array of data). As a further result, creating queries for retrieval from these data stores requires specific expertise and involves complex filtering and joins.
  • Diverse and complex natural language queries – The user’s natural language input might also be complex because they might refer to a list of entities of interest or date ranges. Converting the logical meaning of these user queries into a database query can lead to overly long and complex SQL queries due to the original design of the data schema.
  • LLM knowledge gap – NL2SQL language models are typically trained on data schemas that are publicly available for education purposes and might not have the necessary knowledge complexity required of large, distributed databases in production environments. Consequently, when faced with complex enterprise table schemas or complex user queries, LLMs have difficulty generating correct query statements because they have difficulty understanding interrelationships between the values and entities of the schema.
  • LLM attention burden and latency – Queries containing multi-dimensional data often involve multi-level filtering over each cell of the data. To generate queries for cases such as these, the generative model requires more attention to support attending to the increase in relevant tables, columns, and values; analyzing the patterns; and generating more tokens. This increases the LLM’s query generation latency, and the likelihood of query generation errors, because of the LLM misunderstanding data relationships and generating incorrect filter statements.
  • Fine-tuning challenge – One common approach to achieve higher accuracy with query generation is to fine-tune the model with more SQL query samples. However, it is non-trivial to craft training data for generating SQL for embedded structures within columns (for example, JSON, or XML), to handle sets of identifiers, and so on, to get baseline performance (which is the problem we are trying to solve in the first place). This also introduces a slowdown in the development cycle.

Solution design and methodology

The solution described in this post provides a set of optimizations that solve the aforementioned challenges while reducing the amount of work that has to be performed by an LLM for generating accurate output. This work extends upon the post Generating value from enterprise data: Best practices for Text2SQL and generative AI. That post has many useful recommendations for generating high-quality SQL, and the guidelines outlined might be sufficient for your needs, depending on the inherent complexity of the database schemas.

To achieve generative accuracy for complex scenarios, the solution breaks down NL2SQL generation into a sequence of focused steps and sub-problems, narrowing the generative focus to the appropriate data domain. Using data abstractions for complex joins and data structure, this approach enables the use of smaller and more affordable LLMs for the task. This approach results in reduced prompt size and complexity for inference, reduced response latency, and improved accuracy, while enabling the use of off-the-shelf pre-trained models.

Narrowing scope to specific data domains

The solution workflow narrows down the overall schema space into the data domain targeted by the user’s query. Each data domain corresponds to the set of database data structures (tables, views, and so on) that are commonly used together to answer a set of related user queries, for an application or business domain. The solution uses the data domain to construct prompt inputs for the generative LLM.

This pattern consists of the following elements:

  • Mapping input queries to domains – This involves mapping each user query to the data domain that is appropriate for generating the response for NL2SQL at runtime. This mapping is similar in nature to intent classification, and enables the construction of an LLM prompt that is scoped for each input query (described next).
  • Scoping data domain for focused prompt construction – This is a divide-and-conquer pattern. By focusing on the data domain of the input query, redundant information, such as schemas for other data domains in the enterprise data store, can be excluded. This might be considered as a form of prompt pruning; however, it offers more than prompt reduction alone. Reducing the prompt context to the in-focus data domain enables greater scope for few-shot learning examples, declaration of specific business rules, and more.
  • Augmenting SQL DDL definitions with metadata to enhance LLM inference – This involves enhancing the LLM prompt context by augmenting the SQL DDL for the data domain with descriptions of tables, columns, and rules to be used by the LLM as guidance on its generation. This is described in more detail later in this post.
  • Determine query dialect and connection information – For each data domain, the database server metadata (such as the SQL dialect and connection URI) is captured during use case onboarding and made available at runtime to be automatically included in the prompt for SQL generation and subsequent query execution. This enables scalability through decoupling the natural language query from the specific queried data source. Together, the SQL dialect and connectivity abstractions allow for the solution to be data source agnostic; data sources might be distributed within or across different clouds, or provided by different vendors. This modularity enables scalable addition of new data sources and data domains, because each is independent.

Managing identifiers for SQL generation (resource IDs)

Resolving identifiers involves extracting the named resources, as named entities, from the user’s query and mapping the values to unique IDs appropriate for the target data source prior to NL2SQL generation. This can be implemented using natural language processing (NLP) or LLMs to apply named entity recognition (NER) capabilities to drive the resolution process. This optional step has the most value when there are many named resources and the lookup process is complex. For instance, in a user query such as “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” there are named resources: ‘allyson felix’, ‘isabelle werth’, and ‘nedo nadi’. This step allows for rapid and precise feedback to the user when a resource can’t be resolved to an identifier (for example, due to ambiguity).

This optional process of handling many or paired identifiers is included to offload the burden on LLMs for user queries with challenging sets of identifiers to be incorporated, such as those that might come in pairs (such as ID-type, ID-value), or where there are many identifiers. Rather than having the generative LLM insert each unique ID into the SQL directly, the identifiers are made available by defining a temporary data structure (such as a temporary table) and a set of corresponding insert statements. The LLM is prompted with few-shot learning examples to generate SQL for the user query by joining with the temporary data structure, rather than attempt identity injection. This results in a simpler and more consistent query pattern for cases when there are one, many, or pairs of identifiers.

Handling complex data structures: Abstracting domain data structures

This step is aimed at simplifying complex data structures into a form that can be understood by the language model without having to decipher complex inter-data relationships. Complex data structures might appear as nested tables or lists within a table column, for instance.

We can define temporary data structures (such as views and tables) that abstract complex multi-table joins, nested structures, and more. These higher-level abstractions provide simplified data structures for query generation and execution. The top-level definitions of these abstractions are included as part of the prompt context for query generation, and the full definitions are provided to the SQL execution engine, along with the generated query. The resulting queries from this process can use simple set operations (such as IN, as opposed to complex joins) that LLMs are well trained on, thereby alleviating the need for nested joins and filters over complex data structures.

Augmenting data with data definitions for prompt construction

Several of the optimizations noted earlier require making some of the specifics of the data domain explicit. Fortunately, this only has to be done when schemas and use cases are onboarded or updated. The benefit is higher generative accuracy, reduced generative latency and cost, and the ability to support arbitrarily complex query requirements.

To capture the semantics of a data domain, the following elements are defined:

  • The standard tables and views in data schema, along with comments to describe the tables and columns.
  • Join hints for the tables and views, such as when to use outer joins.
  • Data domain-specific rules, such as which columns might not appear in a final select statement.
  • The set of few-shot examples of user queries and corresponding SQL statements. A good set of examples would include a wide variety of user queries for that domain.
  • Definitions of the data schemas for any temporary tables and views used in the solution.
  • A domain-specific system prompt that specifies the role and expertise that the LLM has, the SQL dialect, and the scope of its operation.
  • A domain-specific user prompt.
  • Additionally, if temporary tables or views are used for the data domain, a SQL script is required that, when executed, creates the desired temporary data structures needs to be defined. Depending on the use case, this can be a static or dynamically generated script.

Accordingly, the prompt for generating the SQL is dynamic and constructed based on the data domain of the input question, with a set of specific definitions of data structure and rules appropriate for the input query. We refer to this set of elements as the data domain context. The purpose of the data domain context is to provide the necessary prompt metadata for the generative LLM. Examples of this, and the methods described in the previous sections, are included in the GitHub repository. There is one context for each data domain, as illustrated in the following figure.

Bringing it all together: The execution flow

This section describes the execution flow of the solution. An example implementation of this pattern is available in the GitHub repository. Access the repository to follow along with the code.

To illustrate the execution flow, we use an example database with data about Olympics statistics and another with the company’s employee vacation schedule. We follow the execution flow for the domain regarding Olympics statistics using the user query “In what games did Isabelle Werth, Nedo Nadi, and Allyson Felix compete?” to show the inputs and outputs of the steps in the execution flow, as illustrated in the following figure.

High-level processing workflow

Preprocess the request

The first step of the NL2SQL flow is to preprocess the request. The main objective of this step is to classify the user query into a domain. As explained earlier, this narrows down the scope of the problem to the appropriate data domain for SQL generation. Additionally, this step identifies and extracts the referenced named resources in the user query. These are then used to call the identity service in the next step to get the database identifiers for these named resources.

Using the earlier mentioned example, the inputs and outputs of this step are as follows:

user_query = "In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?"
pre_processed_request = request_pre_processor.run(user_query)
domain = pre_processed_request[app_consts.DOMAIN]

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'} }

Resolve identifiers (to database IDs)

This step processes the named resources’ strings extracted in the previous step and resolves them to be identifiers that can be used in database queries. As mentioned earlier, the named resources (for example, “group22”, “user123”, and “I”) are looked up using solution-specific means, such through database lookups or an ID service.

The following code shows the execution of this step in our running example:

named_resources = pre_processed_request[app_consts.NAMED_RESOURCES]
if len(named_resources) > 0:
  identifiers = id_service_facade.resolve(named_resources)
  # add identifiers to the pre_processed_request object
  pre_processed_request[app_consts.IDENTIFIERS] = identifiers
else:
  pre_processed_request[app_consts.IDENTIFIERS] = []

# Output pre_processed_request:
  {'user_query': 'In what games did Isabelle Werth, Nedo Nadi and Allyson Felix compete?',
   'domain': 'olympics',
   'named_resources': {'allyson felix', 'isabelle werth', 'nedo nadi'},
   'identifiers': [ {'id': 34551, 'role': 32, 'name': 'allyson felix'},
   {'id': 129726, 'role': 32, 'name': 'isabelle werth'},
   {'id': 84026, 'role': 32, 'name': 'nedo nadi'} ] }

Prepare the request

This step is pivotal in this pattern. Having obtained the domain and the named resources along with their looked-up IDs, we use the corresponding context for that domain to generate the following:

  • A prompt for the LLM to generate a SQL query corresponding to the user query
  • A SQL script to create the domain-specific schema

To create the prompt for the LLM, this step assembles the system prompt, the user prompt, and the received user query from the input, along with the domain-specific schema definition, including new temporary tables created as well as any join hints, and finally the few-shot examples for the domain. Other than the user query that is received as in input, other components are based on the values provided in the context for that domain.

A SQL script for creating required domain-specific temporary structures (such as views and tables) is constructed from the information in the context. The domain-specific schema in the LLM prompt, join hints, and the few-shot examples are aligned with the schema that gets generated by running this script. In our example, this step is shown in the following code. The output is a dictionary with two keys, llm_prompt and sql_preamble. The value strings for these have been clipped here; the full output can be seen in the Jupyter notebook.

prepared_request = request_preparer.run(pre_processed_request)

# Output prepared_request:
{'llm_prompt': 'You are a SQL expert. Given the following SQL tables definitions, ...
CREATE TABLE games (id INTEGER PRIMARY KEY, games_year INTEGER, ...);
...
<example>
question: How many gold medals has Yukio Endo won? answer: ```{"sql":
"SELECT a.id, count(m.medal_name) as "count"
FROM athletes_in_focus a INNER JOIN games_competitor gc ...
WHERE m.medal_name = 'Gold' GROUP BY a.id;" }```
</example>
...
'sql_preamble': [ 'CREATE temp TABLE athletes_in_focus (row_id INTEGER
PRIMARY KEY, id INTEGER, full_name TEXT DEFAULT NULL);',
'INSERT INTO athletes_in_focus VALUES
(1,84026,'nedo nadi'), (2,34551,'allyson felix'), (3,129726,'isabelle werth');"]}

Generate SQL

Now that the prompt has been prepared along with any information necessary to provide the proper context to the LLM, we provide that information to the SQL-generating LLM in this step. The goal is to have the LLM output SQL with the correct join structure, filters, and columns. See the following code:

llm_response = llm_service_facade.invoke(prepared_request[ 'llm_prompt' ])
generated_sql = llm_response[ 'llm_output' ]

# Output generated_sql:
{'sql': 'SELECT g.games_name, g.games_year FROM athletes_in_focus a
JOIN games_competitor gc ON gc.person_id = a.id
JOIN games g ON gc.games_id = g.id;'}

Execute the SQL

After the SQL query is generated by the LLM, we can send it off to the next step. At this step, the SQL preamble and the generated SQL are merged to create a complete SQL script for execution. The complete SQL script is then executed against the data store, a response is fetched, and then the response is passed back to the client or end-user. See the following code:

sql_script = prepared_request[ 'sql_preamble' ] + [ generated_sql[ 'sql' ] ]
database = app_consts.get_database_for_domain(domain)
results = rdbms_service_facade.execute_sql(database, sql_script)

# Output results:
{'rdbms_output': [
('games_name', 'games_year'),
('2004 Summer', 2004),
...
('2016 Summer', 2016)],
'processing_status': 'success'}

Solution benefits

Overall, our tests have shown several benefits, such as:

  • High accuracy – This is measured by a string matching of the generated query with the target SQL query for each test case. In our tests, we observed over 95% accuracy for 100 queries, spanning three data domains.
  • High consistency – This is measured in terms of the same SQL generated being generated across multiple runs. We observed over 95% consistency for 100 queries, spanning three data domains. With the test configuration, the queries were accurate most of the time; a small number occasionally produced inconsistent results.
  • Low cost and latency – The approach supports the use of small, low-cost, low-latency LLMs. We observed SQL generation in the 1–3 second range using models Meta’s Code Llama 13B and Anthropic’s Claude Haiku 3.
  • Scalability – The methods that we employed in terms of data abstractions facilitate scaling independent of the number of entities or identifiers in the data for a given use case. For instance, in our tests consisting of a list of 200 different named resources per row of a table, and over 10,000 such rows, we measured a latency range of 2–5 seconds for SQL generation and 3.5–4.0 seconds for SQL execution.
  • Solving complexity – Using the data abstractions for simplifying complexity enabled the accurate generation of arbitrarily complex enterprise queries, which almost certainly would not be possible otherwise.

We attribute the success of the solution with these excellent but lightweight models (compared to a Meta Llama 70B variant or Anthropic’s Claude Sonnet) to the points noted earlier, with the reduced LLM task complexity being the driving force. The implementation code demonstrates how this is achieved. Overall, by using the optimizations outlined in this post, natural language SQL generation for enterprise data is much more feasible than would be otherwise.

AWS solution architecture

In this section, we illustrate how you might implement the architecture on AWS. The end-user sends their natural language queries to the NL2SQL solution using a REST API. Amazon API Gateway is used to provision the REST API, which can be secured by Amazon Cognito. The API is linked to an AWS Lambda function, which implements and orchestrates the processing steps described earlier using a programming language of the user’s choice (such as Python) in a serverless manner. In this example implementation, where Amazon Bedrock is noted, the solution uses Anthropic’s Claude Haiku 3.

Briefly, the processing steps are as follows:

  1. Determine the domain by invoking an LLM on Amazon Bedrock for classification.
  2. Invoke Amazon Bedrock to extract relevant named resources from the request.
  3. After the named resources are determined, this step calls a service (the Identity Service) that returns identifier specifics relevant to the named resources for the task at hand. The Identity Service is logically a key/value lookup service, which might support for multiple domains.
  4. This step runs on Lambda to create the LLM prompt to generate the SQL, and to define temporary SQL structures that will be executed by the SQL engine along with the SQL generated by the LLM (in the next step).
  5. Given the prepared prompt, this step invokes an LLM running on Amazon Bedrock to generate the SQL statements that correspond to the input natural language query.
  6. This step executes the generated SQL query against the target database. In our example implementation, we used an SQLite database for illustration purposes, but you could use another database server.

The final result is obtained by running the preceding pipeline on Lambda. When the workflow is complete, the result is provided as a response to the REST API request.

The following diagram illustrates the solution architecture.

Example solution architecture

Conclusion

In this post, the AWS and Cisco teams unveiled a new methodical approach that addresses the challenges of enterprise-grade SQL generation. The teams were able to reduce the complexity of the NL2SQL process while delivering higher accuracy and better overall performance.

Though we’ve walked you through an example use case focused on answering questions about Olympic athletes, this versatile pattern can be seamlessly adapted to a wide range of business applications and use cases. The demo code is available in the GitHub repository. We invite you to leave any questions and feedback in the comments.


About the authors

Author image

Renuka Kumar is a Senior Engineering Technical Lead at Cisco, where she has architected and led the development of Cisco’s Cloud Security BU’s AI/ML capabilities in the last 2 years, including launching first-to-market innovations in this space. She has over 20 years of experience in several cutting-edge domains, with over a decade in security and privacy. She holds a PhD from the University of Michigan in Computer Science and Engineering.

Author image

Toby Fotherby is a Senior AI and ML Specialist Solutions Architect at AWS, helping customers use the latest advances in AI/ML and generative AI to scale their innovations. He has over a decade of cross-industry expertise leading strategic initiatives and master’s degrees in AI and Data Science. Toby also leads a program training the next generation of AI Solutions Architects.

author image

Shweta Keshavanarayana is a Senior Customer Solutions Manager at AWS. She works with AWS Strategic Customers and helps them in their cloud migration and modernization journey. Shweta is passionate about solving complex customer challenges using creative solutions. She holds an undergraduate degree in Computer Science & Engineering. Beyond her professional life, she volunteers as a team manager for her sons’ U9 cricket team, while also mentoring women in tech and serving the local community.

author imageThomas Matthew is an AL/ML Engineer at Cisco. Over the past decade, he has worked on applying methods from graph theory and time series analysis to solve detection and exfiltration problems found in Network security. He has presented his research and work at Blackhat and DevCon. Currently, he helps integrate generative AI technology into Cisco’s Cloud Security product offerings.

Daniel Vaquero is a Senior AI/ML Specialist Solutions Architect at AWS. He helps customers solve business challenges using artificial intelligence and machine learning, creating solutions ranging from traditional ML approaches to generative AI. Daniel has more than 12 years of industry experience working on computer vision, computational photography, machine learning, and data science, and he holds a PhD in Computer Science from UCSB.

author imageAtul Varshneya is a former Principal AI/ML Specialist Solutions Architect with AWS. He currently focuses on developing solutions in the areas of AI/ML, particularly in generative AI. In his career of 4 decades, Atul has worked as the technology R&D leader in multiple large companies and startups.

author imageJessica Wu is an Associate Solutions Architect at AWS. She helps customers build highly performant, resilient, fault-tolerant, cost-optimized, and sustainable architectures.

Read More

AWS Field Experience reduced cost and delivered low latency and high performance with Amazon Nova Lite foundation model

AWS Field Experience reduced cost and delivered low latency and high performance with Amazon Nova Lite foundation model

AWS Field Experience (AFX) empowers Amazon Web Services (AWS) sales teams with generative AI solutions built on Amazon Bedrock, improving how AWS sellers and customers interact. The AFX team uses AI to automate tasks and provide intelligent insights and recommendations, streamlining workflows for both customer-facing roles and internal support functions. Their approach emphasizes operational efficiency and practical enhancements to daily processes.

Last year, AFX introduced Account Summaries as the first in a forthcoming lineup of tools designed to support and streamline sales workflows. By integrating structured and unstructured data—from sales collateral and customer engagements to external insights and machine learning (ML) outputs—the tool delivers summarized insights that offer a comprehensive view of customer accounts. These summaries provide concise overviews and timely updates, enabling teams to make informed decisions during customer interactions.

The following screenshot shows an example of Account Summary for a customer account, including an executive summary, company overview, and recent account changes.

An image showing account summary of AnyCompany Air Lines, including financial metrics, AWS usage, support cases, and recent updates.

Migration to the Amazon Nova Light foundation model

Initially, AFX selected a range of models available on Amazon Bedrock, each chosen for its specific capabilities tailored to the diverse requirements of various summary sections. This was done to optimize accuracy, response time, and cost efficiency. However, following the introduction of state-of-the-art Amazon Nova foundation models in December 2024, the AFX team consolidated all its generative AI workload onto the Nova Lite model to capitalize on its industry-leading price performance and optimized latency.

Since moving to the Nova Lite model, the AFX team has achieved a remarkable 90% reduction in inference costs. This has empowered them to scale operations and deliver greater business value that directly supports their mission of creating efficient, high-performing sales processes.

Because Account Summaries are often used by sellers during on-the-go customer engagements, response speed is critical for maintaining seller efficiency. The Nova Lite model’s ultra-low latency helps ensure that sellers receive fast, reliable responses, without compromising on the quality of the insights.

The AFX team also highlighted the seamless migration experience, noting that their existing prompting, reasoning, and evaluation criteria transferred smoothly for the Amazon Nova Lite model without requiring significant modifications. The combination of tailored prompt controls and authorized reference content creates a bounded response framework, minimizing hallucinations elements and inaccuracies.

Overall impact

Since using the Nova Lite model, over 15,600 summaries have been generated by 3,600 sellers—with 1,500 of those sellers producing more than four summaries each. Impressively, the generative AI Account Summaries have achieved a 72% favorability rate, underscoring strong seller confidence and widespread approval.

AWS sellers report saving an average of 35 minutes per summary, a benefit that significantly boosts productivity and allocates more time for customer engagements. Additionally, about one-third of surveyed sellers noted that the summaries positively influenced their customer interactions, and those using generative AI Account Summaries experienced a 4.9% increase in the value of opportunities created.

A member of the AFX team explained, “The Amazon Nova Lite model has significantly reduced our costs without compromising performance. It allowed us to get fast, reliable account summaries, making customer interaction more productive and impactful.”

Conclusion

The AFX team’s product migration to the Nova Lite model has delivered tangible enterprise value by enhancing sales workflows. By migrating to the Amazon Nova Lite model, the team has not only achieved significant cost savings and reduced latency, but has also empowered sellers with a leading intelligent and reliable solution. This process has translated into real-world benefits—saving time, simplifying research, and bolstering customer engagement—laying a solid foundation for ongoing business goals and sustained success.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.


About the Authors

Anuj Jauhari is a Senior Product Marketing Manager at Amazon Web Services, where he helps customers realize value from innovations in generative AI.

Ashwin Nadagoudar is a Software Development Manager at Amazon Web Services, leading go-to-market (GTM) strategies and user journey initiatives with generative AI.

Sonciary Perez is a Principal Product Manager at Amazon Web Services, supporting the transformation of AWS Sales through AI-powered solutions that drive seller productivity and accelerate revenue growth.

Read More

Combine keyword and semantic search for text and images using Amazon Bedrock and Amazon OpenSearch Service

Combine keyword and semantic search for text and images using Amazon Bedrock and Amazon OpenSearch Service

Customers today expect to find products quickly and efficiently through intuitive search functionality. A seamless search journey not only enhances the overall user experience, but also directly impacts key business metrics such as conversion rates, average order value, and customer loyalty. According to a McKinsey study, 78% of consumers are more likely to make repeat purchases from companies that provide personalized experiences. As a result, delivering exceptional search functionality has become a strategic differentiator for modern ecommerce services. With ever expanding product catalogs and increasing diversity of brands, harnessing advanced search technologies is essential for success.

Semantic search enables digital commerce providers to deliver more relevant search results by going beyond keyword matching. It uses an embeddings model to create vector embeddings that capture the meaning of the input query. This helps the search be more resilient to phrasing variations and to accept multimodal inputs such as text, image, audio, and video. For example, a user inputs a query containing text and an image of a product they like, and the search engine translates both into vector embeddings using a multimodal embeddings model and retrieves related items from the catalog using embeddings similarities. To learn more about semantic search and how Amazon Prime Video uses it to help customers find their favorite content, see Amazon Prime Video advances search for sports using Amazon OpenSearch Service.

While semantic search provides contextual understanding and flexibility, keyword search remains a crucial component for a comprehensive ecommerce search solution. At its core, keyword search provides the essential baseline functionality of accurately matching user queries to product data and metadata, making sure explicit product names, brands, or attributes can be reliably retrieved. This matching capability is vital, because users often have specific items in mind when initiating a search, and meeting these explicit needs with precision is important to deliver a satisfactory experience.

Hybrid search combines the strengths of keyword search and semantic search, enabling retailers to deliver more accurate and relevant results to their customers. Based on OpenSearch blog post, hybrid search improves result quality by 8–12% compared to keyword search and by 15% compared to natural language search. However, combining keyword search and semantic search presents significant complexity because different query types provide scores on different scales. Using Amazon OpenSearch Service hybrid search, customers can seamlessly integrate these approaches by combining relevance scores from multiple search types into one unified score.

OpenSearch Service is the AWS recommended vector database for Amazon Bedrock. It’s a fully managed service that you can use to deploy, operate, and scale OpenSearch on AWS. OpenSearch is a distributed open-source search and analytics engine composed of a search engine and vector database. OpenSearch Service can help you deploy and operate your search infrastructure with native vector database capabilities delivering as low as single-digit millisecond latencies for searches across billions of vectors, making it ideal for real-time AI applications. To learn more, see Improve search results for AI using Amazon OpenSearch Service as a vector database with Amazon Bedrock.

Multimodal embedding models like Amazon Titan Multimodal Embeddings G1, available through Amazon Bedrock, play a critical role in enabling hybrid search functionality. These models generate embeddings for both text and images by representing them in a shared semantic space. This allows systems to retrieve relevant results across modalities such as finding images using text queries or combining text with image inputs.

In this post, we walk you through how to build a hybrid search solution using OpenSearch Service powered by multimodal embeddings from the Amazon Titan Multimodal Embeddings G1 model through Amazon Bedrock. This solution demonstrates how you can enable users to submit both text and images as queries to retrieve relevant results from a sample retail image dataset.

Overview of solution

In this post, you will build a solution that you can use to search through a sample image dataset in the retail space, using a multimodal hybrid search system powered by OpenSearch Service. This solution has two key workflows: a data ingestion workflow and a query workflow.

Data ingestion workflow

The data ingestion workflow generates vector embeddings for text, images, and metadata using Amazon Bedrock and the Amazon Titan Multimodal Embeddings G1 model. Then, it stores the vector embeddings, text, and metadata in an OpenSearch Service domain.

In this workflow, shown in the following figure, we use a SageMaker JupyterLab notebook to perform the following actions:

  1. Read text, images, and metadata from an Amazon Simple Storage Service (Amazon S3) bucket, and encode images in Base64 format.
  2. Send the text, images, and metadata to Amazon Bedrock using its API to generate embeddings using the Amazon Titan Multimodal Embeddings G1 model.
  3. The Amazon Bedrock API replies with embeddings to the Jupyter notebook.
  4. Store both the embeddings and metadata in an OpenSearch Service domain.

Query workflow

In the query workflow, an OpenSearch search pipeline is used to convert the query input to embeddings using the embeddings model registered with OpenSearch. Then, within the OpenSearch search pipeline results processor, results of semantic search and keyword search are combined using the normalization processor to provide relevant search results to users. Search pipelines take away the heavy lifting of building score results normalization and combination outside your OpenSearch Service domain.

The workflow consists of the following steps shown in the following figure:

  1. The client submits a query input containing text, a Base64 encoded image, or both to OpenSearch Service. Text submitted is used for both semantic and keyword search, and the image is used for semantic search.
  2. The OpenSearch search pipeline performs the keyword search using textual inputs and a neural search using vector embeddings generated by Amazon Bedrock using Titan Multimodal Embeddings G1 model.
  3. The normalization processor within the pipeline scales search results using techniques like min_max and combines keyword and semantic scores using arithmetic_mean.
  4. Ranked search results are returned to the client.

Walkthrough overview

To deploy the solution, complete the following high-level steps:

  1. Create a connector for Amazon Bedrock in OpenSearch Service.
  2. Create an OpenSearch search pipeline and enable hybrid search.
  3. Create an OpenSearch Service index for storing the multimodal embeddings and metadata.
  4. Ingest sample data to the OpenSearch Service index.
  5. Create OpenSearch Service query functions to test search functionality.

Prerequisites

For this walkthrough, you should have the following prerequisites:

The code is open source and hosted on GitHub.

Create a connector for Amazon Bedrock in OpenSearch Service

To use OpenSearch Service machine learning (ML) connectors with other AWS services, you need to set up an IAM role allowing access to that service. In this section, we demonstrate the steps to create an IAM role and then create the connector.

Create an IAM role

Complete the following steps to set up an IAM role to delegate Amazon Bedrock permissions to OpenSearch Service:

  1. Add the following policy to the new role to allow OpenSearch Service to invoke the Amazon Titan Multimodal Embeddings G1 model:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": "arn:aws:bedrock:region:account-id:foundation-model/amazon.titan-embed-image-v1"
            }
        ]
    }
  1. Modify the role trust policy as follows. You can follow the instructions in IAM role management to edit the trust relationship of the role.
    {
    	"Version": "2012-10-17",
    	"Statement": [
    		{
    			"Effect": "Allow",
    			"Principal": {
    			"Service": "opensearchservice.amazonaws.com"
    		},
    			"Action": "sts:AssumeRole"
    		}
    	]
    }

Connect an Amazon Bedrock model to OpenSearch

After you create the role, you can use the Amazon Resource Name (ARN) of the role to define the constant in the SageMaker notebook along with the OpenSearch domain endpoint. Complete the following steps:

  1. Register a model group. Note the model group ID returned in the response to register a model in a later step.
  2. Create a connector, which facilitates registering and deploying external models in OpenSearch. The response will contain the connector ID.
  3. Register the external model to the model group and deploy the model. In this step, you register and deploy the model at the same time—by setting up deploy=true, the registered model is deployed as well.

Create an OpenSearch search pipeline and enable hybrid search

A search pipeline runs inside the OpenSearch Service domain and can have three types of processors: search request processor, search response processor, and search phase result processor. For our search pipeline, we use the search phase result processor, which runs between the search phases at the coordinating node level. The processor uses the normalization processor and normalizes the score from keyword and semantic search. For hybrid search, min-max normalization and arithmetic_mean combination techniques are preferred, but you can also try L2 normalization and geometric_mean or harmonic_mean combination techniques depending on your data and use case.

payload={
	"phase_results_processors": [
		{
			"normalization-processor": {
				"normalization": {
					"technique": "min_max"
				},
				"combination": {
					"technique": "arithmetic_mean",
					"parameters": {
						"weights": [
							OPENSEARCH_KEYWORD_WEIGHT,
							1 - OPENSEARCH_KEYWORD_WEIGHT
						]
					}
				}
			}
		}
	]
}
response = requests.put(
url=f"{OPENSEARCH_ENDPOINT}/_search/pipeline/"+OPENSEARCH_SEARCH_PIPELINE_NAME,
		json=payload,
		headers={"Content-Type": "application/json"},
		auth=open_search_auth
)

Create an OpenSearch Service index for storing the multimodal embeddings and metadata

For this post, we use the Amazon Berkley Objects Dataset, which is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. In this example, we only use Shoes and listings that are in en_US as shown in section Prepare listings dataset for Amazon OpenSearch ingestion of the notebook.

Use the following code to create an OpenSearch index to ingest the sample data:

response = opensearch_client.indices.create(
	index=OPENSEARCH_INDEX_NAME,
	body={
		"settings": {
			"index.knn": True,
			"number_of_shards": 2
		},
		"mappings": {
			"properties": {
				"amazon_titan_multimodal_embeddings": {
					"type": "knn_vector",
					"dimension": 1024,
					"method": {
						"name": "hnsw",
						"engine": "lucene",
						"parameters": {}
					}
				}
			}
		}
	}
)

Ingest sample data to the OpenSearch Service index

In this step, you select the relevant features used for generating embeddings. The images are converted to Base64. The combination of a selected feature and a Base64 image is used to generate multimodal embeddings, which are stored in the OpenSearch Service index along with the metadata using a OpenSearch bulk operation, and ingest listings in batches.

Create OpenSearch Service query functions to test search functionality

With the sample data ingested, you can run queries against this data to test the hybrid search functionality. To facilitate this process, we created helper functions to perform the queries in the query workflow section of the notebook. In this section, you explore specific parts of the functions that differentiate the search methods.

Keyword search

For keyword search, send the following payload to the OpenSearch domain search endpoint:

payload = {
	"query": {
		"multi_match": { 
			"query": query_text,
		}
	},
}

Semantic search

For semantic search, you can send the text and image as part of the payload. Model_id in the request is the external embeddings model that you connected earlier. OpenSearch will invoke the model and convert text and image to embeddings.

payload = {
	"query": {
		"neural": {
			"vector_embedding": {
				"query_text": query_text,
				"query_image": query_jpg_image,
				"model_id": model_id,
				"k": 5
			}
		}
	}
}

Hybrid search

This method uses the OpenSearch pipeline you created. The payload has both the semantic and neural search.

payload = {
"query": {
	"hybrid": {
		"queries": [
				{
					"multi_match": { 
							"query": query_text,
						}
				},
				{
					"neural": {
						"vector_embedding": {
							"query_text": query_text,
							"query_image": query_jpg_image,
							"model_id": model_id,
							"k": 5
						}
					}
				}
			]
		}
	}
}

Test search methods

To compare the multiple search methods, you can query the index using query_text which provides specific information about the desired output, and query_jpg_image which provides the overall abstraction of the desired style of the output.

query_text = "leather sandals in Petal Blush"
search_image_path = '16/16e48774.jpg'

Keyword search

The following output lists the top three keyword search results. The keyword search successfully located leather sandals in the color Petal Blush, but it didn’t take the desired style into consideration.

--------------------------------------------------------------------------------------------------------------------------------
Score: 8.4351 	 Item ID: B01MYDNG7C
Item Name: Amazon Brand - The Fix Women's Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather	 Material: None 	 Color: Petal Blush	 Style: Cantu Ruffle Ankle Wrap Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 8.4351 	 Item ID: B06XH8M37Q
Item Name: Amazon Brand - The Fix Women's Farah Single Buckle Platform Dress Sandal, Petal Blush, 6.5 B US
Fabric Type: 100% Leather	 Material: None 	 Color: Petal Blush	 Style: Farah Single Buckle Platform Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 8.4351 	 Item ID: B01MSCV2YB
Item Name: Amazon Brand - The Fix Women's Conley Lucite Heel Dress Sandal,Petal Blush,7.5 B US
Fabric Type: Leather	 Material: Suede 	 Color: Petal Blush	 Style: Conley Lucite Heel Sandal
--------------------------------------------------------------------------------------------------------------------------------

 

Semantic search

Semantic search successfully located leather sandal and considered the desired style. However, the similarity to the provided images took priority over the specific color provided in query_text.

--------------------------------------------------------------------------------------------------------------------------------
Score: 0.7072 	 Item ID: B01MZF96N7
Item Name: Amazon Brand - The Fix Women's Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather	 Material: Suede 	 Color: Havana Tan	 Style: Bonilla Block Heel Cutout Tribal Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.7018 	 Item ID: B01MUG3C0Q
Item Name: Amazon Brand - The Fix Women's Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic	 Material: Leather 	 Color: Light Rose/Gold	 Style: Farrell Cutout Tribal Square Toe Flat Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.6858 	 Item ID: B01MYDNG7C
Item Name: Amazon Brand - The Fix Women's Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather	 Material: None 	 Color: Petal Blush	 Style: Cantu Ruffle Ankle Wrap Sandal
--------------------------------------------------------------------------------------------------------------------------------

 

Hybrid search

Hybrid search returned similar results to the semantic search because they use the same embeddings model. However, by combining the output of keyword and semantic searches, the ranking of the Petal Blush sandal that most closely matches query_jpg_image increases, moving it the top of the results list.

--------------------------------------------------------------------------------------------------------------------------------
Score: 0.6838 	 Item ID: B01MYDNG7C
Item Name: Amazon Brand - The Fix Women's Cantu Ruffle Ankle Wrap Dress Sandal, Petal Blush, 9.5 B US
Fabric Type: Leather	 Material: None 	 Color: Petal Blush	 Style: Cantu Ruffle Ankle Wrap Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.6 	 Item ID: B01MZF96N7
Item Name: Amazon Brand - The Fix Women's Bonilla Block Heel Cutout Tribal Dress Sandal, Havana Tan, 7 B US
Fabric Type: Leather	 Material: Suede 	 Color: Havana Tan	 Style: Bonilla Block Heel Cutout Tribal Sandal
--------------------------------------------------------------------------------------------------------------------------------
Score: 0.5198 	 Item ID: B01MUG3C0Q
Item Name: Amazon Brand - The Fix Women's Farrell Triangle-Cutout Square Toe Flat Dress Sandal, Light Rose/Gold, 7.5 B US
Fabric Type: Synthetic	 Material: Leather 	 Color: Light Rose/Gold	 Style: Farrell Cutout Tribal Square Toe Flat Sandal
--------------------------------------------------------------------------------------------------------------------------------

 

Clean up

After you complete this walkthrough, clean up all the resources you created as part of this post. This is an important step to make sure you don’t incur any unexpected charges. If you used an existing OpenSearch Service domain, in the Cleanup section of the notebook, we provide suggested cleanup actions, including delete the index, un-deploy the model, delete the model, delete the model group, and delete the Amazon Bedrock connector. If you created an OpenSearch Service domain exclusively for this exercise, you can bypass these actions and delete the domain.

Conclusion

In this post, we explained how to implement multimodal hybrid search by combining keyword and semantic search capabilities using Amazon Bedrock and Amazon OpenSearch Service. We showcased a solution that uses Amazon Titan Multimodal Embeddings G1 to generate embeddings for text and images, enabling users to search using both modalities. The hybrid approach combines the strengths of keyword search and semantic search, delivering accurate and relevant results to customers.

We encourage you to test the notebook in your own account and get firsthand experience with hybrid search variations. In addition to the outputs shown in this post, we provide a few variations in the notebook. If you’re interested in using custom embeddings models in Amazon SageMaker AI instead, see Hybrid Search with Amazon OpenSearch Service. If you want a solution that offers semantic search only, see Build a contextual text and image search engine for product recommendations using Amazon Bedrock and Amazon OpenSearch Serverless and Build multimodal search with Amazon OpenSearch Service.


About the Authors

Renan Bertolazzi is an Enterprise Solutions Architect helping customers realize the potential of cloud computing on AWS. In this role, Renan is a technical leader advising executives and engineers on cloud solutions and strategies designed to innovate, simplify, and deliver results.

Birender Pal is a Senior Solutions Architect at AWS, where he works with strategic enterprise customers to design scalable, secure and resilient cloud architectures. He supports digital transformation initiatives with a focus on cloud-native modernization, machine learning, and Generative AI. Outside of work, Birender enjoys experimenting with recipes from around the world.

Sarath Krishnan is a Senior Solutions Architect with Amazon Web Services. He is passionate about enabling enterprise customers on their digital transformation journey. Sarath has extensive experience in architecting highly available, scalable, cost-effective, and resilient applications on the cloud. His area of focus includes DevOps, machine learning, MLOps, and generative AI.

Read More

Build an AI-powered document processing platform with open source NER model and LLM on Amazon SageMaker

Build an AI-powered document processing platform with open source NER model and LLM on Amazon SageMaker

Archival data in research institutions and national laboratories represents a vast repository of historical knowledge, yet much of it remains inaccessible due to factors like limited metadata and inconsistent labeling. Traditional keyword-based search mechanisms are often insufficient for locating relevant documents efficiently, requiring extensive manual review to extract meaningful insights.

To address these challenges, a U.S. National Laboratory has implemented an AI-driven document processing platform that integrates named entity recognition (NER) and large language models (LLMs) on Amazon SageMaker AI. This solution improves the findability and accessibility of archival records by automating metadata enrichment, document classification, and summarization. By using Mixtral-8x7B for abstractive summarization and title generation, alongside a BERT-based NER model for structured metadata extraction, the system significantly improves the organization and retrieval of scanned documents.

Designed with a serverless, cost-optimized architecture, the platform provisions SageMaker endpoints dynamically, providing efficient resource utilization while maintaining scalability. The integration of modern natural language processing (NLP) and LLM technologies enhances metadata accuracy, enabling more precise search functionality and streamlined document management. This approach supports the broader goal of digital transformation, making sure that archival data can be effectively used for research, policy development, and institutional knowledge retention.

In this post, we discuss how you can build an AI-powered document processing platform with open source NER and LLMs on SageMaker.

Solution overview

The NER & LLM Gen AI Application is a document processing solution built on AWS that combines NER and LLMs to automate document analysis at scale. The system addresses the challenges of processing large volumes of textual data by using two key models: Mixtral-8x7B for text generation and summarization, and a BERT NER model for entity recognition.

The following diagram illustrates the solution architecture.

The architecture implements a serverless design with dynamically managed SageMaker endpoints that are created on demand and destroyed after use, optimizing performance and cost-efficiency. The application follows a modular structure with distinct components handling different aspects of document processing, including extractive summarization, abstractive summarization, title generation, and author extraction. These modular pieces can be removed, replaced, duplicated, and patterned against for optimal reusability.

The processing workflow begins when documents are detected in the Extracts Bucket, triggering a comparison against existing processed files to prevent redundant operations. The system then orchestrates the creation of necessary model endpoints, processes documents in batches for efficiency, and automatically cleans up resources upon completion. Multiple specialized Amazon Simple Storage Service Buckets (Amazon S3 Bucket) store different types of outputs.

Click here to open the AWS console and follow along. 

Solution Components

Storage architecture

The application uses a multi-bucket Amazon S3 storage architecture designed for clarity, efficient processing tracking, and clear separation of document processing stages. Each bucket serves a specific purpose in the pipeline, providing organized data management and simplified access control. Amazon DynamoDB is used to track the processing of each document.

The bucket types are as follows:

  • Extracts – Source documents for processing
  • Extractive summary – Key sentence extractions
  • Abstractive summary – LLM-generated summaries
  • Generated titles – LLM-generated titles
  • Author information – Name extraction using NER
  • Model weights – ML model storage

SageMaker endpoints

The SageMaker endpoints in this application represent a dynamic, cost-optimized approach to machine learning (ML) model deployment. Rather than maintaining constantly running endpoints, the system creates them on demand when document processing begins and automatically stops them upon completion. Two primary endpoints are managed: one for the Mixtral-8x7B LLM, which handles text generation tasks including abstractive summarization and title generation, and another for the BERT-based NER model responsible for author extraction. This endpoint based architecture provides decoupling between the other processing, allowing independent scaling, versioning, and maintenance of each component. The decoupled nature of the endpoints also provides flexibility to update or replace individual models without impacting the broader system architecture.

The endpoint lifecycle is orchestrated through dedicated AWS Lambda functions that handle creation and deletion. When processing is triggered, endpoints are automatically initialized and model artifacts are downloaded from Amazon S3. The LLM endpoint is provisioned on ml.p4d.24xlarge (GPU) instances to provide sufficient computational power for the LLM operations. The NER endpoint is deployed on a ml.c5.9xlarge instance (CPU), which is sufficient to support this language model. To maximize cost-efficiency, the system processes documents in batches while the endpoints are active, allowing multiple documents to be processed during a single endpoint deployment cycle and maximizing the usage of the endpoints.

For usage awareness, the endpoint management system includes notification mechanisms through Amazon Simple Notification Service (Amazon SNS). Users receive notifications when endpoints are destroyed, providing visibility that a large instance is destroyed and not idling. The entire endpoint lifecycle is integrated into the broader workflow through AWS Step Functions, providing coordinated processing across all components of the application.

Step Functions workflow

The following figure illustrates the Step Functions workflow.

The application implements a processing pipeline through AWS Step Functions, orchestrating a series of Lambda functions that handle distinct aspects of document analysis. Multiple documents are processed in batches while endpoints are active, maximizing resource utilization. When processing is complete, the workflow automatically triggers endpoint deletion, preventing unnecessary resource consumption.

The highly modular Lambda functions are designed for flexibility and extensibility, enabling their adaptation for diverse use cases beyond their default implementations. For example, the abstractive summarization can be reused to do QnA or other forms of generation, and the NER model can be used to recognize other entity types such as organizations or locations.

Logical flow

The document processing workflow orchestrates multiple stages of analysis that operate both in parallel and sequential patterns. The Step Functions coordinates the movement of documents through extractive summarization, abstractive summarization, title generation, and author extraction processes. Each stage is managed as a discrete step, with clear input and output specifications, as illustrated in the following figure.

In the following sections, we look at each step of the logical flow in more detail.

Extractive summarization:

The extractive summarization process employs the TextRank algorithm, powered by sumy and NLTK libraries, to identify and extract the most significant sentences from source documents. This approach treats sentences as nodes within a graph structure, where the importance of each sentence is determined by its relationships and connections to other sentences. The algorithm analyzes these interconnections to identify key sentences that best represent the document’s core content, functioning similarly to how an editor would select the most important passages from a text. This method preserves the original wording while reducing the document to its most essential components.

Generate title:

The title generation process uses the Mixtral-8x7B model but focuses on creating concise, descriptive titles that capture the document’s main theme. It uses the extractive summary as input to provide efficiency and focus on key content. The LLM is prompted to analyze the main topics and themes present in the summary and generate an appropriate title that effectively represents the document’s content. This approach makes sure that generated titles are both relevant and informative, providing users with a quick understanding of the document’s subject matter without needing to read the full text.

Abstractive summarization:

Abstractive summarization also uses the Mixtral-8x7B LLM to generate entirely new text that captures the essence of the document. Unlike extractive summarization, this method doesn’t simply select existing sentences, but creates new content that paraphrases and restructures the information. The process takes the extractive summary as input, which helps reduce computation time and costs by focusing on the most relevant content. This approach results in summaries that read more naturally and can effectively condense complex information into concise, readable text.

Extract author:

Author extraction employs a BERT NER model to identify and classify author names within documents. The process specifically focuses on the first 1,500 characters of each document, where author information typically appears. The system follows a three-stage process: first, it detects potential name tokens with confidence scoring; second, it assembles related tokens into complete names; and finally, it validates the assembled names to provide proper formatting and eliminate false positives. The model can recognize various entity types (PER, ORG, LOC, MISC) but is specifically tuned to identify person names in the context of document authorship.

Cost and Performance

The solution achieves remarkable throughput by processing 100,000 documents within a 12-hour window. Key architectural decisions drive both performance and cost optimization. By implementing extractive summarization as an initial step, the system reduces input tokens by 75-90% (depending on the size of the document), substantially decreasing the workload for downstream LLM processing. The implementation of a dedicated NER model for author extraction yields an additional 33% reduction in LLM calls by bypassing the need for the more resource-intensive language model. These strategic optimizations create a compound effect – accelerating processing speeds while simultaneously reducing operational costs – establishing the platform as an efficient and cost-effective solution for enterprise-scale document processing needs. To estimate cost for processing 100,000 documents, multiply 12 by the cost per hour of the ml.p4d.24xlarge instance in your AWS region. It’s important to note that instance costs vary by region and may change over time, so current pricing should be consulted for accurate cost projections.

Deploy the Solution

To deploy follow along the instruction in the GitHub repo.

Clean up

Clean up instructions can be found in this section.

Conclusion

The NER & LLM Gen AI Application represents an organizational advancement in automated document processing, using powerful language models in an efficient serverless architecture. Through its implementation of both extractive and abstractive summarization, named entity recognition, and title generation, the system demonstrates the practical application of modern AI technologies in handling complex document analysis tasks. The application’s modular design and flexible architecture enable organizations to adapt and extend its capabilities to meet their specific needs, while the careful management of AWS resources through dynamic endpoint creation and deletion maintains cost-effectiveness. As organizations continue to face growing demands for efficient document processing, this solution provides a scalable, maintainable and customizable framework for automating and streamlining these workflows.

References:


About the Authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Dr. Ian Lunsford is an Aerospace Cloud Consultant at AWS Professional Services. He integrates cloud services into aerospace applications. Additionally, Ian focuses on building AI/ML solutions using AWS services.

Max Rathmann is a Senior DevOps Consultant at Amazon Web Services, where she specializes in architecting cloud-native, server-less applications. She has a background in operationalizing AI/ML solutions and designing MLOps solutions with AWS Services.

Michael Massey is a Cloud Application Architect at Amazon Web Services, where he specializes in building frontend and backend cloud-native applications. He designs and implements scalable and highly-available solutions and architectures that help customers achieve their business goals.

Jeff Ryan is a DevOps Consultant at AWS Professional Services, specializing in AI/ML, automation, and cloud security implementations. He focuses on helping organizations leverage AWS services like Bedrock, Amazon Q, and SageMaker to build innovative solutions. His expertise spans MLOps, GenAI, serverless architectures, and Infrastructure as Code (IaC).

Dr. Brian Weston is a research manager at the Center for Applied Scientific Computing, where he is the AI/ML Lead for the Digital Twins for Additive Manufacturing Strategic Initiative, a project focused on building digital twins for certification and qualification of 3D printed components. He also holds a program liaison role between scientists and IT staff, where Weston champions the integration of cloud computing with digital engineering transformation, driving efficiency and innovation for mission science projects at the laboratory.

Ian Thompson is a Data Engineer at Enterprise Knowledge, specializing in graph application development and data catalog solutions. His experience includes designing and implementing graph architectures that improve data discovery and analytics across organizations. He is also the #1 Square Off player in the world.

Anna D’Angela is a Data Engineer at Enterprise Knowledge within the Semantic Engineering and Enterprise AI practice. She specializes in the design and implementation of knowledge graphs.

Read More

Protect sensitive data in RAG applications with Amazon Bedrock

Protect sensitive data in RAG applications with Amazon Bedrock

Retrieval Augmented Generation (RAG) applications have become increasingly popular due to their ability to enhance generative AI tasks with contextually relevant information. Implementing RAG-based applications requires careful attention to security, particularly when handling sensitive data. The protection of personally identifiable information (PII), protected health information (PHI), and confidential business data is crucial because this information flows through RAG systems. Failing to address these security considerations can lead to significant risks and potential data breaches. For healthcare organizations, financial institutions, and enterprises handling confidential information, these risks can result in regulatory compliance violations and breach of customer trust. See the OWASP Top 10 for Large Language Model Applications to learn more about the unique security risks associated with generative AI applications.

Developing a comprehensive threat model for your generative AI applications can help you identify potential vulnerabilities related to sensitive data leakage, prompt injections, unauthorized data access, and more. To assist in this effort, AWS provides a range of generative AI security strategies that you can use to create appropriate threat models.

Amazon Bedrock Knowledge Bases is a fully managed capability that simplifies the management of the entire RAG workflow, empowering organizations to give foundation models (FMs) and agents contextual information from your private data sources to deliver more relevant and accurate responses tailored to your specific needs. Additionally, with Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can redact sensitive information such as PII to protect privacy using Amazon Bedrock Guardrails.

RAG workflow: Converting data to actionable knowledge

RAG consists of two major steps:

  • Ingestion – Preprocessing unstructured data, which includes converting the data into text documents and splitting the documents into chunks. Document chunks are then encoded with an embedding model to convert them to document embeddings. These encoded document embeddings along with the original document chunks in the text are then stored to a vector store, such as Amazon OpenSearch Service.
  • Augmented retrieval – At query time, the user’s query is first encoded with the same embedding model to convert the query into a query embedding. The generated query embedding is then used to perform a similarity search on the stored document embeddings to find and retrieve semantically similar document chunks to the query. After the document chunks are retrieved, the user prompt is augmented by passing the retrieved chunks as additional context, so that the text generation model can answer the user query using the retrieved context. If sensitive data isn’t sanitized before ingestion, this might lead to retrieving sensitive data from the vector store and inadvertently leak the sensitive data to unauthorized users as part of the model response.

The following diagram shows the architectural workflow of a RAG system, illustrating how a user’s query is processed through multiple stages to generate an informed response

Bedrock Knowledge Base Flow

Solution overview

In this post we present two architecture patterns: data redaction at storage level and role-based access, for protecting sensitive data when building RAG-based applications using Amazon Bedrock Knowledge Bases.

Data redaction at storage level – Identifying and redacting (or masking) sensitive data before storing them to the vector store (ingestion) using Amazon Bedrock Knowledge Bases. This zero-trust approach to data sensitivity reduces the risk of sensitive information being inadvertently disclosed to unauthorized users.

Role-based access to sensitive data – Controlling selective access to sensitive information based on user roles and permissions during retrieval. This approach is best in situations where sensitive data needs to be stored in the vector store, such as in healthcare settings with distinct user roles like administrators (doctors) and non-administrators (nurses or support personnel).

For all data stored in Amazon Bedrock, the AWS shared responsibility model applies.

Let’s dive in to understand how to implement the data redaction at storage level and role-based access architecture patterns effectively.

Scenario 1: Identify and redact sensitive data before ingesting into the vector store

The ingestion flow implements a four-step process to help protect sensitive data when building RAG applications with Amazon Bedrock:

  1. Source document processing – An AWS Lambda function monitors the incoming text documents landing to a source Amazon Simple Storage Service (Amazon S3) bucket and triggers an Amazon Comprehend PII redaction job to identify and redact (or mask) sensitive data in the documents. An Amazon EventBridge rule triggers the Lambda function every 5 minutes. The document processing pipeline described here only processes text documents. To handle documents containing embedded images, you should implement additional preprocessing steps to extract and analyze images separately before ingestion.
  2. PII identification and redaction – The Amazon Comprehend PII redaction job analyzes the text content to identify and redact PII entities. For example, the job identifies and redacts sensitive data entities like name, email, address, and other financial PII entities.
  3. Deep security scanning – After redaction, documents move to another folder where Amazon Macie verifies redaction effectiveness and identifies any remaining sensitive data objects. Documents flagged by Macie go to a quarantine bucket for manual review, while cleared documents move to a redacted bucket ready for ingestion. For more details on data ingestion, see Sync your data with your Amazon Bedrock knowledge base.
  4. Secure knowledge base integration – Redacted documents are ingested into the knowledge base through a data ingestion job. In case of multi-modal content, for enhanced security, consider implementing:
    • A dedicated image extraction and processing pipeline.
    • Image analysis to detect and redact sensitive visual information.
    • Amazon Bedrock Guardrails to filter inappropriate image content during retrieval.

This multi-layered approach focuses on securing text content while highlighting the importance of implementing additional safeguards for image processing. Organizations should evaluate their multi-modal document requirements and extend the security framework accordingly.

Ingestion flow

The following illustration demonstrates a secure document processing pipeline for handling sensitive data before ingestion into Amazon Bedrock Knowledge Bases.

Scenario 1 - Ingestion Flow

The high-level steps are as follows:

  1. The document ingestion flow begins when documents containing sensitive data are uploaded to a monitored inputs folder in the source bucket. An EventBridge rule triggers a Lambda function (ComprehendLambda).
  2. The ComprehendLambda function monitors for new files in the inputs folder of the source bucket and moves landed files to a processing folder. It then launches an asynchronous Amazon Comprehend PII redaction analysis job and records the job ID and status in an Amazon DynamoDB JobTracking table for monitoring job completion. The Amazon Comprehend PII redaction job automatically redacts and masks sensitive elements such as names, addresses, phone numbers, Social Security numbers, driver’s license IDs, and banking information with the entity type. The job replaces these identified PII entities with placeholder tokens, such as [NAME], [SSN] etc. The entities to mask can be configured using RedactionConfig. For more information, see Redacting PII entities with asynchronous jobs (API). The MaskMode in RedactionConfig is set to REPLACE_WITH_PII_ENTITY_TYPE instead of MASK; redacting with a MaskCharacter would affect the quality of retrieved documents because many documents could contain the same MaskCharacter, thereby affecting the retrieval quality. After completion, the redacted files move to the for_macie_scan folder for secondary scanning.
  3. The secondary verification phase employs Macie for additional sensitive data detection on the redacted files. Another Lambda function (MacieLambda) monitors the completion of the Amazon Comprehend PII redaction job. When the job is complete, the function triggers a Macie one-time sensitive data detection job with files in the for_macie_scan folder.
  4. The final stage integrates with the Amazon Bedrock knowledge base. The findings from Macie determine the next steps: files with high severity ratings (3 or higher) are moved to a quarantine folder for human review by authorized personnel with appropriate permissions and access controls, whereas files with low severity ratings are moved to a designated redacted bucket, which then triggers a data ingestion job to the Amazon Bedrock knowledge base.

This process helps prevent sensitive details from being exposed when the model generates responses based on retrieved data.

Augmented retrieval flow

The augmented retrieval flow diagram shows how user queries are processed securely. It illustrates the complete workflow from user authentication through Amazon Cognito to response generation with Amazon Bedrock, including guardrail interventions that help prevent policy violations in both inputs and outputs.

Scenario 1 - Retrieval Flow

The high-level steps are as follows:

  1. For our demo, we use a web application UI built using Streamlit. The web application launches with a login form with user name and password fields.
  2. The user enters the credentials and logs in. User credentials are authenticated using Amazon Cognito user pools. Amazon Cognito acts as our OpenID connect (OIDC) identity provider (IdP) to provide authentication and authorization services for this application. After authentication, Amazon Cognito generates and returns identity, access and refresh tokens in JSON web token (JWT) format back to the web application. Refer to Understanding user pool JSON web tokens (JWTs) for more information.
  3. After the user is authenticated, they are logged in to the web application, where an AI assistant UI is presented to the user. The user enters their query (prompt) in the assistant’s text box. The query is then forwarded using a REST API call to an Amazon API Gateway endpoint along with the access tokens in the header.
  4. API Gateway forwards the payload along with the claims included in the header to a conversation orchestrator Lambda function.
  5. The conversation orchestrator Lambda function processes the user prompt and model parameters received from the UI and calls the RetrieveAndGenerate API to the Amazon Bedrock knowledge base. Input guardrails are first applied to this request to perform input validation on the user query.
    • The guardrail evaluates and applies predefined responsible AI policies using content filters, denied topic filters and word filters on user input. For more information on creating guardrail filters, see Create a guardrail.
    • If the predefined input guardrail policies are triggered on the user input, the guardrails intervene and return a preconfigured message like, “Sorry, your query violates our usage policy.”
    • Requests that don’t trigger a guardrail policy will retrieve the documents from the knowledge base and generate a response using the RetrieveAndGenerate. Optionally, if users choose to run Retrieve separately, guardrails can also be applied at this stage. Guardrails during document retrieval can help block sensitive data returned from the vector store.
  6. During retrieval, Amazon Bedrock Knowledge Bases encodes the user query using the Amazon Titan Text v2 embeddings model to generate a query embedding.
  7. Amazon Bedrock Knowledge Bases performs a similarity search with the query embedding against the document embeddings in the OpenSearch Service vector store and retrieves top-k chunks. Optionally, post-retrieval, you can incorporate a reranking model to improve the retrieved results quality from the OpenSearch vector store. Refer to Improve the relevance of query responses with a reranker model in Amazon Bedrock for more details.
  8. Finally, the user prompt is augmented with the retrieved document chunks from the vector store as context and the final prompt is sent to an Amazon Bedrock foundation model (FM) for inference. Output guardrail policies are again applied post-response generation. If the predefined output guardrail policies are triggered, the model generates a predefined response like “Sorry, your query violates our usage policy.” If no policies are triggered, then the large language model (LLM) generated response is sent to the user.

To deploy Scenario 1, find the instructions here on Github

Scenario 2: Implement role-based access to PII data during retrieval

In this scenario, we demonstrate a comprehensive security approach that combines role-based access control (RBAC) with intelligent PII guardrails for RAG applications. It integrates Amazon Bedrock with AWS identity services to automatically enforce security through different guardrail configurations for admin and non-admin users.

The solution uses the metadata filtering capabilities of Amazon Bedrock Knowledge Bases to dynamically filter documents during similarity searches using metadata attributes assigned before ingestion. For example, admin and non-admin metadata attributes are created and attached to relevant documents before the ingestion process. During retrieval, the system returns only the documents with metadata matching the user’s security role and permissions and applies the relevant guardrail policies to either mask or block sensitive data detected on the LLM output.

This metadata-driven approach, combined with features like custom guardrails, real-time PII detection, masking, and comprehensive access logging creates a robust framework that maintains the security and utility of the RAG application while enforcing RBAC.

The following diagram illustrates how RBAC works with metadata filtering in the vector database.

Amazon Bedrock Knowledge Bases metadata filtering

For a detailed understanding of how metadata filtering works, see Amazon Bedrock Knowledge Bases now supports metadata filtering to improve retrieval accuracy.

Augmented retrieval flow

The augmented retrieval flow diagram shows how user queries are processed securely based on role-based access.

Scenario 2 - Retrieval flow

The workflow consists of the following steps:

  1. The user is authenticated using an Amazon Cognito user pool. It generates a validation token after successful authentication.
  2. The user query is sent using an API call along with the authentication token through Amazon API Gateway.
  3. Amazon API Gateway forwards the payload and claims to an integration Lambda function.
  4. The Lambda function extracts the claims from the header and checks for user role and determines whether to use an admin guardrail or a non-admin guardrail based on the access level.
  5. Next, the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is invoked along with the guardrail applied on the user input.
  6. Amazon Bedrock Knowledge Bases embeds the query using the Amazon Titan Text v2 embeddings model.
  7. Amazon Bedrock Knowledge Bases performs similarity searches on the OpenSearch Service vector database and retrieves relevant chunks (optionally, you can improve the relevance of query responses using a reranker model in the knowledge base).
  8. The user prompt is augmented with the retrieved context from the previous step and sent to the Amazon Bedrock FM for inference.
  9. Based on the user role, the LLM output is evaluated against defined Responsible AI policies using either admin or non-admin guardrails.
  10. Based on guardrail evaluation, the system either returns a “Sorry! Cannot Respond” message if the guardrail intervenes, or delivers an appropriate response with no masking on the output for admin users or sensitive data masked for non-admin users.

To deploy Scenario 2, find the instructions here on Github

This security architecture combines Amazon Bedrock guardrails with granular access controls to automatically manage sensitive information exposure based on user permissions. The multi-layered approach makes sure organizations maintain security compliance while fully utilizing their knowledge base, proving security and functionality can coexist.

Customizing the solution

The solution offers several customization points to enhance its flexibility and adaptability:

  • Integration with external APIs – You can integrate existing PII detection and redaction solutions with this system. The Lambda function can be modified to use custom APIs for PHI or PII handling before calling the Amazon Bedrock Knowledge Bases API.
  • Multi-modal processing – Although the current solution focuses on text, it can be extended to handle images containing PII by incorporating image-to-text conversion and caption generation. For more information about using Amazon Bedrock for processing multi-modal content during ingestion, see Parsing options for your data source.
  • Custom guardrails – Organizations can implement additional specialized security measures tailored to their specific use cases.
  • Structured data handling – For queries involving structured data, the solution can be customized to include Amazon Redshift as a structured data store as opposed to OpenSearch Service. Data masking and redaction on Amazon Redshift can be achieved by applying dynamic data masking (DDM) policies, including fine-grained DDM policies like role-based access control and column-level policies using conditional dynamic data masking.
  • Agentic workflow integration – When incorporating an Amazon Bedrock knowledge base with an agentic workflow, additional safeguards can be implemented to protect sensitive data from external sources, such as API calls, tool use, agent action groups, session state, and long-term agentic memory.
  • Response streaming support – The current solution uses a REST API Gateway endpoint that doesn’t support streaming. For streaming capabilities, consider WebSocket APIs in API Gateway, Application Load Balancer (ALB), or custom solutions with chunked responses using client-side reassembly or long-polling techniques.

With these customization options, you can tailor the solution to your specific needs, providing a robust and flexible security framework for your RAG applications. This approach not only protects sensitive data but also maintains the utility and efficiency of the knowledge base, allowing users to interact with the system while automatically enforcing role-appropriate information access and PII handling.

Shared security responsibility: The customer’s role

At AWS, security is our top priority and security in the cloud is a shared responsibility between AWS and our customers. With AWS, you control your data by using AWS services and tools to determine where your data is stored, how it is secured, and who has access to it. Services such as AWS Identity and Access Management (IAM) provide robust mechanisms for securely controlling access to AWS services and resources.

To enhance your security posture further, services like AWS CloudTrail and Amazon Macie offer advanced compliance, detection, and auditing capabilities. When it comes to encryption, AWS CloudHSM and AWS Key Management Service (KMS) enable you to generate and manage encryption keys with confidence.

For organizations seeking to establish governance and maintain data residency controls, AWS Control Tower offers a comprehensive solution. For more information on Data protection and Privacy, refer to Data Protection and Privacy at AWS.

While our solution demonstrates the use of PII detection and redaction techniques, it does not provide an exhaustive list of all PII types or detection methods. As a customer, you bear the responsibility for implementing the appropriate PII detection types and redaction methods using AWS services, including Amazon Bedrock Guardrails and other open-source libraries. The regular expressions configured in Bedrock Guardrails within this solution serve as a reference example only and do not cover all possible variations for detecting PII types. For instance, date of birth (DOB) formats can vary widely. Therefore, it falls on you to configure Bedrock Guardrails and policies to accurately detect the PII types relevant to your use case. Amazon Bedrock maintains strict data privacy standards. The service does not store or log your prompts and completions, nor does it use them to train AWS models or share them with third parties. We implement this through our Model Deployment Account architecture – each AWS Region where Amazon Bedrock is available has a dedicated deployment account per model provider, managed exclusively by the Amazon Bedrock service team. Model providers have no access to these accounts. When a model is delivered to AWS, Amazon Bedrock performs a deep copy of the provider’s inference and training software into these controlled accounts for deployment, making sure that model providers cannot access Amazon Bedrock logs or customer prompts and completions.

Ultimately, while we provide the tools and infrastructure, the responsibility for securing your data using AWS services rests with you, the customer. This shared responsibility model makes sure that you have the flexibility and control to implement security measures that align with your unique requirements and compliance needs, while we maintain the security of the underlying cloud infrastructure. For comprehensive information about Amazon Bedrock security, please refer to the Amazon Bedrock Security documentation.

Conclusion

In this post, we explored two approaches for securing sensitive data in RAG applications using Amazon Bedrock. The first approach focused on identifying and redacting sensitive data before ingestion into an Amazon Bedrock knowledge base, and the second demonstrated a fine-grained RBAC pattern for managing access to sensitive information during retrieval. These solutions represent just two possible approaches among many for securing sensitive data in generative AI applications.

Security is a multi-layered concern that requires careful consideration across all aspects of your application architecture. Looking ahead, we plan to dive deeper into RBAC for sensitive data within structured data stores when used with Amazon Bedrock Knowledge Bases. This can provide additional granularity and control over data access patterns while maintaining security and compliance requirements. Securing sensitive data in RAG applications requires ongoing attention to evolving security best practices, regular auditing of access patterns, and continuous refinement of your security controls as your applications and requirements grow.

To enhance your understanding of Amazon Bedrock security implementation, explore these additional resources:

The complete source code and deployment instructions for these solutions are available in our GitHub repository.

We encourage you to explore the repository for detailed implementation guidance and customize the solutions based on your specific requirements using the customization points discussed earlier.


About the authors

Praveen Chamarthi brings exceptional expertise to his role as a Senior AI/ML Specialist at Amazon Web Services, with over two decades in the industry. His passion for Machine Learning and Generative AI, coupled with his specialization in ML inference on Amazon SageMaker and Amazon Bedrock, enables him to empower organizations across the Americas to scale and optimize their ML operations. When he’s not advancing ML workloads, Praveen can be found immersed in books or enjoying science fiction films. Connect with him on LinkedIn to follow his insights.

Srikanth Reddy is a Senior AI/ML Specialist with Amazon Web Services. He is responsible for providing deep, domain-specific expertise to enterprise customers, helping them use AWS AI and ML capabilities to their fullest potential. You can find him on LinkedIn.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Vivek Bhadauria is a Principal Engineer at Amazon Bedrock with almost a decade of experience in building AI/ML services. He now focuses on building generative AI services such as Amazon Bedrock Agents and Amazon Bedrock Guardrails. In his free time, he enjoys biking and hiking.

Brandon Rooks Sr. is a Cloud Security Professional with 20+ years of experience in the IT and Cybersecurity field. Brandon joined AWS in 2019, where he dedicates himself to helping customers proactively enhance the security of their cloud applications and workloads. Brandon is a lifelong learner, and holds the CISSP, AWS Security Specialty, and AWS Solutions Architect Professional certifications. Outside of work, he cherishes moments with his family, engaging in various activities such as sports, gaming, music, volunteering, and traveling.

Vikash Garg is a Principal Engineer at Amazon Bedrock with almost 4 years of experience in building AI/ML services. He has a decade of experience in building large-scale systems. He now focuses on building the generative AI service AWS Bedrock Guardrails. In his free time, he enjoys hiking and traveling.

Read More