Effective cross-lingual LLM evaluation with Amazon Bedrock

Evaluating the quality of AI responses across multiple languages presents significant challenges for organizations deploying generative AI solutions globally. How can you maintain consistent performance when human evaluations require substantial resources, especially across diverse languages? Many companies find themselves struggling to scale their evaluation processes without compromising quality or breaking their budgets.

Amazon Bedrock Evaluations offers an efficient solution through its LLM-as-a-judge capability, so you can assess AI outputs consistently across linguistic barriers. This approach reduces the time and resources typically required for multilingual evaluations while maintaining high-quality standards.

In this post, we demonstrate how to use the evaluation features of Amazon Bedrock to deliver reliable results across language barriers without the need for localized prompts or custom infrastructure. Through comprehensive testing and analysis, we share practical strategies to help reduce the cost and complexity of multilingual evaluation while maintaining high standards across global large language model (LLM) deployments.

Solution overview

To scale and streamline the evaluation process, we used Amazon Bedrock Evaluations, which offers both automatic and human-based methods for assessing model and RAG system quality. To learn more, see Evaluate the performance of Amazon Bedrock resources.

Automatic evaluations

Amazon Bedrock supports two modes of automatic evaluation: programmatic evaluation with predefined algorithmic metrics, and LLM-as-a-judge evaluation, in which a judge model scores responses. This post focuses on the latter.

For LLM-as-a-judge evaluations, you can choose from a set of built-in metrics or define your own custom metrics tailored to your specific use case. You can run these evaluations on models hosted in Amazon Bedrock or on external models by uploading your own prompt-response pairs.

Human evaluations

For use cases that require subject-matter expert judgment, Amazon Bedrock also supports human evaluation jobs. You can assign evaluations to human experts, and Amazon Bedrock manages task distribution, scoring, and result aggregation.

Human evaluations are especially valuable for establishing a baseline against which automated scores, like those from judge model evaluations, can be compared.

Evaluation dataset preparation

We used the Indonesian split of the SEA-MTBench dataset, which is based on MT-Bench, a widely used benchmark for conversational AI assessment. The Indonesian version was manually translated by native speakers and consists of 58 records covering a diverse range of categories such as math, reasoning, and writing.

We converted multi-turn conversations into single-turn interactions while preserving the preceding turns as context, so each turn could be evaluated independently. This conversion resulted in 116 records for evaluation. Here’s how we approached it:

Original row: {"prompts: [{ "text": "prompt 1"}, {"text": "prompt 2"}]}
Converted into 2 rows in the evaluation dataset:
Human: {prompt 1}nnAssistant: {response 1}
Human: {prompt 1}nnAssistant: {response 1}nnHuman: {prompt 2}nnAssistant: {response 2}
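
The following is a minimal Python sketch of this conversion, assuming records shaped like the example above; the generate_response callable is a placeholder for however you call the model under evaluation.

def flatten_conversations(records, generate_response):
    """Turn each multi-turn record into one evaluation row per turn,
    preserving the earlier turns as context (as in the example above)."""
    rows = []
    for record in records:
        transcript = ""
        for turn in record["prompts"]:
            transcript += f"Human: {turn['text']}\n\n"
            reply = generate_response(transcript)  # candidate model under evaluation
            transcript += f"Assistant: {reply}"
            rows.append(transcript)                # row N contains turns 1..N
            transcript += "\n\n"
    return rows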

For each record, we generated responses using a stronger LLM (Model Strong-A) and a relatively weaker LLM (Model Weak-A). These outputs were later evaluated by both human annotators and LLM judges.

Establishing a human evaluation baseline

To assess evaluation quality, we first established a set of human evaluations as the baseline for comparing LLM-as-a-judge scores. A native-speaking evaluator rated each response from Model Strong-A and Model Weak-A on a 1–5 Likert helpfulness scale, using the same rubric applied in our LLM evaluator prompts.

We conducted manual evaluations on the full evaluation dataset using the human evaluation feature in Amazon Bedrock. Setting up human evaluations in Amazon Bedrock is straightforward: you upload a dataset and define the worker group, and Amazon Bedrock automatically generates the annotation UI and manages the scoring workflow and result aggregation.

The following screenshot shows a sample result from an Amazon Bedrock human evaluation job.

Custom evaluation dashboard showing distribution of 116 ratings across 5-point scale with prompt category filter options

LLM-as-a-judge evaluation setup

We evaluated responses from Model Strong-A and Model Weak-A using four judge models: Model Strong-A, Model Strong-B, Model Weak-A, and Model Weak-B. These evaluations were run using custom metrics in an LLM-as-a-judge evaluation in Amazon Bedrock, which allows flexible prompt definition and scoring without the need to manage your own infrastructure.

Each judge model was given a custom evaluation prompt aligned with the same helpfulness rubric used in the human evaluation. The prompt asked the evaluator to rate each response on a 1–5 Likert scale based on clarity, task completion, instruction adherence, and factual accuracy. We prepared both English and Indonesian versions to support multilingual testing. The following shows the English prompt, followed by its Indonesian counterpart.

English prompt:

You are given a user task and a candidate completion from an AI assistant.
Your job is to evaluate how helpful the completion is — with special attention to whether it follows the user’s instructions and produces the correct or appropriate output.


A helpful response should:


- Accurately solve the task (math, formatting, generation, extraction, etc.)
- Follow all explicit and implicit instructions
- Use appropriate tone, clarity, and structure
- Avoid hallucination, false claims, or harmful implications


Even if the response is well-written or polite, it should be rated low if it:
- Produces incorrect results or misleading explanations
- Fails to follow core instructions
- Makes basic reasoning mistakes


Scoring Guide (1–5 scale):
5 – Very Helpful
The response is correct, complete, follows instructions fully, and could be used directly by the end user with confidence.


4 – Somewhat Helpful
Minor errors, omissions, or ambiguities, but still mostly correct and usable with small modifications or human verification.


3 – Neutral / Mixed
Either (a) the response is generally correct but doesn’t really follow the user’s instruction, or (b) it follows instructions but contains significant flaws that reduce trust.


2 – Somewhat Unhelpful
The response is incorrect or irrelevant in key areas, or fails to follow instructions, but shows some effort or structure.


1 – Very Unhelpful
The response is factually wrong, ignores the task, or shows fundamental misunderstanding or no effort.


Instructions:
You will be shown:
- The user’s task
- The AI assistant’s completion


Evaluate the completion on the scale above, considering both accuracy and instruction-following as primary criteria.


Task:
{{prompt}}


Candidate Completion:
{{prediction}}

Indonesian prompt:

Anda diberikan instruksi dari pengguna beserta jawaban/penyelesaian instruksi tersebut oleh asisten AI.
Tugas Anda adalah mengevaluasi seberapa membantu jawaban tersebut — dengan fokus utama pada apakah jawaban tersebut mengikuti instruksi pengguna dengan benar dan menghasilkan output yang akurat serta sesuai.


Sebuah jawaban dianggap membantu jika:
- Menyelesaikan instruksi dengan akurat (perhitungan matematika, pemformatan, pembuatan konten, ekstraksi data, dll.)
- Mengikuti semua instruksi eksplisit maupun implisit dari pengguna
- Menggunakan nada, kejelasan, dan struktur yang sesuai
- Menghindari halusinasi, klaim yang salah, atau implikasi yang berbahaya


Meskipun jawaban terdengar baik atau sopan, tetap harus diberi nilai rendah jika:
- Memberikan hasil yang salah atau penjelasan yang menyesatkan
- Gagal mengikuti inti dari instruksi pengguna
- Membuat kesalahan penalaran yang mendasar


Panduan Penilaian (Skala 1–5):
5 – Sangat Membantu
Jawaban benar, lengkap, mengikuti instruksi pengguna sepenuhnya, dan dapat langsung digunakan oleh pengguna dengan percaya diri.


4 – Cukup Membantu
Ada sedikit kesalahan, kekurangan, atau ambiguitas, tetapi jawaban secara umum benar dan masih dapat digunakan dengan sedikit perbaikan atau verifikasi manual.


3 – Netral
Baik (a) jawabannya secara umum benar tetapi tidak sepenuhnya mengikuti instruksi pengguna, atau (b) jawabannya mengikuti instruksi tetapi mengandung kesalahan besar yang mengurangi tingkat kepercayaan.


2 – Kurang Membantu
Jawaban salah atau tidak relevan pada bagian-bagian penting, atau tidak mengikuti instruksi pengguna, tetapi masih menunjukkan upaya atau struktur penyelesaian.


1 – Sangat Tidak Membantu
Jawaban salah secara fakta, mengabaikan instruksi pengguna, menunjukkan kesalahpahaman mendasar, atau tidak menunjukkan adanya upaya untuk menyelesaikan instruksi.


Petunjuk penilaian:
Anda akan diberikan:
- Instruksi dari pengguna
- Jawaban dari asisten AI


Evaluasilah jawaban tersebut menggunakan skala di atas, dengan mempertimbangkan akurasi dan kepatuhan terhadap instruksi pengguna sebagai kriteria utama.


Instruksi pengguna:
{{prompt}}


Jawaban asisten AI:
{{prediction}}

To measure alignment, we used two standard metrics:

  • Pearson correlation – Measures the linear relationship between score values. Useful for detecting overall similarity in score trends.
  • Cohen’s kappa (linear weighted) – Captures agreement between evaluators, adjusted for chance. Especially useful for discrete scales like Likert scores.
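
As an illustration, both metrics can be computed with standard Python libraries; the score lists below are placeholders for paired human and judge ratings of the same responses.

from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Placeholder 1-5 Likert scores from two evaluators for the same responses.
human_scores = [5, 4, 3, 5, 2, 4]
judge_scores = [5, 5, 3, 4, 3, 4]

pearson_r, p_value = pearsonr(human_scores, judge_scores)
weighted_kappa = cohen_kappa_score(human_scores, judge_scores, weights="linear")

print(f"Pearson correlation: {pearson_r:.2f} (p={p_value:.3f})")
print(f"Linear weighted Cohen's kappa: {weighted_kappa:.2f}")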

Alignment between LLM judges and human evaluations

We began by comparing the average helpfulness scores given by each evaluator using the English judge prompt. The following chart shows the evaluation results.

Comparative analysis of helpfulness scores between Human evaluator and Models (Strong-A/B, Weak-A/B), showing ratings between 4.11-4.93

When evaluating responses from the stronger model, LLM judges tended to agree with human ratings. But on responses from the weaker model, most LLMs gave noticeably higher scores than humans. This suggests that LLM judges tend to be more generous when response quality is lower.

We designed the evaluation prompt to guide models toward scoring behavior similar to human annotators, but score patterns still showed signs of potential bias. Model Strong-A rated its own outputs highly (4.93), whereas Model Weak-A gave its own responses a higher score than humans did. In contrast, Model Strong-B, which didn’t evaluate its own outputs, gave scores that were closer to human ratings.

To better understand alignment between LLM judges and human preferences, we analyzed Pearson and Cohen’s kappa correlations between them. On responses from Model Weak-A, alignment was strong. Model Strong-A and Model Strong-B achieved Pearson correlations of 0.45 and 0.61, with kappa scores of 0.33 and 0.4.

Alignment between LLM judges and humans on responses from Model Strong-A was more moderate. All evaluators had Pearson correlations between 0.26 and 0.33 and weighted kappa scores between 0.2 and 0.22. This might be due to limited variation in either human or model scores, which reduces the ability to detect strong correlation patterns.

To complete our analysis, we also conducted a qualitative deep dive. Amazon Bedrock makes this straightforward by providing JSONL outputs from each LLM-as-a-judge run that include both the evaluation score and the model’s reasoning. This helped us review evaluator justifications and identify cases where scores were incorrectly extracted or parsed.
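
The sketch below illustrates this kind of review; the JSONL field names (score, reasoning) are assumptions for illustration, so check your own job output for the actual schema.

import json

def review_low_scores(output_path, threshold=3):
    """Collect judge scores at or below a threshold, with the judge's reasoning."""
    flagged = []
    with open(output_path) as f:
        for line in f:
            record = json.loads(line)
            score = record.get("score")          # assumed key name
            reasoning = record.get("reasoning")  # assumed key name
            if score is not None and score <= threshold:
                flagged.append((score, reasoning))
    return flagged

for score, reasoning in review_low_scores("judge_results.jsonl"):
    print(score, "-", reasoning)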

From this review, we identified several factors behind the misalignment between LLM and human judgments:

  • Evaluator capability ceiling – In some cases, especially in reasoning tasks, the LLM evaluator couldn’t solve the original task itself. This made its evaluations flawed and unreliable at identifying whether a response was correct.
  • Evaluation hallucination – In other cases, the LLM evaluator assigned low scores to correct answers not because of reasoning failure, but because it imagined errors or flawed logic in responses that were actually valid.
  • Overriding instructions – Certain models occasionally overrode explicit instructions based on ethical judgment. For example, two evaluator models rated a response that created misleading political campaign content as very unhelpful (even though the response included its own warnings), whereas human evaluators rated it very helpful for following the task.

These problems highlight the importance of using human evaluations as a baseline and performing qualitative deep dives to fully understand LLM-as-a-judge results.

Cross-lingual evaluation capabilities

After analyzing evaluation results from the English judge prompt, we moved to the final step of our analysis: comparing evaluation results between English and Indonesian judge prompts.

We began by comparing overall helpfulness scores and alignment with human ratings. Helpfulness scores remained nearly identical for all models, with most shifts within ±0.05. Alignment with human ratings was also similar: Pearson correlations between human scores and LLM-as-a-judge using Indonesian judge prompts closely matched those using English judge prompts. In statistically meaningful cases, correlation score differences were typically within ±0.1.

To further assess cross-language consistency, we computed Pearson correlation and Cohen’s kappa directly between LLM-as-a-judge evaluation scores generated using English and Indonesian judge prompts on the same response set. The following tables show correlation between scores from Indonesian and English judge prompts for each evaluator LLM, on responses generated by Model Weak-A and Model Strong-A.

The first table summarizes the evaluation of Model Weak-A responses.

Metric               Model Strong-A   Model Strong-B   Model Weak-A   Model Weak-B
Pearson correlation  0.73             0.79             0.64           0.64
Cohen’s kappa        0.59             0.69             0.42           0.49

The next table summarizes the evaluation of Model Strong-A responses.

Metric               Model Strong-A   Model Strong-B   Model Weak-A   Model Weak-B
Pearson correlation  0.41             0.8              0.51           0.7
Cohen’s kappa        0.36             0.65             0.43           0.61

Correlation between evaluation results from both judge prompt languages was strong across all evaluator models. On average, Pearson correlation was 0.65 and Cohen’s kappa was 0.53 across all models.

We also conducted a qualitative review comparing evaluations from both evaluation prompt languages for Model Strong-A and Model Strong-B. Overall, both models showed consistent reasoning across languages in most cases. However, occasional hallucinated errors or flawed logic occurred at similar rates across both languages (we should note that humans make occasional mistakes as well).

One interesting pattern we observed with one of the stronger evaluator models was that it tended to follow the evaluation prompt more strictly in the Indonesian version. For example, it rated a response as unhelpful when it refused to generate misleading political content, even though the task explicitly asked for it. This behavior differed from the English prompt evaluation. In a few cases, it also assigned noticeably stricter scores than the English evaluator prompt even though the reasoning in both languages was similar, which more closely matched how humans typically evaluated.

These results confirm that although prompt translation remains a useful option, it is not required to achieve consistent evaluation. You can rely on English evaluator prompts even for non-English outputs, for example by using Amazon Bedrock LLM-as-a-judge predefined and custom metrics to make multilingual evaluation simpler and more scalable.

Takeaways

The following are key takeaways for building a robust LLM evaluation framework:

  • LLM-as-a-judge is a practical evaluation method – It offers faster, cheaper, and scalable assessments while maintaining reasonable judgment quality across languages. This makes it suitable for large-scale deployments.
  • Choose a judge model based on practical evaluation needs – Across our experiments, stronger models aligned better with human ratings, especially on weaker outputs. However, even top models can misjudge harder tasks or show self-bias. Use capable, neutral evaluators to facilitate fair comparisons.
  • Manual human evaluations remain essential – Human evaluations provide the reference baseline for benchmarking automated scoring and understanding model judgment behavior.
  • Prompt design meaningfully shapes evaluator behavior – Aligning your evaluation prompt with how humans actually score improves quality and trust in LLM-based evaluations.
  • Translated evaluation prompts are helpful but not required – English evaluator prompts reliably judge non-English responses, especially for evaluator models that support multilingual input.
  • Always be ready to deep dive with qualitative analysis – Reviewing evaluation disagreements by hand helps uncover hidden model behaviors and makes sure that statistical metrics tell the full story.
  • Simplify your evaluation workflow using Amazon Bedrock evaluation features – Amazon Bedrock built-in human evaluation and LLM-as-a-judge evaluation capabilities simplify iteration and streamline your evaluation workflow.

Conclusion

Through our experiments, we demonstrated that LLM-as-a-judge evaluations can deliver consistent and reliable results across languages, even without prompt translation. With properly designed evaluation prompts, LLMs can maintain high alignment with human ratings regardless of evaluator prompt language. Though we focused on Indonesian, the results indicate similar techniques are likely effective for other non-English languages, but you are encouraged to assess for yourself on any language you choose. This reduces the need to create localized evaluation prompts for every target audience.

To level up your evaluation practices, consider the following ways to extend your approach beyond foundation model scoring:

  • Evaluate your Retrieval Augmented Generation (RAG) pipeline, assessing not just LLM responses but also retrieval quality using Amazon Bedrock RAG evaluation capabilities
  • Evaluate and monitor continuously, and run evaluations before production launch, during live operation, and ahead of any major system upgrades

Begin your cross-lingual evaluation journey today with Amazon Bedrock Evaluations and scale your AI solutions confidently across global landscapes.


About the authors

Riza Saputra is a Senior Solutions Architect at AWS, working with startups of all stages to help them grow securely, scale efficiently, and innovate faster. His current focus is on generative AI, guiding organizations in building and scaling AI solutions securely and efficiently. With experience across roles, industries, and company sizes, he brings a versatile perspective to solving technical and business challenges. Riza also shares his knowledge through public speaking and content to support the broader tech community.

Read More

Cohere Embed 4 multimodal embeddings model is now available on Amazon SageMaker JumpStart

This post is co-written with Payal Singh from Cohere.

The Cohere Embed 4 multimodal embeddings model is now generally available on Amazon SageMaker JumpStart. The Embed 4 model is built for multimodal business documents, has leading multilingual capabilities, and offers notable improvement over Embed 3 across key benchmarks.

In this post, we discuss the benefits and capabilities of this new model. We also walk you through how to deploy and use the Embed 4 model using SageMaker JumpStart.

Cohere Embed 4 overview

Embed 4 is the most recent addition to the Cohere Embed family of enterprise-focused large language models (LLMs). It delivers state-of-the-art multimodality. This is useful because businesses continue to store the majority of important data in an unstructured format. Document formats include intricate PDF reports, presentation slides, as well as text-based documents or design files that might include images, tables, graphs, code, and diagrams. Without the ability to natively understand complex multimodal documents, these types of documents become repositories of unsearchable information. With Embed 4, enterprises and their employees can search across text, image, and multimodal documents. Embed 4 also offers leading multilingual capabilities, understanding over 100 languages, including Arabic, French, Japanese, and Korean. This capability is useful to global enterprises that handle documents in multiple languages. Employees can also find critical data even if the information isn’t stored using a language they speak. Overall, Embed 4 empowers global enterprises to break down language barriers and manage information in the languages most familiar to their customers.

In the following diagram (source), each language category represents a blend of public and proprietary benchmarks (see more details). Tasks ranged from monolingual to cross-lingual (English as the query language and the respective monolingual non-English language as the corpus). Dataset performance metrics are measured by NDCG@10.

Embeddings models are already being used to handle documents with both text and images. However, optimal performance usually requires additional complexity because a multimodal generative model must preprocess documents into a format that is suitable for the embeddings model. Embed 4 can transform different modalities such as images, texts, and interleaved images and texts into a single vector representation. Processing a single payload of images and text decreases the operational burden associated with handling documents.

Embed 4 can also generate embeddings for documents up to 128,000 tokens, approximately 200 pages in length. This extended capacity alleviates the need for custom logic to split lengthy documents, making it straightforward to process financial reports, product manuals, and detailed legal contracts. In contrast, models with shorter context lengths force developers to create complex workflows to split documents while preserving their logical structure. With Embed 4, as long as a document fits within the 128,000-token limit, it can be converted into a high-quality, unified vector representation.

Embed 4 also has enhancements for security-minded industries such as finance, healthcare, and manufacturing. Businesses in these regulated industries need models that have both strong general business knowledge as well as domain-specific understanding. Business data also tends to be imperfect. Documents often come with spelling mistakes and formatting issues such as improper portrait and landscape orientation. Embed 4 was trained to be robust against noisy real-world data that also includes scanned documents and handwriting. This further alleviates the need for complex and expensive data preprocessing pipelines.

Use cases

In this section, we discuss several possible use cases for Embed 4. Embed 4 unlocks a range of capabilities for enterprises seeking to streamline information discovery, enhance generative AI workflows, and optimize storage efficiency. Below, we highlight several key use cases that demonstrate the versatility of Embed 4 in a range of regulated industries.

Simplifying multimodal search

You can take advantage of Embed 4 multimodal capabilities in use cases that require precise semantic search. For example, in the retail industry, it might be helpful to search with both text and image. An example search phrase can even include a modifier (for example, “Same style pants but with no stripes”). The same logic can be applied to an analyst’s workflow where users might need to find the right charts and diagrams to explain trends. This is traditionally a time-consuming process that requires manual effort to sift through documents and contextualize information. Because Embed 4 has enhancements for finance, healthcare, and manufacturing, users in these industries can take advantage of built-in domain-specific understanding as well as strong general knowledge to accelerate the time to value. With Embed 4, it’s straightforward to conduct research and turn data into actionable insights.

Powering Retrieval Augmented Generation workflows

Another use case is Retrieval Augmented Generation (RAG) applications that require access to internal information. With the 128,000 context length of Embed 4, businesses can use existing long-form documents that include images without the need to implement complex preprocessing pipelines. An example might be a generative AI application built to assist analysts with M&A due diligence. Having access to a broader repository of information increases the likelihood of making informed decisions.

Optimizing agentic AI workflows with compressed embeddings

Businesses can use intelligent AI agents to reduce unnecessary costs, automate repetitive tasks, and reduce human errors. AI agents need relevant and contextual information to perform tasks accurately, which is typically provided through RAG. The generative model that powers the conversational experience relies on a search engine connected to company data sources to retrieve relevant information before producing the final response. For example, an agent might need to extract relevant conversation logs to analyze customer sentiment about a specific product and deduce the most effective next step in a customer interaction.

Embed 4 is the optimal search engine for enterprise AI assistants and agents, which improves the accuracy of responses and mitigates against hallucinations.

At scale, storing embeddings can lead to high storage costs. Embed 4 is designed to output compressed embeddings, where users can choose their own dimension size (for example, 256, 512, 1024, or 1536). This helps organizations save up to 83% on storage costs while maintaining search accuracy.
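
As a rough illustration of how dimension choice alone drives storage savings, moving from 1536 to 256 dimensions at the same float32 precision shrinks each vector by about 83% (lower-precision formats such as int8 or binary reduce it further):

# Back-of-the-envelope vector storage comparison (float32 = 4 bytes per dimension).
full_size = 1536 * 4     # 6,144 bytes per embedding at 1536 dimensions
compact_size = 256 * 4   # 1,024 bytes per embedding at 256 dimensions
savings = 1 - compact_size / full_size
print(f"Storage reduction: {savings:.1%}")   # ~83.3%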

The following diagram illustrates retrieval quality vs. storage cost across different models (source). Compression can occur on the format precision of the vectors (binary, int8, and fp32) and the dimension of the vectors. Dataset performance metrics are measured by NDCG@10.

Using domain-specific understanding for regulated industries

With Embed 4 enhancements, you can surface relevant insights from complex financial documents like investor presentations, annual reports, and M&A due diligence files. Embed 4 can also extract key information from healthcare documents such as medical records, procedural charts, and clinical trial reports. For manufacturing use cases, Embed 4 can handle product specifications, repair guides, and supply chain plans. These capabilities unlock a broader range of use cases because enterprises can use models out of the box without costly fine-tuning efforts.

Solution overview

SageMaker JumpStart onboards and maintains foundation models (FMs) for you to access and integrate into machine learning (ML) lifecycles. The FMs available in SageMaker JumpStart include publicly available FMs as well as proprietary FMs from third-party providers.

Amazon SageMaker AI is a fully managed ML service. It helps data scientists and developers quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. Amazon SageMaker Studio is a single web-based experience for running ML workflows. It provides access to your SageMaker AI resources in one interface.

In the following sections, we show how to get started with Cohere Embed 4.

Prerequisites

Make sure you meet the following prerequisites:

  • Make sure your SageMaker AWS Identity and Access Management (IAM) role has the AmazonSageMakerFullAccess permission policy attached.
  • To deploy Cohere Embed 4 successfully, confirm that your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account has a subscription to the model. If so, skip to the next section in this post.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. In the AWS Marketplace listing, choose Continue to Subscribe.
  2. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  3. Choose Continue to configuration and then choose an AWS Region.

You will see a product Amazon Resource Name (ARN) displayed. This is the model package ARN that you must specify while creating a deployable model using Boto3.

Deploy Cohere Embed 4 for inference through SageMaker JumpStart

If you want to start using Embed 4 immediately, you can choose from three available launch methods: the AWS CloudFormation template, the SageMaker console, or the AWS Command Line Interface (AWS CLI). You will incur costs for software use based on hourly pricing as long as your endpoint is running. You will also incur costs for infrastructure use, independent of and in addition to the software costs.

Choose the appropriate model package ARN for your Region. For example, the ARN for Cohere Embed 4 is:

 arn:aws:sagemaker:[REGION]:[ACCOUNT_ID]:model-package/cohere-embed-v4-1-04072025-17ec0571acd93686b6cfb44babe01d66
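
If you prefer to deploy programmatically instead of through the console, the following sketch uses the SageMaker Python SDK with the model package ARN above; the endpoint name and instance type are assumptions, so use the instance type recommended on the listing.

import sagemaker
from sagemaker import ModelPackage

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # or pass an IAM role ARN explicitly

# Substitute the model package ARN for your Region.
model_package_arn = "arn:aws:sagemaker:[REGION]:[ACCOUNT_ID]:model-package/cohere-embed-v4-1-04072025-17ec0571acd93686b6cfb44babe01d66"

model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=session,
)

# Instance type is illustrative; choose one recommended by the listing.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="cohere-embed-4",
)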

Alternatively, in SageMaker Studio, open JumpStart and search for the Cohere Embed 4 model. If you don’t yet have a domain, refer to Guide to getting set up with Amazon SageMaker AI to create one. Deployment starts when you choose Deploy.

When deployment is complete, an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

To use the Python SDK example code, choose Test inference and Open in JupyterLab. If you don’t have a JupyterLab space yet, refer to Create a space.
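
Once the endpoint is in service, you can also invoke it with the SageMaker runtime client. The payload below follows the general shape of Cohere's embed request, but the exact fields expected by Embed 4 may differ, so treat them as assumptions and confirm against the sample notebook that accompanies the listing.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Field names are illustrative; confirm the schema in the listing's sample notebook.
payload = {
    "texts": ["What were the Q3 revenue drivers?"],
    "input_type": "search_query",
}

response = runtime.invoke_endpoint(
    EndpointName="cohere-embed-4",   # the endpoint name used at deployment
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(list(result.keys()))           # inspect the response structure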

Clean up

After you finish running the notebook and experimenting with the Embed 4 model, it’s crucial to clean up the resources you have provisioned. Failing to do so might result in unnecessary charges accruing on your account. To use the SageMaker AI console, complete the following steps:

  1. On the SageMaker AI console, under Inference in the navigation pane, choose Endpoints.
  2. Choose the endpoint you created.
  3. On the endpoint details page, choose Delete.
  4. Choose Delete again to confirm.
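
You can perform the same cleanup programmatically; this sketch assumes the endpoint name used earlier and removes the endpoint, its configuration, and the underlying model.

import boto3

sm = boto3.client("sagemaker")
endpoint_name = "cohere-embed-4"  # the endpoint you created

# Look up the endpoint configuration and model names before deleting the endpoint.
endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
config_name = endpoint["EndpointConfigName"]
config = sm.describe_endpoint_config(EndpointConfigName=config_name)
model_names = [v["ModelName"] for v in config["ProductionVariants"]]

sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=config_name)
for name in model_names:
    sm.delete_model(ModelName=name)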

Conclusion

In this post, we explored how Cohere Embed 4, now available on SageMaker JumpStart, delivers state-of-the-art multimodal embedding capabilities. These capabilities make it particularly valuable for enterprises working with unstructured data across finance, healthcare, manufacturing, and other regulated industries.

Interested in diving deeper? Check out the Cohere on AWS GitHub repo.


About the authors

James Yi is a Senior AI/ML Partner Solutions Architect at AWS. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Karan Singh is a Generative AI Specialist at AWS, where he works with top-tier third-party foundation model and agentic frameworks providers to develop and execute joint go-to-market strategies, enabling customers to effectively deploy and scale solutions to solve enterprise generative AI challenges.

Mehran Najafi, PhD, serves as AWS Principal Solutions Architect and leads the Generative AI Solution Architects team for AWS Canada. His expertise lies in ensuring the scalability, optimization, and production deployment of multi-tenant generative AI solutions for enterprise customers.

John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.

Hugo Tse is a Solutions Architect at AWS, with a focus on Generative AI and Storage solutions. He is dedicated to empowering customers to overcome challenges and unlock new business opportunities using technology. He holds a Bachelor of Arts in Economics from the University of Chicago and a Master of Science in Information Technology from Arizona State University.

Payal Singh is a Solutions Architect at Cohere with over 15 years of cross-domain expertise in DevOps, Cloud, Security, SDN, Data Center Architecture, and Virtualization. She drives partnerships at Cohere and helps customers with complex GenAI solution integrations.

Read More

How INRIX accelerates transportation planning with Amazon Bedrock

This post is co-written with Shashank Saraogi, Nat Gale, and Durran Kelly from INRIX.

The complexity of modern traffic management extends far beyond mere road monitoring, encompassing massive amounts of data collected worldwide from connected cars, mobile devices, roadway sensors, and major event monitoring systems. For transportation authorities managing urban, suburban, and rural traffic flow, the challenge lies in effectively processing and acting upon this vast network of information. The task requires balancing immediate operational needs, such as real-time traffic redirection during incidents, with strategic long-term planning for improved mobility and safety.

Traditionally, analyzing these complex data patterns and producing actionable insights has been a resource-intensive process requiring extensive collaboration. With recent advances in generative AI, there is an opportunity to transform how we process, understand, and act upon transportation data, enabling more efficient and responsive traffic management systems.

In this post, we partnered with Amazon Web Services (AWS) customer INRIX to demonstrate how Amazon Bedrock can be used to determine the best countermeasures for specific city locations using rich transportation data and how such countermeasures can be automatically visualized in street view images. This approach allows for significant planning acceleration compared to traditional approaches using conceptual drawings.

INRIX pioneered the use of GPS data from connected vehicles for transportation intelligence. For over 20 years, INRIX has been a leader for probe-based connected vehicle and device data and insights, powering automotive, enterprise, and public sector use cases. INRIX’s products range from tickerized datasets that inform investment decisions for the financial services sector to digital twins for the public rights-of-way in the cities of Philadelphia and San Francisco. INRIX was the first company to develop a crowd-sourced traffic network, and they continue to lead in real-time mobility operations.

In June 2024, the State of California’s Department of Transportation (Caltrans) selected INRIX for a proof of concept for a generative AI-powered solution to improve safety for vulnerable road users (VRUs). The problem statement sought to harness the combination of Caltrans’ asset, crash, and points-of-interest (POI) data and INRIX’s 50 petabyte (PB) data lake to anticipate high-risk locations and quickly generate empirically validated safety measures to mitigate the potential for crashes. Trained on real-time and historical data and industry research and manuals, the solution provides a new systemic, safety-based methodology for risk assessment, location prioritization, and project implementation.

Solution overview

INRIX announced INRIX Compass in November 2023. INRIX Compass is an application that harnesses generative AI and INRIX’s 50 PB data lake to solve transportation challenges. The solution in this post builds on three key components: INRIX Compass countermeasures as the input, AWS serverless architecture, and Amazon Nova Canvas as the image visualizer.

The following diagram shows the architecture of INRIX Compass.

INRIX Compass for countermeasures

By using INRIX Compass, users can ask natural language queries such as “Where are the top five locations with the highest risk for vulnerable road users?” and “Can you recommend a suite of proven safety countermeasures at each of these locations?” Furthermore, users can probe deeper into the roadway characteristics that contribute to risk factors, and find similar locations in the roadway network that meet those conditions. Behind the scenes, Compass AI uses RAG and Amazon Bedrock-powered foundation models (FMs) to query the roadway network to identify and prioritize locations with systemic risk factors and anomalous safety patterns. The solution provides prioritized recommendations for operational and design solutions and countermeasures based on industry knowledge.

The following image shows the interface of INRIX Compass.

Image visualization for countermeasures

The generation of countermeasure suggestions represents the initial phase in transportation planning. The crucial next step is visualizing those countermeasures in conceptual drawings. This process has traditionally been time-consuming due to the involvement of multiple specialized teams, including:

  • Transportation engineers who assess technical feasibility and safety standards
  • Urban planners who verify alignment with city development goals
  • Landscape architects who integrate environmental and aesthetic elements
  • CAD or visualization specialists who create detailed technical drawings
  • Safety analysts who evaluate the potential impact on road safety
  • Public works departments who oversee implementation feasibility
  • Traffic operations teams who assess impact on traffic flow and management

These teams work collaboratively, creating and iteratively refining visualizations based on feedback from urban designers and other stakeholders. Each iteration cycle typically involves multiple rounds of reviews, adjustments, and approvals, often extending the timeline significantly. The complexity is further amplified by city-specific rules and design requirements, which often necessitate significant customization. Additionally, local regulations, environmental considerations, and community feedback must be incorporated into the design process. Consequently, this lengthy and costly process frequently delays the implementation of safety countermeasures.

To address this challenge, INRIX has pioneered an innovative approach to the visualization phase using generative AI. This prototyped solution enables rapid iteration of conceptual drawings that can be efficiently reviewed by the various teams, potentially reducing the design cycle from weeks to days. Moreover, the system incorporates a few-shot learning approach with reference images and carefully crafted prompts, allowing city-specific requirements to be seamlessly integrated into the generated outputs. This approach not only accelerates the design process but also supports consistency across projects while maintaining compliance with local standards.

The following image shows the congestion insights by INRIX Compass.

Amazon Nova Canvas for conceptual visualizations

INRIX developed and prototyped this solution using Amazon Nova models. Amazon Nova Canvas delivers advanced image processing through text-to-image generation and image-to-image transformation capabilities. The model provides sophisticated controls for adjusting color schemes and manipulating layouts to achieve desired visual outcomes. To promote responsible AI implementation, Amazon Nova Canvas incorporates built-in safety measures, including watermarking and content moderation systems.

The model supports a comprehensive range of image editing operations. These operations encompass basic image generation, object removal from existing images, object replacement within scenes, creation of image variations, and modification of image backgrounds. This versatility makes Amazon Nova Canvas suitable for a wide range of professional applications requiring sophisticated image editing.

The following sample images show an example of countermeasures visualization.

In-painting implementation in Compass AI

Amazon Nova Canvas integrates with INRIX Compass’s existing natural language analytics capabilities. The original Compass system generated text-based countermeasure recommendations based on:

  • Historical transportation data analysis
  • Current environmental conditions
  • User-specified requirements

The INRIX Compass visualization feature specifically uses the image generation and in-painting capabilities of Amazon Nova Canvas. In-painting enables object replacement through two distinct approaches:

  • A binary mask precisely defines the areas targeted for replacement.
  • Text prompts identify objects for replacement, allowing the model to interpret and modify the specified elements while maintaining visual coherence with the surrounding image context.

This functionality provides seamless integration of new elements while preserving the overall image composition and contextual relevance. The developed interface accommodates both image generation and in-painting approaches, providing comprehensive image editing capabilities.

The implementation follows a two-stage process for visualizing transportation countermeasures. Initially, the system employs image generation functionality to create street-view representations corresponding to specific longitude and latitude coordinates where interventions are proposed. Following the initial image creation, the in-painting capability enables precise placement of countermeasures within the generated street view scene. This sequential approach provides accurate visualization of proposed modifications within the actual geographical context.

An Amazon Bedrock API facilitates image editing and generation through the Amazon Nova Canvas model. The responses contain the generated or modified images in base64 format, which can be decoded and processed for further use in the application. The generative AI capabilities of Amazon Bedrock enable rapid iteration and simultaneous visualization of multiple countermeasures within a single image. RAG implementation can further extend the pipeline’s capabilities by incorporating county-specific regulations, standardized design patterns, and contextual requirements. The integration of these technologies significantly streamlines the countermeasure deployment workflow. Traditional manual visualization processes that previously required extensive time and resources can now be executed efficiently through automated generation and modification. This automation delivers substantial improvements in both time-to-deployment and cost-effectiveness.
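
The following is a minimal sketch of such a call with the Bedrock Runtime API; it follows the general Amazon Nova Canvas request format for in-painting, but the specific parameter values (mask prompt, countermeasure description, file names) are illustrative assumptions, so verify the schema against the current Nova Canvas documentation.

import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Base street-view image to edit (generated earlier or loaded from disk).
with open("street_view.png", "rb") as f:
    source_image = base64.b64encode(f.read()).decode("utf-8")

# In-painting request: maskPrompt identifies the area to replace, and text
# describes the countermeasure to render. Values here are illustrative.
request = {
    "taskType": "INPAINTING",
    "inPaintingParams": {
        "image": source_image,
        "maskPrompt": "the crosswalk area at the intersection",
        "text": "a high-visibility continental crosswalk with a pedestrian refuge island",
    },
    "imageGenerationConfig": {"numberOfImages": 1, "cfgScale": 8.0, "seed": 42},
}

response = bedrock.invoke_model(
    modelId="amazon.nova-canvas-v1:0",
    body=json.dumps(request),
)
result = json.loads(response["body"].read())
with open("street_view_with_countermeasure.png", "wb") as f:
    f.write(base64.b64decode(result["images"][0]))  # decoded edited image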

Conclusion

The partnership between INRIX and AWS showcases the transformative potential of AI in solving complex transportation challenges. By using Amazon Bedrock FMs, INRIX has turned their massive 50 PB data lake into actionable insights through effective visualization solutions. This post highlighted a single specific transportation use case, but Amazon Bedrock and Amazon Nova power a wide spectrum of applications, from text generation to video creation. The combination of extensive data and advanced AI capabilities continues to pave the way for smarter, more efficient transportation systems worldwide.

For more information, check out the documentation for Amazon Nova Foundation Models, Amazon Bedrock, and INRIX Compass.


About the authors

Arun is a Senior Solutions Architect at AWS, supporting enterprise customers in the Pacific Northwest. He’s passionate about solving business and technology challenges as an AWS customer advocate, with his recent interest being AI strategy. When not at work, Arun enjoys listening to podcasts, going for short trail runs, and spending quality time with his family.

Alicja Kwasniewska, PhD, is an AI leader driving generative AI innovations in enterprise solutions and decision intelligence for customer engagements in North America, advertisement and marketing verticals at AWS. She is recognized among the top 10 women in AI and 100 women in data science. Alicja published in more than 40 peer-reviewed publications. She also serves as a reviewer for top-tier conferences, including ICML,NeurIPS,and ICCV. She advises organizations on AI adoption, bridging research and industry to accelerate real-world AI applications.

Shashank is the VP of Engineering at INRIX, where he leads multiple verticals, including generative AI and traffic. He is passionate about using technology to make roads safer for drivers, bikers, and pedestrians every day. Prior to working at INRIX, he held engineering leadership roles at Amazon and Lyft. Shashank brings deep experience in building impactful products and high-performing teams at scale. Outside of work, he enjoys traveling, listening to music, and spending time with his family.

Nat Gale is the Head of Product at INRIX, where he manages the Safety and Traffic product verticals. Nat leads the development of data products and software that help transportation professionals make smart, more informed decisions. He previously ran the City of Los Angeles’ Vision Zero program and was the Director of Capital Projects and Operations for the City of Hartford, CT.

Durran is a Lead Software Engineer at INRIX, where he designs scalable backend systems and mentors engineers across multiple product lines. With over a decade of experience in software development, he specializes in distributed systems, generative AI, and cloud infrastructure. Durran is passionate about writing clean, maintainable code and sharing best practices with the developer community. Outside of work, he enjoys spending quality time with his family and deepening his Japanese language skills.

Read More

Qwen3 family of reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

Today, we are excited to announce that Qwen3, the latest generation of large language models (LLMs) in the Qwen family, is available through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can deploy the Qwen3 models—available in 0.6B, 4B, 8B, and 32B parameter sizes—to build, experiment, and responsibly scale your generative AI applications on AWS.

In this post, we demonstrate how to get started with Qwen3 on Amazon Bedrock Marketplace and SageMaker JumpStart. You can follow similar steps to deploy the distilled versions of the models as well.

Solution overview

Qwen3 is the latest generation of LLMs in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:

  • Unique support for seamless switching between thinking mode and non-thinking mode within a single model, providing optimal performance across various scenarios.
  • Significantly enhanced in its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
  • Good human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
  • Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open source models in complex agent-based tasks.
  • Support for over 100 languages and dialects with strong capabilities for multilingual instruction following and translation.

Prerequisites

To deploy Qwen3 models, make sure you have access to the recommended instance types based on the model size. You can find these instance recommendations on Amazon Bedrock Marketplace or the SageMaker JumpStart console. To verify you have the necessary resources, complete the following steps:

  1. Open the Service Quotas console.
  2. Under AWS Services, select Amazon SageMaker.
  3. Check that you have sufficient quota for the required instance type for endpoint deployment.
  4. Make sure at least one of these instance types is available in your target AWS Region.

If needed, request a quota increase and contact your AWS account team for support.

Deploy Qwen3 in Amazon Bedrock Marketplace

Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access Qwen3 in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
  2. Filter for Hugging Face as a provider and choose a Qwen3 model. For this example, we use the Qwen3-32B model.

The Amazon Bedrock model catalog displays Qwen3 text generation models. The interface includes a left navigation panel with filters for model collection, providers, and modality, while the main content area shows model cards with deployment information.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Qwen3-32B in your applications.

  3. To begin using Qwen3-32B, choose Deploy.

The details page displays comprehensive information about the Qwen3 32B model, including its version, delivery method, release date, model ID, and deployment status. The interface includes deployment options and playground access.

You will be prompted to configure the deployment details for Qwen3-32B. The model ID will be pre-populated.

  4. For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
  5. For Number of instances, enter a number of instances (between 1–100).
  6. For Instance type, choose your instance type. For optimal performance with Qwen3-32B, a GPU-based instance type like ml.g5.12xlarge is recommended.
  7. To deploy the model, choose Deploy.

The deployment configuration page displays essential settings for hosting a Bedrock model endpoint in SageMaker. It includes fields for Model ID, Endpoint name, Number of instances, and Instance type selection.

When the deployment is complete, you can test Qwen3-32B’s capabilities directly in the Amazon Bedrock playground.

  8. Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with any Amazon Bedrock APIs, you must have the endpoint Amazon Resource Name (ARN).

Enable reasoning and non-reasoning responses with Converse API

The following code shows how to turn reasoning on and off with Qwen3 models using the Converse API, depending on your use case. By default, reasoning is left on for Qwen3 models, but you can streamline interactions by using the /no_think command within your prompt. When you add this to the end of your query, reasoning is turned off and the models will provide just the direct answer. This is particularly useful when you need quick information without explanations, are familiar with the topic, or want to maintain a faster conversational flow. At the time of writing, the Converse API doesn’t support tool use for Qwen3 models. Refer to the Invoke_Model API example later in this post to learn how to use reasoning and tools in the same completion.

import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

# Configuration
model_id = ""  # Replace with Bedrock Marketplace endpoint arn

# Start a conversation with the user message.
user_message = "hello, what is 1+1 /no_think" #remove /no_think to leave default reasoning on
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

try:
    # Send the message to the model, using a basic inference configuration.
    response = client.converse(
        modelId=model_id,
        messages=conversation,
        inferenceConfig={"maxTokens": 512, "temperature": 0.5, "topP": 0.9},
    )

    # Extract and print the response text (and reasoning, when thinking is enabled).
    # response_text = response["output"]["message"]["content"][0]["text"]
    # reasoning_text = response["output"]["message"]["content"][1]["reasoningContent"]["reasoningText"]["text"]
    # print(response_text, reasoning_text)
    print(response)
    
except (ClientError, Exception) as e:
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    exit(1)

The following is a response using the Converse API, without default thinking:

{'ResponseMetadata': {'RequestId': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:34:47 GMT', 'content-type': 'application/json', 'content-length': '282', 'connection': 'keep-alive', 'x-amzn-requestid': 'f7f3953a-5747-4866-9075-fd4bd1cf49c4'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The result of 1 + 1 is **2**. 😊'}, {'reasoningContent': {'reasoningText': {'text': '\n\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 20, 'outputTokens': 22, 'totalTokens': 42}, 'metrics': {'latencyMs': 1125}}

The following is an example with default thinking on; the <think> tokens are automatically parsed into the reasoningContent field for the Converse API:

{'ResponseMetadata': {'RequestId': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:32:28 GMT', 'content-type': 'application/json', 'content-length': '1019', 'connection': 'keep-alive', 'x-amzn-requestid': 'b6d2ebbe-89da-4edc-9a3a-7cb3e7ecf066'}, 'RetryAttempts': 0}, 'output': {'message': {'role': 'assistant', 'content': [{'text': '\n\nHello! The sum of 1 + 1 is **2**. Let me know if you have any other questions or need further clarification! 😊'}, {'reasoningContent': {'reasoningText': {'text': '\nOkay, the user asked "hello, what is 1+1". Let me start by acknowledging their greeting. They might just be testing the water or actually need help with a basic math problem. Since it's 1+1, it's a very simple question, but I should make sure to answer clearly. Maybe they're a child learning math for the first time, or someone who's not confident in their math skills. I should provide the answer in a friendly and encouraging way. Let me confirm that 1+1 equals 2, and maybe add a brief explanation to reinforce their understanding. I can also offer further assistance in case they have more questions. Keeping it conversational and approachable is key here.\n'}}}]}}, 'stopReason': 'end_turn', 'usage': {'inputTokens': 16, 'outputTokens': 182, 'totalTokens': 198}, 'metrics': {'latencyMs': 7805}}

Perform reasoning and function calls in the same completion using the Invoke_Model API

With Qwen3, you can stream an explicit trace and the exact JSON tool call in the same completion. Up until now, reasoning models have forced the choice to either show the chain of thought or call tools deterministically. The following code shows an example:

import json

# Reuses the Bedrock Runtime client and model_id from the previous example.
body = json.dumps({
    "messages": [
        {
            "role": "user",
            "content": "Hi! How are you doing today?"
        }, 
        {
            "role": "assistant",
            "content": "I'm doing well! How can I help you?"
        }, 
        {
            "role": "user",
            "content": "Can you tell me what the temperate will be in Dallas, in fahrenheit?"
        }
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type":
                            "string",
                        "description":
                            "The city to find the weather for, e.g. 'San Francisco'"
                    },
                    "state": {
                        "type":
                            "string",
                        "description":
                            "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                    },
                    "unit": {
                        "type": "string",
                        "description":
                            "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city", "state", "unit"]
            }
        }
    }],
    "tool_choice": "auto"
})

response = client.invoke_model(
    modelId=model_id, 
    body=body
)
print(response)
model_output = json.loads(response['body'].read())
print(json.dumps(model_output, indent=2))

Response:

{'ResponseMetadata': {'RequestId': '5da8365d-f4bf-411d-a783-d85eb3966542', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 17 Jun 2025 18:57:38 GMT', 'content-type': 'application/json', 'content-length': '1148', 'connection': 'keep-alive', 'x-amzn-requestid': '5da8365d-f4bf-411d-a783-d85eb3966542', 'x-amzn-bedrock-invocation-latency': '6396', 'x-amzn-bedrock-output-token-count': '148', 'x-amzn-bedrock-input-token-count': '198'}, 'RetryAttempts': 0}, 'contentType': 'application/json', 'body': <botocore.response.StreamingBody object at 0x7f7d4a598dc0>}
{
  "id": "chatcmpl-bc60b482436542978d233b13dc347634",
  "object": "chat.completion",
  "created": 1750186651,
  "model": "lmi",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "nOkay, the user is asking about the weather in San Francisco. Let me check the tools available. There's a get_weather function that requires location and unit. The user didn't specify the unit, so I should ask them if they want Celsius or Fahrenheit. Alternatively, maybe I can assume a default, but since the function requires it, I need to include it. I'll have to prompt the user for the unit they prefer.n",
        "content": "nnThe user hasn't specified whether they want the temperature in Celsius or Fahrenheit. I need to ask them to clarify which unit they prefer.nn",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-fb2f93f691ed4d8ba94cadc52b57414e",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{"location": "San Francisco, CA", "unit": "celsius"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 198,
    "total_tokens": 346,
    "completion_tokens": 148,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Deploy Qwen3-32B with SageMaker JumpStart

SageMaker JumpStart is a machine learning (ML) hub with FMs, built-in algorithms, and prebuilt ML solutions that you can deploy with just a few clicks. With SageMaker JumpStart, you can customize pre-trained models to your use case, with your data, and deploy them into production using either the UI or SDK. Deploying the Qwen3-32B model through SageMaker JumpStart offers two convenient approaches: using the intuitive SageMaker JumpStart UI or implementing programmatically through the SageMaker Python SDK. Let’s explore both methods to help you choose the approach that best suits your needs.

Deploy Qwen3-32B through SageMaker JumpStart UI

Complete the following steps to deploy Qwen3-32B using SageMaker JumpStart:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. First-time users will be prompted to create a domain.
  3. On the SageMaker Studio console, choose JumpStart in the navigation pane.

The model browser displays available models, with details like the provider name and model capabilities.

The SageMaker Studio Public Hub interface displays a grid of AI model providers, including Meta, DeepSeek, HuggingFace, and AWS, each showing their model counts and Bedrock integration status. The page includes a navigation sidebar and search functionality.

  4. Search for Qwen3 to view the Qwen3-32B model card.

Each model card shows key information, including:

  • Model name
  • Provider name
  • Task category (for example, Text Generation)
  • Bedrock Ready badge (if applicable), indicating that this model can be registered with Amazon Bedrock, so you can use Amazon Bedrock APIs to invoke the model

The SageMaker interface shows search results for "qwen3" displaying four text generation models from Qwen, each marked as Bedrock ready. Models range from 0.6B to 32B in size with consistent formatting and capabilities.

  5. Choose the model card to view the model details page.

The model details page includes the following information:

  • The model name and provider information
  • A Deploy button to deploy the model
  • About and Notebooks tabs with detailed information

The About tab includes important details, such as:

  • Model description
  • License information
  • Technical specifications
  • Usage guidelines

Screenshot of the SageMaker Studio interface displaying details about the Qwen3 32B language model, including its main features, capabilities, and deployment options. The interface shows tabs for About and Notebooks, with action buttons for Train, Deploy, Optimize, and Evaluate.

Before you deploy the model, it’s recommended to review the model details and license terms to confirm compatibility with your use case.

  6. Choose Deploy to proceed with deployment.
  7. For Endpoint name, use the automatically generated name or create a custom one.
  8. For Instance type, choose an instance type (default: ml.g6.12xlarge).
  9. For Initial instance count, enter the number of instances (default: 1).

Selecting appropriate instance types and counts is crucial for cost and performance optimization. Monitor your deployment to adjust these settings as needed. Under Inference type, Real-time inference is selected by default. This is optimized for sustained traffic and low latency.

  10. Review all configurations for accuracy. For this model, we strongly recommend adhering to SageMaker JumpStart default settings and making sure that network isolation remains in place.
  11. Choose Deploy to deploy the model.

A deployment configuration screen in SageMaker Studio showing endpoint settings, instance type selection, and real-time inference options. The interface includes fields for endpoint name, instance type (ml.g5.12xlarge), and initial instance count.

The deployment process can take several minutes to complete.

When deployment is complete, your endpoint status will change to InService. At this point, the model is ready to accept inference requests through the endpoint. You can monitor the deployment progress on the SageMaker console Endpoints page, which will display relevant metrics and status information. When the deployment is complete, you can invoke the model using a SageMaker runtime client and integrate it with your applications.
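As a quick sketch, you can send a request to the endpoint with the SageMaker runtime client using the same inputs/parameters payload format shown in the SDK example later in this post. The endpoint name below is a placeholder; use the name shown on the Endpoints page.

import json
import boto3

# Placeholder endpoint name; copy the actual name from the SageMaker Endpoints page.
endpoint_name = "jumpstart-dft-qwen3-32b-endpoint"

runtime = boto3.client("sagemaker-runtime")
payload = {
    "inputs": "Hello, what is 1+1?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.6, "top_p": 0.9},
}
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))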

Deploy Qwen3-32B using the SageMaker Python SDK

To get started with Qwen3-32B using the SageMaker Python SDK, you must install the SageMaker Python SDK and make sure you have the necessary AWS permissions and environment set up. The following is a step-by-step code example that demonstrates how to deploy and use Qwen3-32B for inference programmatically:

!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2

from sagemaker.serve.builder.model_builder import ModelBuilder 
from sagemaker.serve.builder.schema_builder import SchemaBuilder 
from sagemaker.jumpstart.model import ModelAccessConfig 
from sagemaker.session import Session 
import logging 

sagemaker_session = Session()
artifacts_bucket_name = sagemaker_session.default_bucket() 
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# Qwen3-32B JumpStart model ID
js_model_id = "huggingface-reasoning-qwen3-32b"
gpu_instance_type = "ml.g5.12xlarge"

response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {
        "max_new_tokens": 128, 
        "top_p": 0.9, 
        "temperature": 0.6
    }
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder( 
    model=js_model_id, 
    schema_builder=schema_builder, 
    sagemaker_session=sagemaker_session, 
    role_arn=execution_role_arn, 
    log_level=logging.ERROR 
) 

model = model_builder.build() 

predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)}, 
    accept_eula=True
) 

predictor.predict(sample_input)

You can run additional requests against the predictor:

new_input = {
    "inputs": "What is Amazon doing in Generative AI?",
    "parameters": {"max_new_tokens": 64, "top_p": 0.8, "temperature": 0.7},
}

prediction = predictor.predict(new_input)
print(prediction)

The following example adds error handling and retry logic as best practices to enhance the deployment code:

# Enhanced deployment code with error handling
import backoff
import botocore
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@backoff.on_exception(backoff.expo, 
                     (botocore.exceptions.ClientError,),
                     max_tries=3)
def deploy_model_with_retries(model_builder, model_id):
    try:
        model = model_builder.build()
        predictor = model.deploy(
            model_access_configs={model_id:ModelAccessConfig(accept_eula=True)},
            accept_eula=True
        )
        return predictor
    except Exception as e:
        logger.error(f"Deployment failed: {str(e)}")
        raise

def safe_predict(predictor, input_data):
    try:
        return predictor.predict(input_data)
    except Exception as e:
        logger.error(f"Prediction failed: {str(e)}")
        return None
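A brief usage sketch for these helpers, reusing the model_builder and js_model_id objects created earlier (the prompt text is arbitrary):

# Deploy with retries, then run a guarded prediction.
predictor = deploy_model_with_retries(model_builder, js_model_id)

result = safe_predict(predictor, {
    "inputs": "What is Amazon SageMaker JumpStart?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.6},
})
if result is None:
    logger.warning("Prediction failed; inspect the endpoint logs before retrying.")
else:
    print(result)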

Clean up

To avoid unwanted charges, complete the steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Select the endpoint, and on the Actions menu, choose Delete.
  4. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  5. Choose Delete to delete the endpoint.
  6. In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart predictor

The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we explored how you can access and deploy the Qwen3 models using Amazon Bedrock Marketplace and SageMaker JumpStart. With support for both the full parameter models and its distilled versions, you can choose the optimal model size for your specific use case. Visit SageMaker JumpStart in Amazon SageMaker Studio or Amazon Bedrock Marketplace to get started. For more information, refer to Use Amazon Bedrock tooling with Amazon SageMaker JumpStart models, SageMaker JumpStart pretrained models, Amazon SageMaker JumpStart Foundation Models, Amazon Bedrock Marketplace, and Getting started with Amazon SageMaker JumpStart.

The Qwen3 family of LLMs offers exceptional versatility and performance, making it a valuable addition to the AWS foundation model offerings. Whether you’re building applications for content generation, analysis, or complex reasoning tasks, Qwen3’s advanced architecture and extensive context window make it a powerful choice for your generative AI needs.


About the authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s degree in Computer Science and Bioinformatics.

Avan Bala is a Solutions Architect at AWS. His area of focus is AI for DevOps and machine learning. He holds a bachelor’s degree in Computer Science with a minor in Mathematics and Statistics from the University of Maryland. Avan is currently working with the Enterprise Engaged East Team and likes to specialize in projects about emerging AI technologies.

Mohhid Kidwai is a Solutions Architect at AWS. His area of focus is generative AI and machine learning solutions for small-medium businesses. He holds a bachelor’s degree in Computer Science with a minor in Biological Science from North Carolina State University. Mohhid is currently working with the SMB Engaged East Team at AWS.

Yousuf Athar is a Solutions Architect at AWS specializing in generative AI and AI/ML. With a Bachelor’s degree in Information Technology and a concentration in Cloud Computing, he helps customers integrate advanced generative AI capabilities into their systems, driving innovation and competitive edge. Outside of work, Yousuf loves to travel, watch sports, and play football.

John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.

Rohit Talluri is a Generative AI GTM Specialist at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Read More

Build a just-in-time knowledge base with Amazon Bedrock

Build a just-in-time knowledge base with Amazon Bedrock

Software as a service (SaaS) companies managing multiple tenants face a critical challenge: efficiently extracting meaningful insights from vast document collections while controlling costs. Traditional approaches often lead to unnecessary spending on unused storage and processing resources, impacting both operational efficiency and profitability. Organizations need solutions that intelligently scale processing and storage resources based on actual tenant usage patterns while maintaining data isolation. Traditional Retrieval Augmented Generation (RAG) systems consume valuable resources by ingesting and maintaining embeddings for documents that might never be queried, resulting in unnecessary storage costs and reduced system efficiency. Systems designed to handle large numbers of small to mid-sized tenants can exceed cost structure and infrastructure limits or might need to use silo-style deployments to keep each tenant’s information and usage separate. Adding to this complexity, many projects are transitory in nature, with work being completed on an intermittent basis, leading to data occupying space in knowledge base systems that could be used by other active tenants.

To address these challenges, this post presents a just-in-time knowledge base solution that reduces unused consumption through intelligent document processing. The solution processes documents only when needed and automatically removes unused resources, so organizations can scale their document repositories without proportionally increasing infrastructure costs.

With a multi-tenant architecture with configurable limits per tenant, service providers can offer tiered pricing models while maintaining strict data isolation, making it ideal for SaaS applications serving multiple clients with varying needs. Automatic document expiration through Time-to-Live (TTL) makes sure the system remains lean and focused on relevant content, while refreshing the TTL for frequently accessed documents maintains optimal performance for information that matters. This architecture also makes it possible to limit the number of files each tenant can ingest at a specific time and the rate at which tenants can query a set of files.

This solution uses serverless technologies to alleviate operational overhead and provide automatic scaling, so teams can focus on business logic rather than infrastructure management. By organizing documents into groups with metadata-based filtering, the system enables contextual querying that delivers more relevant results while maintaining security boundaries between tenants. The architecture’s flexibility supports customization of tenant configurations, query rates, and document retention policies, making it adaptable to evolving business requirements without significant rearchitecting.

Solution overview

This architecture combines several AWS services to create a cost-effective, multi-tenant knowledge base solution that processes documents on demand. The key components include:

  • Vector-based knowledge base – Uses Amazon Bedrock and Amazon OpenSearch Serverless for efficient document processing and querying
  • On-demand document ingestion – Implements just-in-time processing using the Amazon Bedrock CUSTOM data source type
  • TTL management – Provides automatic cleanup of unused documents using the TTL feature in Amazon DynamoDB
  • Multi-tenant isolation – Enforces secure data separation between users and organizations with configurable resource limits

The solution enables granular control through metadata-based filtering at the user, tenant, and file level. The DynamoDB TTL tracking system supports tiered pricing structures, where tenants can pay for different TTL durations, document ingestion limits, and query rates.

The following diagram illustrates the key components and workflow of the solution.

Multi-tier AWS serverless architecture diagram showcasing data flow and integration of various AWS services

The workflow consists of the following steps:

  1. The user logs in to the system, which attaches a tenant ID to the current user for calls to the Amazon Bedrock knowledge base. This authentication step is crucial because it establishes the security context and makes sure subsequent interactions are properly associated with the correct tenant. The tenant ID becomes the foundational piece of metadata that enables proper multi-tenant isolation and resource management throughout the entire workflow.
  2. After authentication, the user creates a project that will serve as a container for the files they want to query. This project creation step establishes the organizational structure needed to manage related documents together. The system generates appropriate metadata and creates the necessary database entries to track the project’s association with the specific tenant, enabling proper access control and resource management at the project level.
  3. With a project established, the user can begin uploading files. The system manages this process by generating pre-signed URLs for secure file upload (a minimal example of this step follows the list). As files are uploaded, they are stored in Amazon Simple Storage Service (Amazon S3), and the system automatically creates entries in DynamoDB that associate each file with both the project and the tenant. This three-way relationship (file-project-tenant) is essential for maintaining proper data isolation and enabling efficient querying later.
  4. When a user requests to create a chat with a knowledge base for a specific project, the system begins ingesting the project files using the CUSTOM data source. This is where the just-in-time processing begins. During ingestion, the system applies a TTL value based on the tenant’s tier-specific TTL interval. The TTL makes sure project files remain available during the chat session while setting up the framework for automatic cleanup later. This step represents the core of the on-demand processing strategy, because files are only processed when they are needed.
  5. Each chat session actively updates the TTL for the project files being used. This dynamic TTL management makes sure frequently accessed files remain in the knowledge base while allowing rarely used files to expire naturally. The system continually refreshes the TTL values based on actual usage, creating an efficient balance between resource availability and cost optimization. This approach maintains optimal performance for actively used content while helping to prevent resource waste on unused documents.
  6. After the chat session ends and the TTL value expires, the system automatically removes files from the knowledge base. This cleanup process is triggered by Amazon DynamoDB Streams monitoring TTL expiration events, which activate an AWS Lambda function to remove the expired documents. This final step reduces the load on the underlying OpenSearch Serverless cluster and optimizes system resources, making sure the knowledge base remains lean and efficient.
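As a minimal illustration of step 3, a pre-signed upload URL can be generated with boto3 roughly as follows; the bucket and key names are placeholders, and the deployed solution derives them from the tenant, project, and file identifiers.

import boto3

s3_client = boto3.client("s3")

# Placeholder bucket and key; the solution builds these from tenant, project, and file IDs.
upload_url = s3_client.generate_presigned_url(
    "put_object",
    Params={"Bucket": "example-tenant-uploads", "Key": "tenant-123/project-456/file.pdf"},
    ExpiresIn=900,  # the URL stays valid for 15 minutes
)
print(upload_url)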

Prerequisites

You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 AWS Region.

Deploy the solution

Complete the following steps to deploy the solution:

  1. Download the AWS CDK project from the GitHub repo.
  2. Install the project dependencies:
    npm run install:all

  3. Deploy the solution:
    npm run deploy

  4. Create a user and log in to the system after validating your email.

Validate the knowledge base and run a query

Before allowing users to chat with their documents, the system performs the following steps:

  • Performs a validation check to determine if documents need to be ingested. This process happens transparently to the user and includes checking document status in DynamoDB and the knowledge base.
  • Validates that the required documents are successfully ingested and properly indexed before allowing queries.
  • Returns both the AI-generated answers and relevant citations to source documents, maintaining traceability and empowering users to verify the accuracy of responses.

The following screenshot illustrates an example of chatting with the documents.

AWS Just In Time Knowledge Base interface displaying project files and AI-powered question-answering feature

Looking at the following example method for file ingestion, note how file information is stored in DynamoDB with a TTL value for automatic expiration. The ingest knowledge base documents call includes essential metadata (user ID, tenant ID, and project), enabling precise filtering of this tenant’s files in subsequent operations.

# Ingesting files with tenant-specific TTL values
def ingest_files(user_id, tenant_id, project_id, files):
    # Get tenant configuration and calculate TTL
    tenants = json.loads(os.environ.get('TENANTS'))['Tenants']
    tenant = find_tenant(tenant_id, tenants)
    ttl = int(time.time()) + (int(tenant['FilesTTLHours']) * 3600)
    
    # For each file, create a record with TTL and start ingestion
    for file in files:
        file_id = file['id']
        s3_key = file.get('s3Key')
        bucket = file.get('bucket')
        
        # Create a record in the knowledge base files table with TTL
        knowledge_base_files_table.put_item(
            Item={
                'id': file_id,
                'userId': user_id,
                'tenantId': tenant_id,
                'projectId': project_id,
                'documentStatus': 'ready',
                'createdAt': int(time.time()),
                'ttl': ttl  # TTL value for automatic expiration
            }
        )
        
        # Start the ingestion job with tenant, user, and project metadata for filtering
        bedrock_agent.ingest_knowledge_base_documents(
            knowledgeBaseId=KNOWLEDGE_BASE_ID,
            dataSourceId=DATA_SOURCE_ID,
            clientToken=str(uuid.uuid4()),
            documents=[
                {
                    'content': {
                        'dataSourceType': 'CUSTOM',
                        'custom': {
                            'customDocumentIdentifier': {
                                'id': file_id
                            },
                            's3Location': {
                                'uri': f"s3://{bucket}/{s3_key}"
                            },
                            'sourceType': 'S3_LOCATION'
                        }
                    },
                    'metadata': {
                        'type': 'IN_LINE_ATTRIBUTE',
                        'inlineAttributes': [
                            {'key': 'userId', 'value': {'stringValue': user_id, 'type': 'STRING'}},
                            {'key': 'tenantId', 'value': {'stringValue': tenant_id, 'type': 'STRING'}},
                            {'key': 'projectId', 'value': {'stringValue': project_id, 'type': 'STRING'}},
                            {'key': 'fileId', 'value': {'stringValue': file_id, 'type': 'STRING'}}
                        ]
                    }
                }
            ]
        )

During a query, you can use the associated metadata to construct parameters that make sure you only retrieve files belonging to this specific tenant. For example:

    filter_expression = {
        "andAll": [
            {
                "equals": {
                    "key": "tenantId",
                    "value": tenant_id
                }
            },
            {
                "equals": {
                    "key": "projectId",
                    "value": project_id
                }
            },
            {
                "in": {
                    "key": "fileId",
                    "value": file_ids
                }
            }
        ]
    }

    # Create base parameters for the API call
    retrieve_params = {
        'input': {
            'text': query
        },
        'retrieveAndGenerateConfiguration': {
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': knowledge_base_id,
                'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0',
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': limit,
                        'filter': filter_expression
                    }
                }
            }
        }
    }
    response = bedrock_agent_runtime.retrieve_and_generate(**retrieve_params)

Manage the document lifecycle with TTL

To further optimize resource usage and costs, you can implement an intelligent document lifecycle management system using the DynamoDB TTL feature. This consists of the following steps:

  1. When a document is ingested into the knowledge base, a record is created with a configurable TTL value.
  2. This TTL is refreshed when the document is accessed.
  3. DynamoDB Streams with specific filters for TTL expiration events is used to trigger a cleanup Lambda function.
  4. The Lambda function removes expired documents from the knowledge base.

See the following code:

# Lambda function triggered by DynamoDB Streams when TTL expires items
def lambda_handler(event, context):
    """
    This function is triggered by DynamoDB Streams when TTL expires items.
    It removes expired documents from the knowledge base.
    """
    
    # Process each record in the event
    for record in event.get('Records', []):
        # Check if this is a TTL expiration event (REMOVE event from DynamoDB Stream)
        if record.get('eventName') == 'REMOVE':
            # Check if this is a TTL expiration
            user_identity = record.get('userIdentity', {})
            if user_identity.get('type') == 'Service' and user_identity.get('principalId') == 'dynamodb.amazonaws.com':
                # Extract the file ID and tenant ID from the record
                keys = record.get('dynamodb', {}).get('Keys', {})
                file_id = keys.get('id', {}).get('S')
                
                # Delete the document from the knowledge base
                bedrock_agent.delete_knowledge_base_documents(
                    clientToken=str(uuid.uuid4()),
                    knowledgeBaseId=knowledge_base_id,
                    dataSourceId=data_source_id,
                    documentIdentifiers=[
                        {
                            'custom': {
                                'id': file_id
                            },
                            'dataSourceType': 'CUSTOM'
                        }
                    ]
                )
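The function above covers step 4 of the lifecycle. For step 2, refreshing the TTL when a document is accessed can be a single DynamoDB update; the following is a minimal sketch that reuses the knowledge_base_files_table and FilesTTLHours configuration from the ingestion example.

# Minimal sketch: extend the TTL for a file record each time it is accessed
# during a chat session, using the same table as the ingestion example.
import time

def refresh_file_ttl(file_id, ttl_hours):
    new_ttl = int(time.time()) + int(ttl_hours) * 3600
    knowledge_base_files_table.update_item(
        Key={'id': file_id},
        UpdateExpression='SET #ttl = :ttl',
        ExpressionAttributeNames={'#ttl': 'ttl'},  # alias 'ttl' to avoid reserved word conflicts
        ExpressionAttributeValues={':ttl': new_ttl}
    )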

Multi-tenant isolation with tiered service levels

Our architecture enables sophisticated multi-tenant isolation with tiered service levels:

  • Tenant-specific document filtering – Each query includes user, tenant, and file-specific filters, allowing the system to reduce the number of documents being queried.
  • Configurable TTL values – Different tenant tiers can have different TTL configurations (an example configuration follows this list). For example:
    • Free tier: Five documents ingested with a 7-day TTL and five queries per minute.
    • Standard tier: 100 documents ingested with a 30-day TTL and 10 queries per minute.
    • Premium tier: 1,000 documents ingested with a 90-day TTL and 50 queries per minute.
    • You can configure additional limits, such as total queries per month or total ingested files per day or month.
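The ingestion code shown earlier reads this tier configuration from a TENANTS environment variable and looks up FilesTTLHours for the caller’s tenant. A hypothetical configuration mirroring the tiers above might look like the following; every field name except Tenants and FilesTTLHours is an illustrative assumption.

# Hypothetical TENANTS value (stored as a JSON string in the environment variable).
# Only the Tenants list and FilesTTLHours are referenced by the ingestion snippet above.
TENANTS = {
    "Tenants": [
        {"Id": "free",     "FilesTTLHours": 24 * 7,  "MaxFiles": 5,    "QueriesPerMinute": 5},
        {"Id": "standard", "FilesTTLHours": 24 * 30, "MaxFiles": 100,  "QueriesPerMinute": 10},
        {"Id": "premium",  "FilesTTLHours": 24 * 90, "MaxFiles": 1000, "QueriesPerMinute": 50}
    ]
}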

Clean up

To clean up the resources created in this post, run the following command from the same location where you performed the deploy step:

npm run destroy

Conclusion

The just-in-time knowledge base architecture presented in this post transforms document management across multiple tenants by processing documents only when queried, reducing the unused consumption of traditional RAG systems. This serverless implementation uses Amazon Bedrock, OpenSearch Serverless, and the DynamoDB TTL feature to create a lean system with intelligent document lifecycle management, configurable tenant limits, and strict data isolation, which is essential for SaaS providers offering tiered pricing models.

This solution directly addresses cost structure and infrastructure limitations of traditional systems, particularly for deployments handling numerous small to mid-sized tenants with transitory projects. This architecture combines on-demand document processing with automated lifecycle management, delivering a cost-effective, scalable resource that empowers organizations to focus on extracting insights rather than managing infrastructure, while maintaining security boundaries between tenants.

Ready to implement this architecture? The full sample code is available in the GitHub repository.


About the author

Steven Warwick is a Senior Solutions Architect at AWS, where he leads customer engagements to drive successful cloud adoption and specializes in SaaS architectures and Generative AI solutions. He produces educational content including blog posts and sample code to help customers implement best practices, and has led programs on GenAI topics for solution architects. Steven brings decades of technology experience to his role, helping customers with architectural reviews, cost optimization, and proof-of-concept development.

Read More

Agents as escalators: Real-time AI video monitoring with Amazon Bedrock Agents and video streams

Agents as escalators: Real-time AI video monitoring with Amazon Bedrock Agents and video streams

Organizations deploying video monitoring systems face a critical challenge: processing continuous video streams while maintaining accurate situational awareness. Traditional monitoring approaches that use rule-based detection or basic computer vision frequently miss important events or generate excessive false positives, leading to operational inefficiencies and alert fatigue.

In this post, we show how to build a fully deployable solution that processes video streams with OpenCV, uses Amazon Bedrock for contextual scene understanding, and automates responses through Amazon Bedrock Agents. This solution extends the capabilities demonstrated in Automate chatbot for document and data retrieval using Amazon Bedrock Agents and Knowledge Bases, which discussed using Amazon Bedrock Agents for document and data retrieval. In this post, we apply Amazon Bedrock Agents to real-time video analysis and event monitoring.

Benefits of using Amazon Bedrock Agents for video monitoring

The following figure shows example video stream inputs from different monitoring scenarios. With contextual scene understanding, users can search for specific events.

A front door camera captures many events throughout the day, but only some of them matter. Having the context to tell whether a package is being delivered or removed (as in the following package example) limits alerts to urgent events.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from leading AI companies through a single API. Using Amazon Bedrock, you can build secure, responsible generative AI applications. Amazon Bedrock Agents extends these capabilities by enabling applications to execute multi-step tasks across systems and data sources, making it ideal for complex monitoring scenarios. The solution processes video streams through these key steps:

  1. Extract frames when motion is detected from live video streams or local files.
  2. Analyze context using multimodal FMs.
  3. Make decisions using agent-based logic with configurable responses.
  4. Maintain searchable semantic memory of events.

You can build this intelligent video monitoring system using Amazon Bedrock Agents and Amazon Bedrock Knowledge Bases in an automated solution. The complete code is available in the GitHub repo.

Limitations of current video monitoring systems

Organizations deploying video monitoring systems face a fundamental dilemma. Despite advances in camera technology and storage capabilities, the intelligence layer interpreting video feeds often remains rudimentary. This creates a challenging situation where security teams must make significant trade-offs in their monitoring approach. Current video monitoring solutions typically force organizations to choose between the following:

  • Simple rules that scale but generate excessive false positives
  • Complex rules that require ongoing maintenance and customization
  • Manual monitoring that relies on human attention and doesn’t scale
  • Point solutions that only handle specific scenarios but lack flexibility

These trade-offs create fundamental barriers to effective video monitoring that impact security, safety, and operational efficiency across industries. Based on our work with customers, we’ve identified three critical challenges that emerge from these limitations:

  • Alert fatigue – Traditional motion detection and object recognition systems generate alerts for any detected change or recognized object. Security teams quickly become overwhelmed by the volume of notifications for normal activities. This leads to reduced attention when genuinely critical events occur, diminishing security effectiveness and increasing operational costs from constant human verification of false alarms.
  • Limited contextual understanding – Rule-based systems fundamentally struggle with nuanced scene interpretation. Even sophisticated traditional systems operate with limited understanding of the environments they monitor due to a lack of contextual awareness, because they can’t easily do the following:
    • Distinguish normal from suspicious behavior
    • Understand temporal patterns like recurring weekly events
    • Consider environmental context such as time of day or location
    • Correlate multiple events that might indicate a pattern
  • Lack of semantic memory – Conventional systems lack the ability to build and use knowledge over time. They can’t do the following:
    • Establish baselines of routine versus unusual events
    • Offer natural language search capabilities across historical data
    • Support reasoning about emerging patterns

Without these capabilities, you can’t gain cumulative benefits from your monitoring infrastructure or perform sophisticated retrospective analysis. To address these challenges effectively, you need a fundamentally different approach. By combining the contextual understanding capabilities of FMs with a structured framework for event classification and response, you can build more intelligent monitoring systems. Amazon Bedrock Agents provides the ideal platform for this next-generation approach.

Solution overview

You can address these monitoring challenges by building a video monitoring solution with Amazon Bedrock Agents. The system intelligently screens events, filters routine activity, and escalates situations requiring human attention, helping reduce alert fatigue while improving detection accuracy. The solution uses Amazon Bedrock Agents to analyze detected motion from video, and alerts users when an event of interest happens according to the provided instructions. This allows the system to intelligently filter out trivial events that can trigger motion detection, such as wind or birds, and direct the user’s attention only to events of interest. The following diagram illustrates the solution architecture.

The solution uses three primary components to address the core challenges: agents as escalators, a video processing pipeline, and Amazon Bedrock Agents. We discuss these components in more detail in the following sections.

The solution uses the AWS Cloud Development Kit (AWS CDK) to deploy the solution components. The AWS CDK is an open source software development framework for defining cloud infrastructure as code and provisioning it through AWS CloudFormation.

Agent as escalators

The first component uses Amazon Bedrock Agents to examine detected motion events with the following capabilities:

  • Provides natural language understanding of scenes and activities for contextual interpretation
  • Maintains temporal awareness across frame sequences to understand event progression
  • References historical patterns to distinguish unusual from routine events
  • Applies contextual reasoning about behaviors, considering factors like time of day, location, and action sequences

We implement a graduated response framework that categorizes events by severity and required action:

  • Level 0: Log only – The system logs normal or expected activities. For example, when a delivery person arrives during business hours or a recognized vehicle enters the driveway, these events are documented for pattern analysis and future reference but require no immediate action. They remain searchable in the event history.
  • Level 1: Human notification – This level handles unusual but non-critical events that warrant human attention. An unrecognized vehicle parked nearby, an unexpected visitor, or unusual movement patterns trigger a notification to security personnel. These events require human verification and assessment.
  • Level 2: Immediate response – Reserved for critical security events. Unauthorized access attempts, detection of smoke or fire, or suspicious behavior trigger automatic response actions through API calls. The system notifies personnel through SMS or email alerts with event information and context.
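As a rough sketch of how these levels could be routed (this is illustrative rather than the exact code in the repository), an escalation step might publish notifications through Amazon SNS; the topic ARN below is a placeholder:

```
import boto3

sns = boto3.client("sns")

# Placeholder topic ARN; the deployed stack configures its own SNS topics.
ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:video-monitoring-alerts"

def escalate(alert_level: int, reason: str, description: str) -> None:
    """Route an analyzed event according to the graduated response framework."""
    if alert_level == 0:
        # Level 0: log only; the event remains searchable in the event history.
        print(f"LOG: {reason}")
        return
    prefix = "NOTIFY" if alert_level == 1 else "IMMEDIATE RESPONSE"
    sns.publish(
        TopicArn=ALERT_TOPIC_ARN,
        Subject=f"{prefix}: {reason}",
        Message=description,
    )
```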

The solution provides an interactive processing and monitoring interface through a Streamlit application. With the Streamlit UI, users can provide instructions and interact with the agent.

The application consists of the following key features:

  • Live stream or video file input – The application accepts M3U8 stream URLs from webcams or security feeds, or local video files in common formats (MP4, AVI). Both are processed using the same motion detection pipeline that saves triggered events to Amazon Simple Storage Service (Amazon S3) for agent analysis.
  • Custom instructions – Users can provide specific monitoring guidance, such as “Alert me about unknown individuals near the loading dock after hours” or “Focus on vehicle activity in the parking area.” These instructions adjust how the agent interprets detected motion events.
  • Notification configuration – Users can specify contact information for different alert levels. The system uses Amazon Simple Notification Service (Amazon SNS) to send emails or text messages based on event severity, so different personnel can be notified for potential issues vs. critical situations.
  • Natural language queries about past events – The interface includes a chat component for historical event retrieval. Users can ask “What vehicles have been in the driveway this week?” or “Show me any suspicious activity from last night,” receiving responses based on the system’s event memory.

Video processing pipeline

The solution uses several AWS services to capture and prepare video data for analysis through a modular processing pipeline. The solution supports multiple types of video sources, including live streams and local video files.

When using streams, OpenCV’s VideoCapture component handles the connection and frame extraction. For testing, we’ve included sample event videos demonstrating different scenarios. The core of the video processing is a modular pipeline implemented in Python. Key components include:

  • SimpleMotionDetection – Identifies movement in the video feed
  • FrameSampling – Captures sequences of frames over time when motion is detected
  • GridAggregator – Organizes multiple frames into a visual grid for context
  • S3Storage – Stores captured frame sequences in Amazon S3

This multi-process framework optimizes performance by running components concurrently and maintaining a queue of frames to process. The video processing pipeline organizes captured frame data in a structured way before passing it to the Amazon Bedrock agent for analysis:

  • Frame sequence storage – When motion is detected, the system captures a sequence of frames over 10 seconds. These frames are stored in Amazon S3 using a timestamp-based path structure (YYYYMMDD-HHMMSS) that allows for efficient retrieval by date and time. In the case where motions exceed 10 seconds, multiple events are created.
  • Image grid format – Rather than processing individual frames separately, the system arranges multiple sequential frames into a grid format (typically 3×4 or 4×5). This presentation provides temporal context and is sent to the Amazon Bedrock agent for analysis. The grid format enables understanding of how motion progresses over time, which is critical for accurate scene interpretation.

The following figure is an example of an image grid sent to the agent. Package theft is difficult to identify with classic image models. The large language model’s (LLM’s) ability to reason over a sequence of images allows it to make observations about intent.

The video processing pipeline’s output—timestamped frame grids stored in Amazon S3—serves as the input for the Amazon Bedrock agent components, which we discuss in the next section.

Amazon Bedrock agent components

The solution integrates multiple Amazon Bedrock services to create an intelligent analysis system:

  • Core agent architecture – The agent orchestrates these key workflows:
    • Receives frame grids from Amazon S3 on motion detection
    • Coordinates multi-step analysis processes
    • Makes classification decisions
    • Triggers appropriate response actions
    • Maintains event context and state
  • Knowledge management – The solution uses Amazon Bedrock Knowledge Bases with Amazon OpenSearch Serverless to:
    • Store and index historical events
    • Build baseline activity patterns
    • Enable natural language querying
    • Track temporal patterns
    • Support contextual analysis
  • Action groups – The agent has access to several actions defined through API schemas:
    • Analyze grid – Process incoming frame grids from Amazon S3
    • Alert – Send notifications through Amazon SNS based on severity
    • Log – Record event details for future reference
    • Search events by date – Retrieve past events based on a date range
    • Look up vehicle (Text-to-SQL) – Query the vehicle database for information

For structured data queries, the system uses the FM’s ability to convert natural language to SQL. This enables the following:

  • Querying Amazon Athena tables containing event records
  • Retrieving information about registered vehicles
  • Generating reports from structured event data

These components work together to create a comprehensive system that can analyze video content, maintain event history, and support both real-time alerting and retrospective analysis through natural language interaction.

Video processing framework

The video processing framework implements a multi-process architecture for handling video streams through composable processing chains.

Modular pipeline architecture

The framework uses a composition-based approach built around the FrameProcessor abstract base class.

Processing components implement a consistent interface with a process(frame) method that takes a Frame and returns a potentially modified Frame:

```
class FrameProcessor(ABC):
    @abstractmethod
    def process(self, frame: Frame) -> Optional[Frame]: ...
```

The Frame class encapsulates the image data along with timestamps, indexes, and extensible metadata:

```
@dataclass
class Frame:
    buffer: ndarray  # OpenCV image array
    timestamp: float
    index: float
    fps: float
    metadata: dict = field(default_factory=dict)
```

Customizable processing chains

The architecture supports configuring multiple processing chains that can be connected in sequence. The solution uses two primary chains. The detection and analysis chain processes incoming video frames to identify events of interest:

```
chain = FrameProcessorChain([
    SimpleMotionDetection(motion_threshold=10_000, frame_skip_size=1),
    FrameSampling(timedelta(milliseconds=250), threshold_time=timedelta(seconds=2)),
    GridAggregator(shape=(13, 3))
])
```

The storage and notification chain handles the storage of identified events and invocation of the agent:

```
storage_chain = FrameProcessorChain([
    S3Storage(bucket_name=TARGET_S3_BUCKET, prefix=S3_PREFIX, s3_client_provider=s3_client_provider),
    LambdaProcessor(get_response=get_response, monitoring_instructions=config.monitoring_instructions)
])
```

You can modify these chains independently to add or replace components based on specific monitoring requirements.

Component implementation

The solution includes several processing components that demonstrate the framework’s capabilities. You can modify each processing step or add new ones. For example, for simple motion detection, we use a simple pixel difference, but you can refine the motion detection functionality as needed, or follow the format to implement other detection algorithms, such as object detection or scene segmentation.

Additional components include the FrameSampling processor to control capture timing, the GridAggregator to create visual frame grids, and storage processors that save event data and trigger agent analysis; each can be customized or replaced as needed. For example:

  • Modify existing components – Adjust thresholds or parameters to tune for specific environments
  • Create alternative storage backends – Direct output to different storage services or databases
  • Implement preprocessing and postprocessing steps – Add image enhancement, data filtering, or additional context generation
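For example, a minimal custom processor that follows the FrameProcessor interface shown earlier might downscale frames before motion detection; this is an illustrative sketch rather than a component shipped with the solution:

```
import cv2
from typing import Optional

class DownscaleProcessor(FrameProcessor):
    """Illustrative preprocessing step that shrinks frames before later stages."""

    def __init__(self, scale: float = 0.5):
        self.scale = scale

    def process(self, frame: Frame) -> Optional[Frame]:
        # Resize the OpenCV buffer and record the scale in the frame metadata.
        frame.buffer = cv2.resize(
            frame.buffer, None, fx=self.scale, fy=self.scale,
            interpolation=cv2.INTER_AREA,
        )
        frame.metadata["downscaled"] = self.scale
        return frame
```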

Finally, the LambdaProcessor serves as the bridge to the Amazon Bedrock agent by invoking an AWS Lambda function that sends the information in a request to the deployed agent. From there, the Amazon Bedrock agent takes over and analyzes the event and takes action accordingly.

Agent implementation

After you deploy the solution, an Amazon Bedrock agent alias becomes available. This agent functions as an intelligent analysis layer, processing captured video events and executing appropriate actions based on its analysis. You can test the agent and view its reasoning trace directly on the Amazon Bedrock console, as shown in the following screenshot.

This agent will lack some of the metadata supplied by the Streamlit application (such as current time) and might not give the same answers as the full application.

Invocation flow

The agent is invoked through a Lambda function that handles the request-response cycle and manages session state. It finds the highest published version ID and uses it to invoke the agent and parses the response.
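A simplified invocation, omitting the version lookup and session management handled by the actual Lambda function, looks roughly like the following; the agent and alias IDs are placeholders:

```
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

# Placeholder IDs; the deployed stack outputs the real agent and alias IDs.
response = agent_runtime.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId="camera-1-session",
    inputText="Analyze the frame grid stored at s3://<bucket>/<event-prefix>/grid.jpg",
)

# The response is an event stream; concatenate the returned text chunks.
completion = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(completion)
```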

Action groups

The agent’s capabilities are defined through action groups implemented using the BedrockAgentResolver framework. This approach automatically generates the OpenAPI schema required by the agent.

When the agent is invoked, it receives an event object that includes the API path and other parameters that tell the agent framework how to route the request to the appropriate handler. You can add new actions by defining additional endpoint handlers following the same pattern and generating a new OpenAPI schema:

```
if __name__ == "__main__":
    print(app.get_openapi_json_schema())
```
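As an illustration of that pattern, a new action added with the AWS Lambda Powertools BedrockAgentResolver might look like the following sketch; the endpoint path and logic are hypothetical:

```
from typing import Annotated

from aws_lambda_powertools.event_handler import BedrockAgentResolver
from aws_lambda_powertools.event_handler.openapi.params import Query

app = BedrockAgentResolver()

@app.get("/camera_status", description="Returns the monitoring status of a camera")
def camera_status(
    camera_id: Annotated[str, Query(description="Identifier of the camera")],
) -> dict:
    # Hypothetical handler; the real action groups query event storage instead.
    return {"camera_id": camera_id, "status": "online"}

def lambda_handler(event, context):
    return app.resolve(event, context)
```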

Text-to-SQL integration

Through its action group, the agent is able to translate natural language queries into SQL for structured data analysis. The system reads data from assets/data_query_data_source, which can include various formats like CSV, JSON, ORC, or Parquet.

This capability enables users to query structured data using natural language. As demonstrated in the following example, the system translates natural language queries about vehicles into SQL, returning structured information from the database.

The database connection is configured through a SQLAlchemy engine. Users can connect to existing databases by updating the create_sql_engine() function to use their connection parameters.
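For example, an Athena-backed engine could be created with the PyAthena SQLAlchemy dialect roughly as follows; the Region, database, and staging bucket are placeholders, and this sketch assumes PyAthena is installed with its SQLAlchemy extra:

```
from urllib.parse import quote_plus
from sqlalchemy import create_engine

def create_sql_engine():
    # Placeholder values; point these at your own Athena database and results bucket.
    region = "us-east-1"
    database = "video_monitoring"
    s3_staging_dir = "s3://example-athena-results/"
    return create_engine(
        f"awsathena+rest://@athena.{region}.amazonaws.com:443/"
        f"{database}?s3_staging_dir={quote_plus(s3_staging_dir)}"
    )
```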

Event memory and semantic search

The agent maintains a detailed memory of past events, storing event logs with rich descriptions in Amazon S3. These events become searchable through both vector-based semantic search and date-based filtering. As shown in the following example, temporal queries make it possible to retrieve information about events within specific time periods, such as vehicles observed in the past 72 hours.

The system’s semantic memory capabilities enable queries based on abstract concepts and natural language descriptions. As shown in the following example, the agent can understand abstract concepts like “funny” and retrieve relevant events, such as a person dropping a birthday cake.

Events can be linked together by the agent to identify patterns or related incidents. For example, the system can correlate separate sightings of individuals with similar characteristics. In the following screenshots, the agent connects related incidents by identifying common attributes like clothing items across different events.

This event memory store allows the system to build knowledge over time, providing increasingly valuable insights as it accumulates data. The combination of structured database querying and semantic search across event descriptions creates an agent with a searchable memory of all past events.

Prerequisites

Before you deploy the solution, complete the following prerequisites:

  1. Configure AWS credentials using aws configure. Use either the us-west-2 or us-east-1 AWS Region.
  2. Enable access to Anthropic’s Claude 3.x models, or another supported Amazon Bedrock Agents model you want to use.
  3. Make sure you have the following dependencies:

Deploy the solution

The AWS CDK deployment creates the following resources:

  • Storage – S3 buckets for assets and query results
  • Amazon Bedrock resources – Agent and knowledge base
  • Compute – Lambda functions for actions, invocation, and updates
  • Database – Athena database for structured queries, and an AWS Glue crawler for data discovery

Deploy the solution with the following commands:

```
#1. Clone the repository and navigate to folder
git clone https://github.com/aws-samples/sample-video-monitoring-agent.git && cd sample-video-monitoring-agent
#2. Set up environment and install dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
#3. Deploy AWS resources
cdk bootstrap && cdk deploy
#4. Run the streamlit app
cd code/streamlit_app && streamlit run app.py
```

On Windows, replace the second line with the following code:

```
python3 -m venv .venv && .venv\Scripts\activate.bat && pip install -r requirements.txt
```

Clean up

To destroy the resources you created and stop incurring charges, run the following command:

```
cdk destroy
```

Future enhancements

The current implementation demonstrates the potential of agent-based video monitoring in a home security setting, but there are many potential applications.

Sample use cases

The following examples show how the solution applies in different scenarios.

Small business

{ "alert_level": 0, "timestamp": "2024-11-20T15:24:15Z", "reason": "Vehicle arrival in driveway", "description": "Standard vehicle arrival and parking sequence. Vehicles present: Black Nissan Frontier pickup (parked), silver Honda CR-V (arriving), and partial view of blue vehicle in foreground. Area features: Gravel driveway surface, two waste bins (County Waste and recycling), evergreen trees in background. Sequence shows Honda CR-V executing normal parking maneuver: approaches from east, performs standard three-point turn, achieves final position next to pickup truck. Daytime conditions, clear visibility. Vehicle condition: Clean, well-maintained CR-V appears to be 2012-2016 model year, no visible damage or unusual modifications. Movement pattern indicates familiar driver performing routine parking. No suspicious behavior or safety concerns observed. Timestamp indicates standard afternoon arrival time. Waste bins properly positioned and undisturbed during parking maneuver." }

Industrial

{ "alert_level": 2, "timestamp": "2024-11-20T15:24:15Z", "reason": "Warehouse product spill/safety hazard", "description": "Significant product spill incident in warehouse storage aisle. Location: Main warehouse aisle between high-bay racking systems containing boxed inventory. Sequence shows what appears to be liquid or container spill, likely water/beverage products based on blue colored containers visible. Infrastructure: Professional warehouse setup with multi-level blue metal racking, concrete flooring, overhead lighting. Incident progression: Initial frames show clean aisle, followed by product falling/tumbling, resulting in widespread dispersal of items across aisle floor. Hazard assessment: Creates immediate slip/trip hazard, blocks emergency egress path, potential damage to inventory. Area impact: Approximately 15-20 feet of aisle space affected. Facility type appears to be distribution center or storage warehouse. Multiple cardboard boxes visible on surrounding shelves potentially at risk from liquid damage." }

Backyard

{ "alert_level": 1, "timestamp": "2024-11-20T15:24:15Z", "reason": "Wildlife detected on property", "description": "Adult raccoon observed investigating porch/deck area with white railings. Night vision/IR camera provides clear footage of animal. Subject animal characteristics: medium-sized adult raccoon, distinctive facial markings clearly visible, healthy coat condition, normal movement patterns. Sequence shows animal approaching camera (15:42PM), investigating area near railing (15:43-15:44PM), with close facial examination (15:45PM). Final frame shows partial view as animal moves away. Environment: Location appears to be elevated deck/porch with white painted wooden railings and balusters. Lighting conditions: Nighttime, camera operating in infrared/night vision mode providing clear black and white footage. Animal behavior appears to be normal nocturnal exploration, no signs of aggression or disease." }

Home safety

{ "alert_level": 2, "timestamp": "2024-11-20T15:24:15Z", "reason": "Smoke/possible fire detected", "description": "Rapid development of white/grey smoke visible in living room area. Smoke appears to be originating from left side of frame, possibly near electronics/TV area. Room features: red/salmon colored walls, grey couch, illuminated aquarium, table lamps, framed artwork. Sequence shows progressive smoke accumulation over 4-second span (15:42PM – 15:46PM). Notable smoke density increase in upper left corner of frame with potential light diffusion indicating particulate matter in air. Smoke pattern suggests active fire development rather than residual smoke. Blue light from aquarium remains visible throughout sequence providing contrast reference for smoke density." }
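The JSON structure shown in these examples lends itself to simple programmatic routing. The following is a minimal, hypothetical Python sketch of how a downstream consumer might escalate events by alert level; the function, thresholds, and channel names are illustrative and not part of the deployed solution.

```
import json

# Hypothetical downstream routing of the agent's JSON output.
# Thresholds and channel names are illustrative only.
def route_alert(alert_json: str) -> str:
    alert = json.loads(alert_json)
    level = alert.get("alert_level", 0)
    if level >= 2:
        return f"PAGE on-call: {alert['reason']} at {alert['timestamp']}"
    if level == 1:
        return f"NOTIFY dashboard: {alert['reason']}"
    return "LOG only: routine activity"

example = '{"alert_level": 2, "timestamp": "2024-11-20T15:24:15Z", "reason": "Smoke/possible fire detected", "description": "..."}'
print(route_alert(example))
```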

Further extensions

In addition, you can extend the FM capabilities using the following methods:

  • Fine-tuning for specific monitoring contexts – Adapting the models to recognize domain-specific objects, behaviors, and scenarios
  • Refined prompts for specific use cases – Creating specialized instructions that optimize the agent’s performance for particular environments like industrial facilities, retail spaces, or residential settings

You can expand the agent’s ability to take action, for example:

  • Direct control of smart home and smart building systems – Integrating with Internet of Things (IoT) device APIs to control lights, locks, or alarm systems
  • Integration with security and safety protocols – Connecting to existing security infrastructure to follow established procedures
  • Automated response workflows – Creating multi-step action sequences that can be triggered by specific events

You can also consider enhancing the event memory system:

  • Long-term pattern recognition – Identifying recurring patterns over extended time periods
  • Cross-camera correlation – Linking observations from multiple cameras to track movement through a space
  • Anomaly detection based on historical patterns – Automatically identifying deviations from established baselines

Lastly, consider extending the monitoring capabilities beyond fixed cameras:

  • Monitoring for robotic vision systems – Applying the same intelligence to mobile robots that patrol or inspect areas
  • Drone-based surveillance – Processing aerial footage for comprehensive site monitoring
  • Mobile security applications – Extending the platform to process feeds from security personnel body cameras or mobile devices

These enhancements can transform the system from a passive monitoring tool into an active participant in security operations, with increasingly sophisticated understanding of normal patterns and anomalous events.

Conclusion

The approach of using agents as escalators represents a significant advancement in video monitoring, combining the contextual understanding capabilities of FMs with the action-oriented framework of Amazon Bedrock Agents. By filtering the signal from the noise, this solution addresses the critical problem of alert fatigue while enhancing security and safety monitoring capabilities. With this solution, you can:

  • Reduce false positives while maintaining high detection sensitivity
  • Provide human-readable descriptions and classifications of events
  • Maintain searchable records of all activity
  • Scale monitoring capabilities without proportional human resources

The combination of intelligent screening, graduated responses, and semantic memory enables a more effective and efficient monitoring system that enhances human capabilities rather than replacing them. Try the solution today and experience how Amazon Bedrock Agents can transform your video monitoring capabilities from simple motion detection to intelligent scene understanding.


About the authors

Kiowa Jackson is a Senior Machine Learning Engineer at AWS ProServe, specializing in computer vision and agentic systems for industrial applications. His work bridges classical machine learning approaches with generative AI to enhance industrial automation capabilities. His past work includes collaborations with Amazon Robotics, NFL, and Koch Georgia Pacific.

Piotr Chotkowski is a Senior Cloud Application Architect at AWS Generative AI Innovation Center. He has experience in hands-on software engineering as well as software architecture design. In his role at AWS, he helps customers design and build production grade generative AI applications in the cloud.

Read More

Transforming network operations with AI: How Swisscom built a network assistant using Amazon Bedrock

Transforming network operations with AI: How Swisscom built a network assistant using Amazon Bedrock

In the telecommunications industry, managing complex network infrastructures requires processing vast amounts of data from multiple sources. Network engineers often spend considerable time manually gathering and analyzing this data, taking away valuable hours that could be spent on strategic initiatives. This challenge led Swisscom, Switzerland’s leading telecommunications provider, to explore how AI can transform their network operations.

Swisscom’s Network Assistant, built on Amazon Bedrock, represents a significant step forward in automating network operations. This solution combines generative AI capabilities with a sophisticated data processing pipeline to help engineers quickly access and analyze network data. Swisscom used AWS services to create a scalable solution that reduces manual effort and provides accurate and timely network insights.

In this post, we explore how Swisscom developed their Network Assistant. We discuss the initial challenges and how they implemented a solution that delivers measurable benefits. We examine the technical architecture, discuss key learnings, and look at future enhancements that can further transform network operations. We highlight best practices for handling sensitive data for Swisscom to comply with the strict regulations governing the telecommunications industry. This post provides telecommunications providers or other organizations managing complex infrastructure with valuable insights into how you can use AWS services to modernize operations through AI-powered automation.

The opportunity: Improve network operations

Network engineers at Swisscom faced the daily challenge of managing complex network operations while maintaining optimal performance and compliance. These skilled professionals were tasked with monitoring and analyzing vast amounts of data from multiple, decoupled sources. The process was repetitive and demanded considerable time and attention to detail. In certain scenarios, fulfilling the assigned tasks consumed more than 10% of their availability. The manual nature of their work presented several critical pain points. Consolidating data from multiple network entities into a coherent overview was particularly challenging, because engineers had to navigate various tools and systems to retrieve telemetry information about data sources and network parameters from extensive documentation, verify KPIs through complex calculations, and identify potential issues of a diverse nature. This fragmented approach consumed valuable time and introduced the risk of human error in data interpretation and analysis. The situation called for a solution to address three primary concerns:

  • Efficiency in data retrieval and analysis
  • Accuracy in calculations and reporting
  • Scalability to accommodate growing data sources and use cases

The team required a streamlined approach to access and analyze network data, maintain compliance with defined metrics and thresholds, and deliver fast and accurate responses to events while maintaining the highest standards of data security and sovereignty.

Solution overview

Swisscom’s approach to develop the Network Assistant was methodical and iterative. The team chose Amazon Bedrock as the foundation for their generative AI application and implemented a Retrieval Augmented Generation (RAG) architecture using Amazon Bedrock Knowledge Bases to enable precise and contextual responses to engineer queries. The RAG approach is implemented in three distinct phases:

  • Retrieval – User queries are matched with relevant knowledge base content through embedding models
  • Augmentation – The context is enriched with retrieved information
  • Generation – The large language model (LLM) produces informed responses
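As an illustration of these three phases, the following is a minimal sketch using the Amazon Bedrock Knowledge Bases runtime API; the knowledge base ID, model ARN, and example query are placeholders rather than Swisscom's actual configuration.

```
import boto3

# Placeholder knowledge base ID and model ARN; replace with your own values.
client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "Which KPIs are defined for cell availability?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<knowledge-base-id>",
            "modelArn": "arn:aws:bedrock:<region>::foundation-model/<model-id>",
        },
    },
)

# The retrieved context grounds the generated answer.
print(response["output"]["text"])
```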

The following diagram illustrates the solution architecture.

Network Assistant Architecture

The solution architecture evolved through several iterations. The initial implementation established basic RAG functionality by feeding the Amazon Bedrock knowledge base with tabular data and documentation. However, the Network Assistant struggled to manage large input files containing thousands of rows with numerical values across multiple parameter columns. This complexity highlighted the need for a more selective approach that could identify only the rows relevant for specific KPI calculations. At that point, the retrieval process wasn’t returning the precise number of vector embeddings required to calculate the formulas, prompting the team to refine the solution for greater accuracy.

Subsequent iterations enhanced the assistant with agent-based processing and action groups. The team implemented AWS Lambda functions using Pandas or Spark for data processing, enabling accurate numerical results to be computed from the user's natural language prompt.

A significant advancement was introduced with the implementation of a multi-agent approach, using Amazon Bedrock Agents, where specialized agents handle different aspects of the system:

  • Supervisor agent – Orchestrates interactions between documentation management and calculator agents to provide comprehensive and accurate responses.
  • Documentation management agent – Helps the network engineers access information in large volumes of data efficiently and extract insights about data sources, network parameters, configuration, or tooling.
  • Calculator agent – Supports the network engineers to understand complex network parameters and perform precise data calculations out of telemetry data. This produces numerical insights that help perform network management tasks; optimize performance; maintain network reliability, uptime, and compliance; and assist in troubleshooting.

This following diagram illustrates the enhanced data extract, transform, and load (ETL) pipeline interaction with Amazon Bedrock.

Data pipeline

To achieve the desired accuracy in KPI calculations, the data pipeline was refined to achieve consistent and precise performance, which leads to meaningful insights. The team implemented an ETL pipeline with Amazon Simple Storage Service (Amazon S3) as the data lake to store input files following a daily batch ingestion approach, AWS Glue for automated data crawling and cataloging, and Amazon Athena for SQL querying. At this point, it became possible for the calculator agent to forego the Pandas or Spark data processing implementation. Instead, by using Amazon Bedrock Agents, the agent translates natural language user prompts into SQL queries. In a subsequent step, the agent runs the relevant SQL queries selected dynamically through analysis of various input parameters, providing the calculator agent an accurate result. This serverless architecture supports scalability, cost-effectiveness, and maintains high accuracy in KPI calculations. The system integrates with Swisscom’s on-premises data lake through daily batch data ingestion, with careful consideration of data security and sovereignty requirements.
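To illustrate the calculator agent's query path, the following is a simplified sketch of running a generated SQL statement through Athena with boto3; the database, table, SQL statement, and S3 output location are hypothetical stand-ins for Swisscom's actual resources.

```
import time
import boto3

athena = boto3.client("athena")

# Hypothetical SQL that an agent might generate from a natural language prompt.
query = 'SELECT AVG(throughput_mbps) FROM "network_db"."cell_kpis" WHERE day = CURRENT_DATE'

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "network_db"},          # placeholder database
    ResultConfiguration={"OutputLocation": "s3://<results-bucket>/athena/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then read the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```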

To enhance data security and appropriate ethics in the Network Assistant responses, a series of guardrails were defined in Amazon Bedrock. The application implements a comprehensive set of data security guardrails to protect against malicious inputs and safeguard sensitive information. These include content filters that block harmful categories such as hate, insults, violence, and prompt-based threats like SQL injection. Specific denied topics and sensitive identifiers (for example, IMSI, IMEI, MAC address, or GPS coordinates) are filtered through manual word filters and pattern-based detection, including regular expressions (regex). Sensitive data such as personally identifiable information (PII), AWS access keys, and serial numbers are blocked or masked. The system also uses contextual grounding and relevance checks to verify model responses are factually accurate and appropriate. In the event of restricted input or output, standardized messaging notifies the user that the request can’t be processed. These guardrails help prevent data leaks, reduce the risk of DDoS-driven cost spikes, and maintain the integrity of the application’s outputs.
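The following is a trimmed-down sketch of how such a guardrail could be defined with the Amazon Bedrock control-plane API; the content filters, PII entity, and IMSI-style regex are illustrative assumptions, not the production configuration described above.

```
import boto3

bedrock = boto3.client("bedrock")

# Illustrative guardrail; filter set, PII entity, and regex are assumptions.
bedrock.create_guardrail(
    name="network-assistant-guardrail",
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "INSULTS", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
    sensitiveInformationPolicyConfig={
        "piiEntitiesConfig": [{"type": "EMAIL", "action": "ANONYMIZE"}],
        "regexesConfig": [
            {
                "name": "imsi-like-number",
                "description": "Mask 15-digit IMSI-like identifiers",
                "pattern": r"\b\d{15}\b",
                "action": "ANONYMIZE",
            }
        ],
    },
    blockedInputMessaging="Sorry, this request can't be processed.",
    blockedOutputsMessaging="Sorry, this response can't be returned.",
)
```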

Results and benefits

The implementation of the Network Assistant is set to deliver substantial and measurable benefits to Swisscom’s network operations. The most significant impact is time savings. Network engineers are estimated to see a 10% reduction in time spent on routine data retrieval and analysis tasks. This efficiency gain translates to nearly 200 hours saved per engineer annually and represents a significant improvement in operational efficiency. The financial impact is equally impressive. The solution is projected to provide substantial cost savings per engineer annually, with operational costs of less than 1% of the total value generated. The return on investment increases as additional teams and use cases are incorporated into the system, demonstrating strong scalability potential.

Beyond the quantifiable benefits, the Network Assistant is expected to transform how engineers interact with network data. The enhanced data pipeline supports accuracy in KPI calculations, critical for network health tracking, and the multi-agent approach provides orchestrated and comprehensive responses to complex queries out of user natural language.

As a result, engineers have instant access to a wide range of network parameters, data source information, and troubleshooting guidance from a single personalized endpoint that they can query in natural language. This enables them to focus on strategic tasks rather than routine data gathering and analysis, leading to a significant reduction in routine workload that aligns with Swisscom’s SRE principles.

Lessons learned

Throughout the development and implementation of the Swisscom Network Assistant, several learnings emerged that shaped the solution. The team needed to address data sovereignty and security requirements for the solution, particularly when processing data on AWS. This led to careful consideration of data classification and compliance with applicable regulatory requirements in the telecommunications sector, to make sure that sensitive data is handled appropriately. In this regard, the application underwent a strict threat model evaluation, verifying the robustness of its interfaces against vulnerabilities and proactively hardening its security posture. The threat model was applied to assess worst-case scenarios, and data flow diagrams were created to depict major data flows inside and beyond the application boundaries. The AWS architecture was specified in detail, and trust boundaries were set to indicate which portions of the application trusted each other. Threats were identified following the STRIDE methodology (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege), and countermeasures, including Amazon Bedrock Guardrails, were defined to avoid or mitigate threats in advance.

A critical technical insight was that complex calculations involving significant data volume management required a different approach than mere AI model interpretation. The team implemented an enhanced data processing pipeline that combines the contextual understanding of AI models with direct database queries for numerical calculations. This hybrid approach facilitates both accuracy in calculations and richness in contextual responses.

The choice of a serverless architecture proved to be particularly beneficial: it minimized the need to manage compute resources and provides automatic scaling capabilities. The pay-per-use model of AWS services helped keep operational costs low and maintain high performance. Additionally, the team’s decision to implement a multi-agent approach provided the flexibility needed to handle diverse types of queries and use cases effectively.

Next steps

Swisscom has ambitious plans to enhance the Network Assistant’s capabilities further. A key upcoming feature is the implementation of a network health tracker agent to provide proactive monitoring of network KPIs. This agent will automatically generate reports that categorize issues based on criticality, enable faster response times, and improve the quality of resolution for potential network issues. The team is also exploring the integration of Amazon Simple Notification Service (Amazon SNS) to enable proactive alerting for critical network status changes. This can include direct integration with operational tools that alert on-call engineers, to further streamline the incident response process. The enhanced notification system will help engineers address potential issues before they critically impact network performance, and will provide a detailed action plan that includes the affected network entities, the severity of the event, and precisely what went wrong.

The roadmap also includes expanding the system’s data sources and use cases. Integration with additional internal network systems will provide more comprehensive network insights. The team is also working on developing more sophisticated troubleshooting features, using the growing knowledge base and agentic capabilities to provide increasingly detailed guidance to engineers.

Additionally, Swisscom is adopting infrastructure as code (IaC) principles by implementing the solution using AWS CloudFormation. This approach introduces automated and consistent deployments while providing version control of infrastructure components, facilitating simpler scaling and management of the Network Assistant solution as it grows.

Conclusion

The Network Assistant represents a significant advancement in how Swisscom can manage its network operations. By using AWS services and implementing a sophisticated AI-powered solution, they have successfully addressed the challenges of manual data retrieval and analysis. As a result, they have boosted both accuracy and efficiency so network engineers can respond quickly and decisively to network events. The solution’s success is reflected not only in the quantifiable time and cost savings but also in its potential for future expansion. The serverless architecture and multi-agent approach provide a solid foundation for adding new capabilities and scaling across different teams and use cases. As organizations worldwide grapple with similar challenges in network operations, Swisscom’s implementation serves as a valuable blueprint for using cloud services and AI to transform traditional operations. The combination of Amazon Bedrock with careful attention to data security and accuracy demonstrates how modern AI solutions can help solve real-world engineering challenges.

As managing network operations complexity continues to grow, the lessons from Swisscom’s journey can be applied to many engineering disciplines. We encourage you to consider how Amazon Bedrock and similar AI solutions might help your organization overcome its own comprehension and process improvement barriers. To learn more about implementing generative AI in your workflows, explore Amazon Bedrock Resources or contact AWS.

Additional resources

For more information about Amazon Bedrock Agents and its use cases, refer to the following resources:


About the authors

Pablo García Benedicto is an experienced Data & AI Cloud Engineer with strong expertise in cloud hyperscalers and data engineering. With a background in telecommunications, he currently works at Swisscom, where he leads and contributes to projects involving Generative AI applications and agents using Amazon Bedrock. Aiming for AI and data specialization, his latest projects focus on building intelligent assistants and autonomous agents that streamline business information retrieval, leveraging cloud-native architectures and scalable data pipelines to reduce toil and drive operational efficiency.

Rajesh Sripathi is a Generative AI Specialist Solutions Architect at AWS, where he partners with global Telecommunication and Retail & CPG customers to develop and scale generative AI applications. With over 18 years of experience in the IT industry, Rajesh helps organizations use cutting-edge cloud and AI technologies for business transformation. Outside of work, he enjoys exploring new destinations through his passion for travel and driving.

Ruben Merz is a Principal Solutions Architect at AWS. With a background in distributed systems and networking, his work with customers at AWS focuses on digital sovereignty, AI, and networking.

Jordi Montoliu Nerin is a Data & AI Leader currently serving as Senior AI/ML Specialist at AWS, where he helps worldwide telecommunications customers implement AI strategies after previously driving Data & Analytics business across EMEA regions. He has over 10 years of experience, during which he has led multiple Data & AI implementations at scale, led executions of data strategy and data governance frameworks, and driven strategic technical and business development programs across multiple industries and continents. Outside of work, he enjoys sports, cooking and traveling.

Read More

End-to-End model training and deployment with Amazon SageMaker Unified Studio

End-to-End model training and deployment with Amazon SageMaker Unified Studio

Although rapid generative AI advancements are revolutionizing organizational natural language processing tasks, developers and data scientists face significant challenges customizing these large models. These hurdles include managing complex workflows, efficiently preparing large datasets for fine-tuning, implementing sophisticated fine-tuning techniques while optimizing computational resources, consistently tracking model performance, and achieving reliable, scalable deployment. The fragmented nature of these tasks often leads to reduced productivity, increased development time, and potential inconsistencies in the model development pipeline. Organizations need a unified, streamlined approach that simplifies the entire process from data preparation to model deployment.

To address these challenges, AWS has expanded Amazon SageMaker with a comprehensive set of data, analytics, and generative AI capabilities. At the heart of this expansion is Amazon SageMaker Unified Studio, a centralized service that serves as a single integrated development environment (IDE). SageMaker Unified Studio streamlines access to familiar tools and functionality from purpose-built AWS analytics and artificial intelligence and machine learning (AI/ML) services, including Amazon EMR, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Bedrock, and Amazon SageMaker AI. With SageMaker Unified Studio, you can discover data through Amazon SageMaker Catalog, access it from Amazon SageMaker Lakehouse, select foundation models (FMs) from Amazon SageMaker JumpStart or build them through JupyterLab, train and fine-tune them with SageMaker AI training infrastructure, and deploy and test models directly within the same environment. SageMaker AI is a fully managed service to build, train, and deploy ML models—including FMs—for different use cases by bringing together a broad set of tools to enable high-performance, low-cost ML. It’s available as a standalone service on the AWS Management Console, or through APIs. Model development capabilities from SageMaker AI are available within SageMaker Unified Studio.

In this post, we guide you through the stages of customizing large language models (LLMs) with SageMaker Unified Studio and SageMaker AI, covering the end-to-end process starting from data discovery to fine-tuning FMs with SageMaker AI distributed training, tracking metrics using MLflow, and then deploying models using SageMaker AI inference for real-time inference. We also discuss best practices to choose the right instance size and share some debugging best practices while working with JupyterLab notebooks in SageMaker Unified Studio.

Solution overview

The following diagram illustrates the solution architecture. There are three personas: admin, data engineer, and user; the user can be a data scientist or an ML engineer.

AWS SageMaker Unified Studio ML workflow showing data processing, model training, and deployment stages

Setting up the solution consists of the following steps:

  1. The admin sets up the SageMaker Unified Studio domain for the user and sets the access controls. The admin also publishes the data to SageMaker Catalog in SageMaker Lakehouse.
  2. Data engineers can create and manage extract, transform, and load (ETL) pipelines directly within Unified Studio using Visual ETL. They can transform raw data sources into datasets ready for exploratory data analysis. The admin can then manage the publication of these assets to the SageMaker Catalog, making them discoverable and accessible to other team members or users such as data engineers in the organization.
  3. Users or data engineers can log in to the Unified Studio web-based IDE using the login provided by the admin to create a project and create a managed MLflow server for tracking experiments. Users can discover available data assets in the SageMaker Catalog and request a subscription to an asset published by the data engineer. After the data engineer approves the subscription request, the user performs an exploratory data analysis of the content of the table with the query editor or with a JupyterLab notebook, then prepares the dataset by connecting with SageMaker Catalog through an AWS Glue or Athena connection.
  4. You can explore models from SageMaker JumpStart, which hosts over 200 models for various tasks, and fine-tune directly with the UI, or develop a training script for fine-tuning the LLM in the JupyterLab IDE. SageMaker AI provides distributed training libraries and supports various distributed training options for deep learning tasks. For this post, we use the PyTorch framework and use Hugging Face open source FMs for fine-tuning. We show you how to use parameter-efficient fine-tuning (PEFT) with Low-Rank Adaptation (LoRA), where you freeze the base model weights, train small low-rank adapter matrices, and then merge these LoRA adapters back into the base model after distributed training.
  5. You can track and monitor fine-tuning metrics directly in SageMaker Unified Studio using MLflow, by analyzing metrics such as loss to make sure the model is correctly fine-tuned.
  6. You can deploy the model to a SageMaker AI endpoint after the fine-tuning job is complete and test it directly from SageMaker Unified Studio.

Prerequisites

Before starting this tutorial, make sure you have the following:

Set up SageMaker Unified Studio and configure user access

SageMaker Unified Studio is built on top of Amazon DataZone capabilities such as domains to organize your assets and users, and projects to collaborate with other users, securely share artifacts, and seamlessly work across compute services.

To set up Unified Studio, complete the following steps:

  1. As an admin, create a SageMaker Unified Studio domain, and note the URL.
  2. On the domain’s details page, on the User management tab, choose Configure SSO user access. For this post, we recommend setting up single sign-on (SSO) access using the URL.

For more information about setting up user access, see Managing users in Amazon SageMaker Unified Studio.

Log in to SageMaker Unified Studio

Now that you have created your new SageMaker Unified Studio domain, complete the following steps to access SageMaker Unified Studio:

  1. On the SageMaker console, open the details page of your domain.
  2. Choose the link for the SageMaker Unified Studio URL.
  3. Log in with your SSO credentials.

Now you’re signed in to SageMaker Unified Studio.

Create a project

The next step is to create a project. Complete the following steps:

  1. In SageMaker Unified Studio, choose Select a project on the top menu, and choose Create project.
  2. For Project name, enter a name (for example, demo).
  3. For Project profile, choose your profile capabilities. A project profile is a collection of blueprints, which are configurations used to create projects. For this post, we choose All capabilities, then choose Continue.
Create project

Creating a project in Amazon SageMaker Unified Studio

Create a compute space

SageMaker Unified Studio provides compute spaces for IDEs that you can use to code and develop your resources. By default, it creates a space for you to get started with your project. You can find the default space by choosing Compute in the navigation pane and choosing the Spaces tab. You can then choose Open to go to the JupyterLab environment and add members to this space. You can also create a new space by choosing Create space on the Spaces tab.

To use SageMaker Studio notebooks cost-effectively, use smaller, general-purpose instances (like the T or M families) for interactive data exploration and prototyping. For heavy lifting like training, large-scale processing, or deployment, use SageMaker AI training jobs and SageMaker AI inference to offload the work to separate, more powerful instances such as the P5 family. We show in the notebook how you can run training jobs and deploy LLMs through APIs. Running distributed workloads directly in notebook instances is not recommended: JupyterLab notebooks are not designed for large distributed data processing or ML training workloads, and kernel failures are likely.

The following screenshot shows the configuration options for your space. You can change your instance size from the default (ml.t3.medium) to a larger size such as ml.m5.xlarge for the JupyterLab IDE. You can also increase the Amazon Elastic Block Store (Amazon EBS) volume capacity from 16 GB to 50 GB for training LLMs.

Configure space

Configure space in Amazon SageMaker Unified Studio

Set up MLflow to track ML experiments

You can use MLflow in SageMaker Unified Studio to create, manage, analyze, and compare ML experiments. Complete the following steps to set up MLflow:

  1. In SageMaker Unified Studio, choose Compute in the navigation pane.
  2. On the MLflow Tracking Servers tab, choose Create MLflow Tracking Server.
  3. Provide a name and create your tracking server.
  4. Choose Copy ARN to copy the Amazon Resource Name (ARN) of the tracking server.

You will need this MLflow ARN in your notebook to set up distributed training experiment tracking.

Set up the data catalog

For model fine-tuning, you need access to a dataset. After you set up the environment, the next step is to find the relevant data from the SageMaker Unified Studio data catalog and prepare the data for model tuning. For this post, we use the Stanford Question Answering Dataset (SQuAD) dataset. This dataset is a reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

Download the SQuAD dataset and upload it to SageMaker Lakehouse by following the steps in Uploading data.

Adding data to Catalog in Amazon SageMaker Unified Studio

To make this data discoverable by the users or ML engineers, the admin needs to publish this data to the Data Catalog. For this post, you can directly download the SQuAD dataset and upload it to the catalog. To learn how to publish the dataset to SageMaker Catalog, see Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory.

Query data with the query editor and JupyterLab

In many organizations, data preparation is a collaborative effort. A data engineer might prepare an initial raw dataset, which a data scientist then refines and augments with feature engineering before using it for model training. In the SageMaker Lakehouse data and model catalog, publishers configure subscriptions for automatic or manual approval (the latter waits for admin approval). Because you already set up the data in the previous section, you can skip this section, which shows how to subscribe to the dataset.

To subscribe to another dataset like SQuAD, open the data and model catalog in Amazon SageMaker Lakehouse, choose SQuAD, and subscribe.

Subscribing to any asset or dataset published by Admin

Next, let’s use the data explorer to explore the dataset you subscribed to. Complete the following steps:

  1. On the project page, choose Data.
  2. Under Lakehouse, expand AwsDataCatalog.
  3. Expand your database starting from glue_db_.
  4. Choose the dataset you created (starting with squad) and choose Query with Athena.
Querying the data using the query editor in Amazon SageMaker Unified Studio

Process your data through a multi-compute JupyterLab IDE notebook

SageMaker Unified Studio provides a unified JupyterLab experience across different languages, including SQL, PySpark, Python, and Scala Spark. It also supports unified access across different compute runtimes such as Amazon Redshift and Athena for SQL, Amazon EMR Serverless, Amazon EMR on EC2, and AWS Glue for Spark.

Complete the following steps to get started with the unified JupyterLab experience:

  1. Open your SageMaker Unified Studio project page.
  2. On the top menu, choose Build, and under IDE & APPLICATIONS, choose JupyterLab.
  3. Wait for the space to be ready.
  4. Choose the plus sign and for Notebook, choose Python 3.
  5. Open a new terminal and enter git clone https://github.com/aws-samples/amazon-sagemaker-generativeai.
  6. Go to the folder amazon-sagemaker-generativeai/3_distributed_training/distributed_training_sm_unified_studio/ and open the distributed training in unified studio.ipynb notebook to get started.
  7. Enter the MLflow server ARN you created in the following code:
import os
os.environ["mlflow_uri"] = ""
os.environ["mlflow_experiment_name"] = "deepseek-r1-distill-llama-8b-sft"

Now you can visualize the data through the notebook.

  1. On the project page, choose Data.
  2. Under Lakehouse, expand AwsDataCatalog.
  3. Expand your database starting from glue_db, copy the name of the database, and enter it in the following code:
db_name = "<enter your db name>"
table = "sqad"
  4. You can now access the entire dataset directly by using the in-line SQL query capabilities of JupyterLab notebooks in SageMaker Unified Studio. You can follow the data preprocessing steps in the notebook.
%%sql project.athena
SELECT * FROM "<DATABASE_NAME>"."sqad";

The following screenshot shows the output.

We split the dataset into a training set and a test set for model training. When the data processing is done and we have split the data into test and training sets, the next step is to perform fine-tuning of the model using SageMaker distributed training.
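Before moving on, the following is a minimal sketch of that split step, assuming the query result has been loaded into a pandas DataFrame named df; the split ratio and local paths are illustrative.

```
from datasets import Dataset

# Assumes `df` holds the query result as a pandas DataFrame.
dataset = Dataset.from_pandas(df)
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]

# Persist locally (or upload to Amazon S3) so the training job can read them.
train_dataset.save_to_disk("./data/train")
test_dataset.save_to_disk("./data/test")
```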

Fine-tune the model with SageMaker Distributed training

You’re now ready to fine-tune your model by using SageMaker AI capabilities for training. Amazon SageMaker Training is a fully managed ML service offered by SageMaker that helps you efficiently train a wide range of ML models at scale. The core of SageMaker AI jobs is the containerization of ML workloads and the capability of managing AWS compute resources. SageMaker Training takes care of the heavy lifting associated with setting up and managing infrastructure for ML training workloads.

We select one model directly from the Hugging Face Hub, DeepSeek-R1-Distill-Llama-8B, and develop our training script in the JupyterLab space. Because we want to distribute the training across all the available GPUs in our instance by using PyTorch Fully Sharded Data Parallel (FSDP), we use the Hugging Face Accelerate library to run the same PyTorch code across distributed configurations. You can start the fine-tuning job directly in your JupyterLab notebook or use the SageMaker Python SDK to start the training job. We use the Trainer from transformers to fine-tune our model. We prepared the script train.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training.
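The following is a minimal sketch of the LoRA setup that a script like train.py typically contains; the target modules listed are common choices for Llama-style models and are an assumption, not necessarily what the sample script uses.

```
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")

# Freeze the base weights and attach small low-rank adapters.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# After training, the adapters can be merged back into the base model:
# merged_model = model.merge_and_unload()
```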

For configuration, we use TrlParser, and provide hyperparameters in a YAML file. You can upload this file and provide it to SageMaker similar to your datasets. The following is the config file for fine-tuning the model on ml.g5.12xlarge. Save the config file as args.yaml and upload it to Amazon Simple Storage Service (Amazon S3).

cat > ./args.yaml <<EOF
model_id: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"       # Hugging Face model id
mlflow_uri: "${mlflow_uri}"
mlflow_experiment_name: "${mlflow_experiment_name}"
# sagemaker specific parameters
output_dir: "/opt/ml/model"                       # path to where SageMaker will upload the model 
train_dataset_path: "/opt/ml/input/data/train/"   # path to where FSx saves train dataset
test_dataset_path: "/opt/ml/input/data/test/"     # path to where FSx saves test dataset
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1                 
learning_rate: 2e-4                    # learning rate scheduler
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 2         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config: 
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true
merge_weights: true                    # merge weights in the base model
EOF

Use the following code to use the native PyTorch container image, pre-built for SageMaker:

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.6.0",
    instance_type=instance_type,
    image_scope="training"
)

image_uri

Define the trainer as follows:

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    distributed=Torchrun(),
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=7200
    ),
    hyperparameters={
        "config": "/opt/ml/input/data/config/args.yaml" # path to TRL config which was uploaded to s3
    },
    output_data_config=OutputDataConfig(
        s3_output_path=output_path
    ),
)
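The preceding snippet references source_code, compute_configs, and the data channels passed to the training job, which are defined earlier in the notebook. As a rough sketch only, the following shows how those objects might look with the newer ModelTrainer interface of the SageMaker Python SDK; the module path, field names, and S3 locations are assumptions that may differ across SDK versions.

```
from sagemaker.modules.configs import Compute, InputData, SourceCode

source_code = SourceCode(
    source_dir="scripts",             # local folder containing train.py (assumed)
    requirements="requirements.txt",
    entry_script="train.py",
)

compute_configs = Compute(
    instance_type="ml.g5.12xlarge",
    instance_count=1,
)

data = [
    InputData(channel_name="train", data_source=f"s3://{bucket_name}/datasets/train/"),
    InputData(channel_name="test", data_source=f"s3://{bucket_name}/datasets/test/"),
    InputData(channel_name="config", data_source=f"s3://{bucket_name}/config/args.yaml"),
]
```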

Run the trainer with the following:

# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

You can follow the steps in the notebook.

You can explore the job execution in SageMaker Unified Studio. The training job runs on the SageMaker training cluster by distributing the computation across the four available GPUs on the selected instance type ml.g5.12xlarge. We choose to merge the LoRA adapter with the base model. This decision was made during the training process by setting the merge_weights parameter to True in our train_fn() function. Merging the weights provides a single, cohesive model that incorporates both the base knowledge and the domain-specific adaptations we’ve made through fine-tuning.

Track training metrics and model registration using MLflow

You created an MLflow server in an earlier step to track experiments and registered models, and provided the server ARN in the notebook.

You can log MLflow models and automatically register them with Amazon SageMaker Model Registry using either the Python SDK or directly through the MLflow UI. Use mlflow.register_model() to automatically register a model with SageMaker Model Registry during model training. You can explore the MLflow tracking code in train.py and the notebook. The training code tracks MLflow experiments and registers the model to the MLflow model registry. To learn more, see Automatically register SageMaker AI models with SageMaker Model Registry.
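As a minimal sketch of that registration flow (the run contents and registered model name are hypothetical, and a model artifact is assumed to have been logged under the "model" artifact path):

```
import mlflow

with mlflow.start_run() as run:
    mlflow.log_metric("train_loss", 0.42)  # placeholder; real runs log actual metrics
    # ... a model is assumed to be logged under the "model" artifact path here,
    # for example with mlflow.transformers.log_model(..., artifact_path="model")

mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="deepseek-r1-distill-llama-8b-sft",  # hypothetical registry name
)
```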

To see the logs, complete the following steps:

  1. Choose Build, then choose Spaces.
  2. Choose Compute in the navigation pane.
  3. On the MLflow Tracking Servers tab, choose Open to open the tracking server.

You can see both the experiments and registered models.

Deploy and test the model using SageMaker AI Inference

When deploying a fine-tuned model on AWS, SageMaker AI Inference offers multiple deployment strategies. In this post, we use SageMaker real-time inference. A real-time inference endpoint gives you full control over the inference resources. You can use a set of available instances and deployment options for hosting your model. By using the SageMaker built-in container DJL Serving, you can take advantage of the inference script and optimization options available directly in the container. In this post, we deploy the fine-tuned model to a SageMaker endpoint for running inference, which will be used for testing the model.

In SageMaker Unified Studio, in JupyterLab, we create the Model object, which is a high-level SageMaker model class for working with multiple container options. The image_uri parameter specifies the container image URI for the model, and model_data points to the Amazon S3 location containing the model artifact (automatically uploaded by the SageMaker training job). We also specify a set of environment variables to configure the specific inference backend option (OPTION_ROLLING_BATCH), the degree of tensor parallelism based on the number of available GPUs (OPTION_TENSOR_PARALLEL_DEGREE), and the maximum allowable length of input sequences (in tokens) for models during inference (OPTION_MAX_MODEL_LEN).

model = Model(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model",
        'OPTION_TRUST_REMOTE_CODE': 'true',
        'OPTION_ROLLING_BATCH': "vllm",
        'OPTION_DTYPE': 'bf16',
        'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
        'OPTION_MAX_ROLLING_BATCH_SIZE': '1',
        'OPTION_MODEL_LOADING_TIMEOUT': '3600',
        'OPTION_MAX_MODEL_LEN': '4096'
    }
)

After you create the model object, you can deploy it to an endpoint using the deploy method. The initial_instance_count and instance_type parameters specify the number and type of instances to use for the endpoint. We selected the ml.g5.4xlarge instance for the endpoint. The container_startup_health_check_timeout and model_data_download_timeout parameters set the timeout values for the container startup health check and model data download, respectively.

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
endpoint_name = f"{model_id.split('/')[-1].replace('.', '-')}-sft-djl"
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=1800,
    model_data_download_timeout=3600
)

It takes a few minutes to deploy the model before it becomes available for inference and evaluation. You can test the endpoint invocation in JupyterLab by using the AWS SDK with the boto3 client for sagemaker-runtime, or by using the SageMaker Python SDK and the previously created predictor through the predict API.

base_prompt = f"""<s> [INST] {{question}} [/INST] """

prompt = base_prompt.format(
    question="What statue is in front of the Notre Dame building?"
)

predictor.predict({
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 300,
        "temperature": 0.2,
        "top_p": 0.9,
        "return_full_text": False,
        "stop": ['</s>']
    }
})
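Alternatively, as mentioned above, you can call the endpoint through the low-level sagemaker-runtime client. The following sketch reuses the endpoint_name and prompt variables defined earlier.

```
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,               # created in the deploy step
    ContentType="application/json",
    Body=json.dumps({
        "inputs": prompt,
        "parameters": {"max_new_tokens": 300, "temperature": 0.2},
    }),
)
print(json.loads(response["Body"].read()))
```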

You can also test the model invocation in SageMaker Unified Studio, on the Inference endpoint page and Text inference tab.

Troubleshooting

You might encounter some of the following errors while running your model training and deployment:

  • Training job fails to start – If a training job fails to start, make sure your IAM role AmazonSageMakerDomainExecution has the necessary permissions, verify the instance type is available in your AWS Region, and check your S3 bucket permissions. This role is created when an admin creates the domain, and you can ask the admin to check your IAM access permissions associated with this role.
  • Out-of-memory errors during training – If you encounter out-of-memory errors during training, try reducing the batch size, use gradient accumulation to simulate larger batches, or consider using a larger instance.
  • Slow model deployment – For slow model deployment, make sure model artifacts aren’t excessively large, and use appropriate instance types for inference and capacity available for that instance in your Region.

For more troubleshooting tips, refer to Troubleshooting guide.

Clean up

SageMaker Unified Studio by default shuts down idle resources such as JupyterLab spaces after 1 hour. However, you must delete the S3 bucket and the hosted model endpoint to stop incurring costs. You can delete the real-time endpoints you created using the SageMaker console. For instructions, see Delete Endpoints and Resources.

Conclusion

This post demonstrated how SageMaker Unified Studio serves as a powerful centralized service for data and AI workflows, showcasing its seamless integration capabilities throughout the fine-tuning process. With SageMaker Unified Studio, data engineers and ML practitioners can efficiently discover and access data through SageMaker Catalog, prepare datasets, fine-tune models, and deploy them—all within a single, unified environment. The service’s direct integration with SageMaker AI and various AWS analytics services streamlines the development process, alleviating the need to switch between multiple tools and environments. The solution highlights the service’s versatility in handling complex ML workflows, from data discovery and preparation to model deployment, while maintaining a cohesive and intuitive user experience. Through features like integrated MLflow tracking, built-in model monitoring, and flexible deployment options, SageMaker Unified Studio demonstrates its capability to support sophisticated AI/ML projects at scale.

To learn more about SageMaker Unified Studio, see An integrated experience for all your data and AI with Amazon SageMaker Unified Studio.

If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!


About the authors

Mona Mona currently works as a Sr World Wide Gen AI Specialist Solutions Architect at Amazon focusing on Gen AI Solutions. She was a Lead Generative AI specialist in Google Public Sector at Google before joining Amazon. She is a published author of two books – Natural Language Processing with AWS AI Services and Google Cloud Certified Professional Machine Learning Study Guide. She has authored 19 blogs on AI/ML and cloud technology and a co-author on a research paper on CORD19 Neural Search which won an award for Best Research Paper at the prestigious AAAI (Association for the Advancement of Artificial Intelligence) conference.

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Lauren Mullennex is a Senior GenAI/ML Specialist Solutions Architect at AWS. She has a decade of experience in DevOps, infrastructure, and ML. Her areas of focus include MLOps/LLMOps, generative AI, and computer vision.

Read More

Optimize RAG in production environments using Amazon SageMaker JumpStart and Amazon OpenSearch Service

Optimize RAG in production environments using Amazon SageMaker JumpStart and Amazon OpenSearch Service

Generative AI has revolutionized customer interactions across industries by offering personalized, intuitive experiences powered by unprecedented access to information. This transformation is further enhanced by Retrieval Augmented Generation (RAG), a technique that allows large language models (LLMs) to reference external knowledge sources beyond their training data. RAG has gained popularity for its ability to improve generative AI applications by incorporating additional information, often preferred by customers over techniques like fine-tuning due to its cost-effectiveness and faster iteration cycles.

The RAG approach excels in grounding language generation with external knowledge, producing more factual, coherent, and relevant responses. This capability proves invaluable in applications such as question answering, dialogue systems, and content generation, where accuracy and informative outputs are crucial. For businesses, RAG offers a powerful way to use internal knowledge by connecting company documentation to a generative AI model. When an employee asks a question, the RAG system retrieves relevant information from the company’s internal documents and uses this context to generate an accurate, company-specific response. This approach enhances the understanding and usage of internal company documents and reports. By extracting relevant context from corporate knowledge bases, RAG models facilitate tasks like summarization, information extraction, and complex question answering on domain-specific materials, enabling employees to quickly access vital insights from vast internal resources. This integration of AI with proprietary information can significantly improve efficiency, decision-making, and knowledge sharing across the organization.

A typical RAG workflow consists of four key components: input prompt, document retrieval, contextual generation, and output. The process begins with a user query, which is used to search a comprehensive knowledge corpus. Relevant documents are then retrieved and combined with the original query to provide additional context for the LLM. This enriched input allows the model to generate more accurate and contextually appropriate responses. RAG’s popularity stems from its ability to use frequently updated external data, providing dynamic outputs without the need for costly and compute-intensive model retraining.

To implement RAG effectively, many organizations turn to platforms like Amazon SageMaker JumpStart. This service offers numerous advantages for building and deploying generative AI applications, including access to a wide range of pre-trained models with ready-to-use artifacts, a user-friendly interface, and seamless scalability within the AWS ecosystem. By using pre-trained models and optimized hardware, SageMaker JumpStart enables rapid deployment of both LLMs and embedding models, minimizing the time spent on complex scalability configurations.

In the previous post, we showed how to build a RAG application on SageMaker JumpStart using Facebook AI Similarity Search (Faiss). In this post, we show how to use Amazon OpenSearch Service as a vector store to build an efficient RAG application.

Solution overview

To implement our RAG workflow on SageMaker, we use a popular open source Python library known as LangChain. With LangChain, the RAG components are simplified into independent blocks that you can bring together using a chain object that will encapsulate the entire workflow. The solution consists of the following key components:

  • LLM (inference) – We need an LLM that will do the actual inference and answer the end-user’s initial prompt. For our use case, we use Meta Llama3 for this component. LangChain comes with a default wrapper class for SageMaker endpoints with which we can simply pass in the endpoint name to define an LLM object in the library.
  • Embeddings model – We need an embeddings model to convert our document corpus into textual embeddings. This is necessary for when we’re doing a similarity search on the input text to see what documents share similarities or contain the information to help augment our response. For this post, we use the BGE Hugging Face Embeddings model available in SageMaker JumpStart.
  • Vector store and retriever – To house the different embeddings we have generated, we use a vector store. In this case, we use OpenSearch Service, which allows for similarity search using k-nearest neighbors (k-NN) as well as traditional lexical search. Within our chain object, we define the vector store as the retriever. You can tune this depending on how many documents you want to retrieve.

The following diagram illustrates the solution architecture.

In the following sections, we walk through setting up OpenSearch, followed by exploring the notebook that implements a RAG solution with LangChain, Amazon SageMaker AI, and OpenSearch Service.
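To make these building blocks concrete before diving into the notebook, the following is a minimal local sketch of the vector store and retriever pieces using LangChain; for brevity, a small local Hugging Face embeddings model stands in for the SageMaker JumpStart BGE endpoint, and the domain endpoint, credentials, and index name are placeholders.

```
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch

# Local embeddings model used here for brevity; the notebook uses a
# SageMaker JumpStart BGE endpoint instead.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

vector_store = OpenSearchVectorSearch(
    opensearch_url="https://<opensearch-domain-endpoint>:443",
    index_name="rag-demo-index",               # placeholder index name
    embedding_function=embeddings,
    http_auth=("<master-user>", "<password>"),
)

# Index a document, then retrieve the most similar chunks for a query.
vector_store.add_texts(["Amazon OpenSearch Service supports k-NN vector search."])
docs = vector_store.similarity_search("Which AWS service supports k-NN search?", k=3)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # plugged into the chain
print(docs[0].page_content)
```

In the notebook, the retriever is combined with the Llama 3 endpoint inside a LangChain chain object so retrieved context is injected into each prompt.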

Benefits of using OpenSearch Service as a vector store for RAG

In this post, we showcase how you can use a vector store such as OpenSearch Service as a knowledge base and embedding store. OpenSearch Service offers several advantages when used for RAG in conjunction with SageMaker AI:

  • Performance – Efficiently handles large-scale data and search operations
  • Advanced search – Offers full-text search, relevance scoring, and semantic capabilities
  • AWS integration – Seamlessly integrates with SageMaker AI and other AWS services
  • Real-time updates – Supports continuous knowledge base updates with minimal delay
  • Customization – Allows fine-tuning of search relevance for optimal context retrieval
  • Reliability – Provides high availability and fault tolerance through a distributed architecture
  • Analytics – Provides analytical features for data understanding and performance improvement
  • Security – Offers robust features such as encryption, access control, and audit logging
  • Cost-effectiveness – Serves as an economical solution compared to proprietary vector databases
  • Flexibility – Supports various data types and search algorithms, offering versatile storage and retrieval options for RAG applications

You can use SageMaker AI with OpenSearch Service to create powerful and efficient RAG systems. SageMaker AI provides the machine learning (ML) infrastructure for training and deploying your language models, and OpenSearch Service serves as an efficient and scalable knowledge base for retrieval.

OpenSearch Service optimization strategies for RAG

Based on our learnings from the hundreds of RAG applications deployed using OpenSearch Service as a vector store, we’ve developed several best practices:

  • If you are starting from a clean slate and want to move quickly with something simple, scalable, and high-performing, we recommend using an Amazon OpenSearch Serverless vector store collection. With OpenSearch Serverless, you benefit from automatic scaling of resources, decoupling of storage, indexing compute, and search compute, with no node or shard management, and you only pay for what you use.
  • If you have a large-scale production workload and want to take the time to tune for the best price-performance and the most flexibility, you can use an OpenSearch Service managed cluster. In a managed cluster, you pick the node type, node size, number of nodes, and number of shards and replicas, and you have more control over when to scale your resources. For more details on best practices for operating an OpenSearch Service managed cluster, see Operational best practices for Amazon OpenSearch Service.
  • OpenSearch supports both exact k-NN and approximate k-NN. Use exact k-NN if the number of documents or vectors in your corpus is less than 50,000 for the best recall. For use cases where the number of vectors is greater than 50,000, exact k-NN will still provide the best recall but might not provide sub-100 millisecond query performance. Use approximate k-NN in use cases above 50,000 vectors for the best performance.
  • OpenSearch uses algorithms from the NMSLIB, Faiss, and Lucene libraries to power approximate k-NN search. There are pros and cons to each k-NN engine, but we find that most customers choose Faiss due to its overall performance in both indexing and search as well as the variety of different quantization and algorithm options that are supported and the broad community support.
  • Within the Faiss engine, OpenSearch supports both Hierarchical Navigable Small World (HNSW) and Inverted File System (IVF) algorithms. Most customers find HNSW to have better recall than IVF and choose it for their RAG use cases. To learn more about the differences between these engine algorithms, see Vector search.
  • To reduce the memory footprint and lower the cost of the vector store while keeping recall high, you can start with Faiss HNSW 16-bit scalar quantization. This can also reduce search latencies and improve indexing throughput when used with SIMD optimization. An example index mapping is shown after this list.
  • If using an OpenSearch Service managed cluster, refer to Performance tuning for additional recommendations.
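As a concrete, purely illustrative example of the Faiss HNSW and 16-bit scalar quantization guidance above, the following sketch shows what a k-NN index mapping might look like. The index name, field name, and HNSW parameters are assumptions; the dimension should match your embeddings model (1,024 for BGE large).

# Illustrative k-NN index mapping using the Faiss engine with HNSW and fp16 scalar quantization
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "vector_field": {
                "type": "knn_vector",
                "dimension": 1024,  # matches the BGE large embeddings model
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "m": 16,
                        "ef_construction": 128,
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}},
                    },
                },
            }
        }
    },
}
# opensearch_client.indices.create(index="rag-documents", body=index_body)  # opensearch-py client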

Prerequisites

Make sure you have access to one ml.g5.4xlarge instance and one ml.g5.2xlarge instance in your account. The secret must be created in the same AWS Region where the stack is deployed. Complete the following prerequisite steps to create a secret using AWS Secrets Manager:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. For Secret type, select Other type of secret.
  4. For Key/value pairs, on the Plaintext tab, enter a complex password.
  5. Choose Next.
  6. For Secret name, enter a name for your secret.
  7. Choose Next.
  8. Under Configure rotation, keep the settings as default and choose Next.
  9. Choose Store to save your secret.
  10. On the secret details page, note the secret Amazon Resource Name (ARN) to use in the next step.
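If you prefer to script this step, the following boto3 sketch creates an equivalent secret. The secret name is illustrative, and the secret value is the plaintext password, matching what the console steps store on the Plaintext tab.

import boto3

# Sketch only: create the secret programmatically instead of through the console
secretsmanager = boto3.client("secretsmanager")
response = secretsmanager.create_secret(
    Name="rag-opensearch-password",  # illustrative name
    SecretString="REPLACE_WITH_A_COMPLEX_PASSWORD",
)
print(response["ARN"])  # use this ARN as the CloudFormation parameter in the next section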

Create an OpenSearch Service cluster and SageMaker notebook

We use AWS CloudFormation to deploy our OpenSearch Service cluster, SageMaker notebook, and other resources. Complete the following steps:

  1. Launch the following CloudFormation template.
  2. Provide the ARN of the secret you created as a prerequisite and keep the other parameters as default.
  3. Choose Create to create your stack, and wait for the stack creation to complete (about 20 minutes).
  4. When the status of the stack is CREATE_COMPLETE, note the value of OpenSearchDomainEndpoint on the stack Outputs tab.
  5. Locate SageMakerNotebookURL in the outputs and choose the link to open the SageMaker notebook.

Run the SageMaker notebook

After you have launched the notebook in JupyterLab, complete the following steps:

  1. Go to genai-recipes/RAG-recipes/llama3-RAG-Opensearch-langchain-SMJS.ipynb.

You can also clone the notebook from the GitHub repo.

  2. Update the value of OPENSEARCH_URL in the notebook with the value copied from OpenSearchDomainEndpoint in the previous step (look for os.environ['OPENSEARCH_URL'] = ""). The port must be 443.
  3. Run the cells in the notebook.

The notebook provides a detailed explanation of all the steps. We explain some of the key cells in the notebook in this section.

For the RAG workflow, we deploy the huggingface-sentencesimilarity-bge-large-en-v1-5 embedding model and meta-textgeneration-llama-3-8b-instruct LLM from Hugging Face. SageMaker JumpStart simplifies this process because the model artifacts, data, and container specifications are all prepackaged for optimal inference. These are then exposed using the SageMaker Python SDK high-level API calls, which let you specify the model ID for deployment to a SageMaker real-time endpoint:


from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgeneration-llama-3-8b-instruct"
accept_eula = True
model = JumpStartModel(model_id=model_id)
llm_predictor = model.deploy(accept_eula=accept_eula)

model_id = "huggingface-sentencesimilarity-bge-large-en-v1-5"
text_embedding_model = JumpStartModel(model_id=model_id)
embedding_predictor = text_embedding_model.deploy()

Content handlers are crucial for formatting data for SageMaker endpoints. They transform inputs into the format expected by the model and handle model-specific parameters like temperature and token limits. These parameters can be tuned to control the creativity and consistency of the model’s responses.

import json
from langchain_community.llms.sagemaker_endpoint import LLMContentHandler

class Llama38BContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Wrap the prompt and generation parameters in the JSON payload the endpoint expects
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 1000,
                "top_p": 0.9,
                "temperature": 0.6,
                "stop": ["<|eot_id|>"],
            },
        }
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Parse the endpoint response; adjust the key if your endpoint's response schema differs
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_text"]
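To show where this handler plugs in, the following sketch (with an assumed Region and variable names) wires it to the Llama 3 endpoint deployed earlier so LangChain can call the endpoint as an LLM; this is the llm object used in the query examples later in the notebook.

from langchain_community.llms import SagemakerEndpoint

# Sketch only: attach the content handler to the JumpStart endpoint deployed earlier
content_handler = Llama38BContentHandler()
llm = SagemakerEndpoint(
    endpoint_name=llm_predictor.endpoint_name,  # endpoint created by JumpStartModel.deploy()
    region_name="us-east-1",  # assumption: use the Region where you deployed the endpoint
    content_handler=content_handler,
)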

We use PyPDFLoader from LangChain to load PDF files, attach metadata to each document fragment, and then use RecursiveCharacterTextSplitter to break the documents into smaller, manageable chunks. The text splitter is configured with a chunk size of 1,000 characters and an overlap of 100 characters, which helps maintain context between chunks. This preprocessing step is crucial for effective document retrieval and embedding generation, because it makes sure the text segments are appropriately sized for the embedding model and the language model used in the RAG system.

import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
documents = []
for idx, file in enumerate(filenames):
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = metadata[idx]
    documents += document
# - in our testing Character split works better with this PDF data set
text_splitter = RecursiveCharacterTextSplitter(
    # Chunk into 1,000-character segments with a 100-character overlap to preserve context
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
print(docs[100])

The following block initializes a vector store using OpenSearch Service for the RAG system. It converts preprocessed document chunks into vector embeddings using a SageMaker model and stores them in OpenSearch Service. The process is configured with security measures like SSL and authentication to provide secure data handling. The bulk insertion is optimized for performance with a sizeable batch size. Finally, the vector store is wrapped with VectorStoreIndexWrapper, providing a simplified interface for operations like querying and retrieval. This setup creates a searchable database of document embeddings, enabling quick and relevant context retrieval for user queries in the RAG pipeline.

import os
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_community.vectorstores import OpenSearchVectorSearch
from opensearchpy import RequestsHttpConnection

# Initialize OpenSearchVectorSearch
vectorstore_opensearch = OpenSearchVectorSearch.from_documents(
    docs,
    sagemaker_embeddings,  # embeddings object backed by the SageMaker BGE endpoint
    opensearch_url=os.environ["OPENSEARCH_URL"],  # OpenSearch Service domain endpoint (port 443)
    http_auth=awsauth,  # Auth will use the IAM role
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    bulk_size=2000,  # Increase this to accommodate the number of documents you have
)
# Wrap the OpenSearch vector store with the VectorStoreIndexWrapper
wrapper_store_opensearch = VectorStoreIndexWrapper(vectorstore=vectorstore_opensearch)

Next, we use the wrapper from the previous step along with the prompt template. We define the prompt template for interacting with the Meta Llama 3 8B Instruct model in the RAG system. The template uses specific tokens to structure the input in a way that the model expects. It sets up a conversation format with system instructions, user query, and a placeholder for the assistant’s response. The PromptTemplate class from LangChain is used to create a reusable prompt with a variable for the user’s query. This structured approach to prompt engineering helps maintain consistency in the model’s responses and guides it to act as a helpful assistant.

from langchain.prompts import PromptTemplate

prompt_template = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{query}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["query"]
)
query = "How did AWS perform in 2021?"

answer = wrapper_store_opensearch.query(question=PROMPT.format(query=query), llm=llm)
print(answer)

Similarly, the notebook also shows how to use RetrievalQA, where you can customize how the retrieved documents are added to the prompt using the chain_type parameter.
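A minimal sketch of this pattern, assuming the llm and vector store objects created earlier, might look like the following; with chain_type="stuff", the retrieved chunks are placed directly into the prompt.

from langchain.chains import RetrievalQA

# Sketch only: build a RetrievalQA chain on top of the OpenSearch-backed retriever
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # other options, such as map_reduce, change how documents are combined
    retriever=vectorstore_opensearch.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
)
result = qa.invoke({"query": "How did AWS perform in 2021?"})
print(result["result"])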

Clean up

Delete your SageMaker endpoints from the notebook to avoid incurring costs:

# Delete resources
llm_predictor.delete_model()
llm_predictor.delete_endpoint()
embedding_predictor.delete_model()
embedding_predictor.delete_endpoint()

Next, delete the CloudFormation stack to remove the OpenSearch Service domain and stop incurring additional charges:

aws cloudformation delete-stack --stack-name rag-opensearch

Conclusion

RAG has revolutionized how businesses use AI by enabling general-purpose language models to work seamlessly with company-specific data. The key benefit is the ability to create AI systems that combine broad knowledge with up-to-date, proprietary information without expensive model retraining. This approach transforms customer engagement and internal operations by delivering personalized, accurate, and timely responses based on the latest company data.

The RAG workflow—comprising input prompt, document retrieval, contextual generation, and output—allows businesses to tap into their vast repositories of internal documents, policies, and data, making this information readily accessible and actionable. For businesses, this means enhanced decision-making, improved customer service, and increased operational efficiency. Employees can quickly access relevant information, while customers receive more accurate and personalized responses.

Moreover, RAG's cost-efficiency and ability to rapidly iterate make it an attractive solution for businesses looking to stay competitive in the AI era without constant, expensive updates to their AI systems. By making general-purpose LLMs work effectively on proprietary data, RAG empowers businesses to create dynamic, knowledge-rich AI applications that evolve with their data, potentially transforming how companies operate, innovate, and engage with both employees and customers.

SageMaker JumpStart has streamlined the process of developing and deploying generative AI applications. It offers pre-trained models, user-friendly interfaces, and seamless scalability within the AWS ecosystem, making it straightforward for businesses to harness the power of RAG.

Furthermore, using OpenSearch Service as a vector store facilitates swift retrieval from vast information repositories. This approach not only enhances the speed and relevance of responses, but also helps manage costs and operational complexity effectively.

By combining these technologies, you can create robust, scalable, and efficient RAG systems that provide up-to-date, context-aware responses to customer queries, ultimately enhancing user experience and satisfaction.

To get started with implementing this RAG solution using Amazon SageMaker JumpStart and Amazon OpenSearch Service, check out the example notebook on GitHub. You can also learn more about Amazon OpenSearch Service in the developer guide.


About the authors

Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Raghu Ramesha is an ML Solutions Architect. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Sohaib Katariwala is a Sr. Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. His interests are in all things data and analytics. More specifically he loves to help customers use AI in their data strategy to solve modern day challenges.

Karan Jain is a Senior Machine Learning Specialist at AWS, where he leads the worldwide Go-To-Market strategy for Amazon SageMaker Inference. He helps customers accelerate their generative AI and ML journey on AWS by providing guidance on deployment, cost-optimization, and GTM strategy. He has led product, marketing, and business development efforts across industries for over 10 years, and is passionate about mapping complex service features to customer solutions.


Advancing AI agent governance with Boomi and AWS: A unified approach to observability and compliance

Just as APIs became the standard for integration, AI agents are transforming workflow automation through intelligent task coordination. AI agents are already enhancing decision-making and streamlining operations across enterprises. But as adoption accelerates, organizations face growing complexity in managing them at scale. Organizations struggle with observability and lifecycle management, finding it difficult to monitor performance and manage versions effectively. Governance and security concerns arise as these agents process sensitive data, which requires strict compliance and access controls. Perhaps most concerningly, without proper management, organizations face the risk of agent sprawl—the unchecked proliferation of AI agents leading to inefficiency and security vulnerabilities.

Boomi and AWS have collaborated to address the complexity surrounding AI agents with Agent Control Tower, an AI agent management solution developed by Boomi and tightly integrated with Amazon Bedrock. Agent Control Tower, part of the Boomi Agentstudio solution, provides the governance framework to manage this transformation, with capabilities that address both current and emerging compliance needs.

As a leader in enterprise iPaaS per Gartner’s Magic Quadrant, based on Completeness of Vision and Ability to Execute, Boomi serves over 20,000 enterprise customers, with three-quarters of these customers operating on AWS. This includes a significant presence among Fortune 500 and Global 2000 organizations across critical sectors such as healthcare, finance, technology, and manufacturing. Boomi is innovating with generative AI, with more than 2,000 customers using its AI agents. The convergence of capabilities that Boomi provides—spanning AI, integration, automation, API management, and data management—with AWS and its proven track record in reliability, security, and AI innovation creates a compelling foundation for standardized AI agent governance at scale. In this post, we share how Boomi partnered with AWS to help enterprises accelerate and scale AI adoption with confidence using Agent Control Tower.

A unified AI management solution

Built on AWS, Agent Control Tower uniquely delivers a single control plane for managing AI agents across multiple systems, including other cloud providers and on-premises environments. At its core, it offers comprehensive observability and monitoring, providing real-time performance tracking and deep visibility into agent decision-making and behavior.

The following screenshot showcases how users can view summary data across agent providers and add or manage providers.

AWS Agent Control Tower dashboard with color-coded provider clusters, node-size relationships, and integrated filtering for agent management

The following screenshot shows an example of the Monitoring and Compliance dashboard.

Monitoring dashboard displaying key AI agent performance indicators including active agents (2134), total tokens, average response time, and error rates. Features radar charts for AAGE1 scoring and graphs tracking invocations, token usage, and errors over time.

Agent Control Tower also provides a single pane of glass for visibility into the tools used by each agent, as illustrated in the following screenshot.

Split-screen view of agent visualization map and Dynamic Pricing Agent configuration panel

Agent Control Tower provides key governance and security controls such as centralized policy enforcement and role-based access control, and enables meeting regulatory compliance with frameworks like GDPR and HIPAA. Furthermore, its lifecycle management capabilities enable automated agent discovery, version tracking, and operational control through features such as pause and resume functionality. Agent Control Tower is positioned as one of the first, if not the first, unified solutions that provides full lifecycle AI agent management with integrated governance and orchestration features. Although many vendors focus on releasing AI agents, there are few that focus on solutions for managing, deploying, and governing AI agents at scale.

The following screenshot shows an example of how users can review agent details and disable or enable an agent.

Comprehensive agent configuration interface with a button to disable the agent and displaying integrated tools, monitoring tasks, security controls, and knowledge base

As shown in the following screenshot, users can drill down into details for each part of the agent.

Interactive workflow diagram showing relationships between tasks, instructions, and monitoring criteria for fraud detection

Amazon Bedrock: Enabling and enhancing AI governance

Using Amazon Bedrock, organizations can implement security guardrails and content moderation while maintaining the flexibility to select and switch between AI models for optimized performance and accuracy. Organizations can create and enable access to curated knowledge bases and predefined action groups, enabling sophisticated multi-agent collaboration. Amazon Bedrock also provides comprehensive metrics and trace logs for agents to help facilitate complete transparency and accountability in agent operations. Through deep integration with Amazon Bedrock, Boomi’s Agent Control Tower enhances agent transparency and governance, offering a unified, actionable view of agent configurations and activities across environments.
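To make the trace capability concrete, the following boto3 sketch (with placeholder agent and alias IDs, and not part of Agent Control Tower itself) shows how an Amazon Bedrock agent can be invoked with tracing enabled, producing the step-by-step trace events that observability tooling can aggregate.

import boto3

# Sketch only: invoke an Amazon Bedrock agent with tracing enabled
client = boto3.client("bedrock-agent-runtime")
response = client.invoke_agent(
    agentId="AGENT_ID",  # placeholder
    agentAliasId="AGENT_ALIAS_ID",  # placeholder
    sessionId="demo-session-1",
    inputText="Summarize today's supply chain exceptions.",
    enableTrace=True,  # emit reasoning and tool-use trace events alongside the response
)
for event in response["completion"]:
    if "trace" in event:
        print(event["trace"])  # trace events support governance, audit, and observability
    elif "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"))  # the agent's streamed answer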

The following diagram illustrates the Agent Control Tower architecture on AWS.

AWS architecture for observability: Bedrock, CloudWatch, Data Firehose, Timestream, and Agent Control Tower across customer and Boomi accounts

Business impact: Transforming enterprise AI operations

Consider a global manufacturer using AI agents for supply chain optimization. With Agent Control Tower, they can monitor agent performance across regions in real time, enforce consistent security policies, and enable regulatory compliance. When issues arise, they can quickly identify and resolve them while maintaining the ability to scale AI operations confidently. With this level of control and visibility, organizations can deploy AI agents more effectively while maintaining robust security and compliance standards.

Conclusion

Boomi customers have already deployed more than 33,000 agents and are seeing up to 80% less time spent on documentation and 50% faster issue resolution. With Boomi and AWS, enterprises can accelerate and scale AI adoption with confidence, backed by a product that puts visibility, governance, and security first. Discover how Agent Control Tower can help your organization manage AI agent sprawl and take advantage of scalable, compliance-aligned innovation. Take a guided tour and learn more about Boomi Agent Control Tower and Amazon Bedrock integration. Or, you can get started today with AI FastTrack.


About the authors

Deepak Chandrasekar is the VP of Software Engineering & User Experience and leads multidisciplinary teams at Boomi. He oversees flagship initiatives like Boomi's Agent Control Tower, Task Automation, and Market Reach, while driving a cohesive and intelligent experience layer across products. Previously, Deepak held a key leadership role at Unifi Software, which was acquired by Boomi. With a passion for building scalable and intuitive AI-powered solutions, he brings a commitment to engineering excellence and responsible innovation.

Sandeep Singh is Director of Engineering at Boomi, where he leads global teams building solutions that enable enterprise integration and automation at scale. He drives initiatives like Boomi Agent Control Tower, Marketplace, and Labs, empowering partners and customers with intelligent, trusted solutions. With leadership experience at GE and Fujitsu, Sandeep brings expertise in API strategy, product engineering, and AI/ML solutions. A former solution architect, he is passionate about designing mission-critical systems and driving innovation through scalable, intelligent solutions.

Santosh Ameti is a seasoned engineering leader on the Amazon Bedrock team and has built Agents, Evaluation, Guardrails, and Prompt Management solutions. His team continuously innovates in the agentic space, delivering one of the most secure and managed agentic solutions for enterprises.

Greg Sligh is a Senior Solutions Architect at AWS with more than 25 years of experience in software engineering, software architecture, consulting, and IT and engineering leadership roles across multiple industries. For the majority of his career, he has focused on creating and delivering distributed, data-driven applications with a particular focus on scale, performance, and resiliency. Now he helps ISVs meet their objectives across technologies, with a particular focus on AI/ML.

Padma Iyer is a Senior Customer Solutions Manager at Amazon Web Services, where she specializes in supporting ISVs. With a passion for cloud transformation and financial technology, Padma works closely with ISVs to guide them through successful cloud transformations, using best practices to optimize their operations and drive business growth. Padma has over 20 years of industry experience spanning banking, tech, and consulting.
