Deploy Qwen models with Amazon Bedrock Custom Model Import

June 13, 2025

by Ajit Mahareddy Amazon AWS

We’re excited to announce that Amazon Bedrock Custom Model Import now supports Qwen models. You can now import custom weights for Qwen2, Qwen2_VL, and Qwen2_5_VL architectures, including models like Qwen 2, 2.5 Coder, Qwen 2.5 VL, and QwQ 32B. You can bring your own customized Qwen models into Amazon Bedrock and deploy them in a fully managed, serverless environment—without having to manage infrastructure or model serving.

In this post, we cover how to deploy Qwen 2.5 models with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the AWS infrastructure at an effective cost.

Overview of Qwen models

Qwen 2 and 2.5 are families of large language models, available in a wide range of sizes and specialized variants to suit diverse needs:

General language models: Models ranging from 0.5B to 72B parameters, with both base and instruct versions for general-purpose tasks
Qwen 2.5-Coder: Specialized for code generation and completion
Qwen 2.5-Math: Focused on advanced mathematical reasoning
Qwen 2.5-VL (vision-language): Image and video processing capabilities, enabling multimodal applications

Overview of Amazon Bedrock Custom Model Import

Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing foundation models (FMs) through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage the underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Amazon Bedrock tools and features like Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Agents. Amazon Bedrock Custom Model Import is generally available in the US-East (N. Virginia), US-West (Oregon), and Europe (Frankfurt) AWS Regions. Now, we’ll explore how you can use Qwen 2.5 models for two common use cases: as a coding assistant and for image understanding. Qwen2.5-Coder is a state-of-the-art code model, matching capabilities of proprietary models like GPT-4o. It supports over 90 programming languages and excels at code generation, debugging, and reasoning. Qwen 2.5-VL brings advanced multimodal capabilities. According to Qwen, Qwen 2.5-VL is not only proficient at recognizing objects such as flowers and animals, but also at analyzing charts, extracting text from images, interpreting document layouts, and processing long videos.

Prerequisites

Before importing the Qwen model with Amazon Bedrock Custom Model Import, make sure that you have the following in place:

An active AWS account
An Amazon Simple Storage Service (Amazon S3) bucket to store the Qwen model files
Sufficient permissions to create Amazon Bedrock model import jobs
Verified that your Region supports Amazon Bedrock Custom Model Import

Use case 1: Qwen coding assistant

In this example, we will demonstrate how to build a coding assistant using the Qwen2.5-Coder-7B-Instruct model

Go to to Hugging Face and search for and copy the Model ID Qwen/Qwen2.5-Coder-7B-Instruct:

You will use Qwen/Qwen2.5-Coder-7B-Instruct for the rest of the walkthrough. We don’t demonstrate fine-tuning steps, but you can also fine-tune before importing.

Use the following command to download a snapshot of the model locally. The Python library for Hugging Face provides a utility called snapshot download for this:

from huggingface_hub import snapshot_download

snapshot_download(repo_id=" Qwen/Qwen2.5-Coder-7B-Instruct", 
                local_dir=f"./extractedmodel/")

Depending on your model size, this could take a few minutes. When completed, your Qwen Coder 7B model folder will contain the following files.

Configuration files: Including config.json, generation_config.json, tokenizer_config.json, tokenizer.json, and vocab.json
Model files: Four safetensor files and model.safetensors.index.json
Documentation: LICENSE, README.md, and merges.txt

Upload the model to Amazon S3, using boto3 or the command line:

aws s3 cp ./extractedfolder s3://yourbucket/path/ --recursive

Start the import model job using the following API call:

response = self.bedrock_client.create_model_import_job(
                jobName="uniquejobname",
                importedModelName="uniquemodelname",
                roleArn="fullrolearn",
                modelDataSource={
                    's3DataSource': {
                        's3Uri': "s3://yourbucket/path/"
                    }
                }
            )

You can also do this using the AWS Management Console for Amazon Bedrock.

In the Amazon Bedrock console, choose Imported models in the navigation pane.
Choose Import a model.

Enter the details, including a Model name, Import job name, and model S3 location.

Create a new service role or use an existing service role. Then choose Import model

After you choose Import on the console, you should see status as importing when model is being imported:

If you’re using your own role, make sure you add the following trust relationship as describes in Create a service role for model import.

After your model is imported, wait for model inference to be ready, and then chat with the model on the playground or through the API. In the following example, we append Python to prompt the model to directly output Python code to list items in an S3 bucket. Remember to use the right chat template to input prompts in the format required. For example, you can get the right chat template for any compatible model on Hugging Face using below code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Instead of using model.chat(), we directly use model.generate()
# But you need to use tokenizer.apply_chat_template() to format your inputs as shown below
prompt = "Write sample boto3 python code to list files in a bucket stored in the variable `my_bucket`"
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Note that when using the invoke_model APIs, you must use the full Amazon Resource Name (ARN) for the imported model. You can find the Model ARN in the Bedrock console, by navigating to the Imported models section and then viewing the Model details page, as shown in the following figure

After the model is ready for inference, you can use Chat Playground in Bedrock console or APIs to invoke the model.

Use case 2: Qwen 2.5 VL image understanding

Qwen2.5-VL-* offers multimodal capabilities, combining vision and language understanding in a single model. This section demonstrates how to deploy Qwen2.5-VL using Amazon Bedrock Custom Model Import and test its image understanding capabilities.

Import Qwen2.5-VL-7B to Amazon Bedrock

Download the model from Huggingface Face and upload it to Amazon S3:

from huggingface_hub import snapshot_download

hf_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# Enable faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Download model locally
snapshot_download(repo_id=hf_model_id, local_dir=f"./{local_directory}")

Next, import the model to Amazon Bedrock (either via Console or API):

response = bedrock.create_model_import_job(
    jobName=job_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_uri
        }
    }
)

Test the vision capabilities

After the import is complete, test the model with an image input. The Qwen2.5-VL-* model requires proper formatting of multimodal inputs:

def generate_vl(messages, image_base64, temperature=0.3, max_tokens=4096, top_p=0.9):
    processor = AutoProcessor.from_pretrained("Qwen/QVQ-72B-Preview")
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({
            'prompt': prompt,
            'temperature': temperature,
            'max_gen_len': max_tokens,
            'top_p': top_p,
            'images': [image_base64]
        }),
        accept='application/json',
        contentType='application/json'
    )
    
    return json.loads(response['body'].read().decode('utf-8'))

# Using the model with an image
file_path = "cat_image.jpg"
base64_data = image_to_base64(file_path)

messages = [
    {
        "role": "user",
        "content": [
            {"image": base64_data},
            {"text": "Describe this image."}
        ]
    }
]

response = generate_vl(messages, base64_data)

# Print response
print("Model Response:")
if 'choices' in response:
    print(response['choices'][0]['text'])
elif 'outputs' in response:
    print(response['outputs'][0]['text'])
else:
    print(response)

When provided with an example image of a cat (such the following image), the model accurately describes key features such as the cat’s position, fur color, eye color, and general appearance. This demonstrates Qwen2.5-VL-* model’s ability to process visual information and generate relevant text descriptions.

The model’s response:

This image features a close-up of a cat lying down on a soft, textured surface, likely a couch or a bed. The cat has a tabby coat with a mix of dark and light brown fur, and its eyes are a striking green with vertical pupils, giving it a captivating look. The cat's whiskers are prominent and extend outward from its face, adding to the detailed texture of the image. The background is softly blurred, suggesting a cozy indoor setting with some furniture and possibly a window letting in natural light. The overall atmosphere of the image is warm and serene, highlighting the cat's relaxed and content demeanor.

Pricing

You can use Amazon Bedrock Custom Model Import to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn’t charge for model import. You are charged for inference based on two factors: the number of active model copies and their duration of activity. Billing occurs in 5-minute increments, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The custom model unites required for hosting depends on the model’s architecture, parameter count, and context length. Amazon Bedrock automatically manages scaling based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this might involve cold-start latency of up to a minute. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.

For more information, see Amazon Bedrock pricing.

Clean up

To avoid ongoing charges after completing the experiments:

Delete your imported Qwen models from Amazon Bedrock Custom Model Import using the console or the API.
Optionally, delete the model files from your S3 bucket if you no longer need them.

Remember that while Amazon Bedrock Custom Model Import doesn’t charge for the import process itself, you are billed for model inference usage and storage.

Conclusion

Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like Qwen 2.5, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of Qwen 2.5’s advanced AI capabilities and Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.

For more information, refer to the Amazon Bedrock User Guide.

About the Authors

Ajit Mahareddy is an experienced Product and Go-To-Market (GTM) leader with over 20 years of experience in Product Management, Engineering, and Go-To-Market. Prior to his current role, Ajit led product management building AI/ML products at leading technology companies, including Uber, Turing, and eHealth. He is passionate about advancing Generative AI technologies and driving real-world impact with Generative AI.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Dharinee Gupta is an Engineering Manager at AWS Bedrock, where she focuses on enabling customers to seamlessly utilize open source models through serverless solutions. Her team specializes in optimizing these models to deliver the best cost-performance balance for customers. Prior to her current role, she gained extensive experience in authentication and authorization systems at Amazon, developing secure access solutions for Amazon offerings. Dharinee is passionate about making advanced AI technologies accessible and efficient for AWS customers.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

June Won is a Principal Product Manager with Amazon SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping applications and last mile delivery.

Build generative AI solutions with Amazon Bedrock

June 13, 2025

by Sajjan AVS Amazon AWS

Generative AI is revolutionizing how businesses operate, interact with customers, and innovate. If you’re embarking on the journey to build a generative AI-powered solution, you might wonder how to navigate the complexities involved from selecting the right models to managing prompts and enforcing data privacy.

In this post, we show you how to build generative AI applications on Amazon Web Services (AWS) using the capabilities of Amazon Bedrock, highlighting how Amazon Bedrock can be used at each step of your generative AI journey. This guide is valuable for both experienced AI engineers and newcomers to the generative AI space, helping you use Amazon Bedrock to its fullest potential.

Amazon Bedrock is a fully managed service that provides a unified API to access a wide range of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Mistral AI, AI21 Labs, Stability AI, and Amazon. It offers a robust set of tools and features designed to help you build generative AI applications efficiently while adhering to best practices in security, privacy, and responsible AI.

Calling an LLM with an API

You want to integrate a generative AI feature into your application through a straightforward, single-turn interaction with a large language model (LLM). Perhaps you need to generate text, answer a question, or provide a summary based on user input. Amazon Bedrock simplifies generative AI application development and scaling through a unified API for accessing diverse, leading FMs. With support for Amazon models and leading AI providers, you have the freedom to experiment without being locked into a single model or provider. With the rapid pace of development in AI, you can seamlessly switch models for optimized performance with no application rewrite required.

Beyond direct model access, Amazon Bedrock expands your options with the Amazon Bedrock Marketplace. This marketplace gives you access to over 100 specialized FMs; you can discover, test, and integrate new capabilities all through fully managed endpoints. Whether you need the latest innovation in text generation, image synthesis, or domain-specific AI, Amazon Bedrock provides the flexibility to adapt and scale your solution with ease.

With one API, you stay agile and can effortlessly switch between models, upgrade to the latest versions, and future-proof your generative AI applications with minimal code changes. To summarize, Amazon Bedrock offers the following benefits:

Simplicity: No need to manage infrastructure or deal with multiple APIs
Flexibility: Experiment with different models to find the best fit
Scalability: Scale your application without worrying about underlying resources

To get started, use the Chat or Text playground to experiment with different FMs, and use the Converse API to integrate FMs into your application.

After you’ve integrated a basic LLM feature, the next step is optimizing the performance and making sure you’re using the right model for your requirements. This brings us to the importance of evaluating and comparing models.

Choosing the right model for your use case

Selecting the right FM for your use case is crucial, but with so many options available, how do you know which one will give you the best performance for your application? Whether it’s for generating more relevant responses, summarizing information, or handling nuanced queries, choosing the best model is key to providing optimal performance.

You can use Amazon Bedrock model evaluation to rigorously test different FMs to find the one that delivers the best results for your use case. Whether you’re in the early stages of development or preparing for launch, selecting the right model can make a significant difference in the effectiveness of your generative AI solutions.

The model evaluation process consists of the following components:

Automatic and human evaluation: Begin by experimenting with different models using automated evaluation metrics like accuracy, robustness, or toxicity. You can also bring in human evaluators to measure more subjective factors, such as friendliness, style, or how well the model aligns with your brand voice.
Custom datasets and metrics: Evaluate the performance of models using your own datasets or pre-built options. Customize the metrics that matter most for your project, making sure the selected model aligns with your business or operational goals.
Iterative feedback: Throughout the development process, run evaluations iteratively, allowing for faster refinement. This helps you compare models side by side, so you can make a data-driven decision when selecting the FM that fits your use case.

Imagine you’re building a customer support AI assistant for an ecommerce service. You can model evaluation to test multiple FMs with real customer queries, evaluating which model provides the most accurate, friendly, and contextually appropriate responses. By comparing models side by side, you can choose the model that will deliver the best possible user experience for your customers. After you’ve evaluated and selected the ideal model, the next step is making sure it aligns with your business needs. Off-the-shelf models might perform well, but for a truly tailored experience, you need more customization. This leads to the next important step in your generative AI journey: personalizing models to reflect your business context. You need to make sure the model generates the most accurate and contextually relevant responses. Even the best FMs will not have access to the latest or domain-specific information critical to your business. To solve this, the model needs to use your proprietary data sources, making sure its outputs reflect the most up-to-date and relevant information. This is where you can use Retrieval Augmented Generation (RAG) to enrich the model’s responses by incorporating your organization’s unique knowledge base.

Enriching model responses with your proprietary data

A publicly available LLM might perform well on general knowledge tasks, but struggle with outdated information or lack context from your organization’s proprietary data. You need a way to provide the model with the most relevant, up-to-date insights to provide accuracy and contextual depth. There are two key approaches that you can use to enrich model responses:

RAG: Use RAG to dynamically retrieve relevant information at query time, enriching model responses without requiring retraining
Fine-tuning: Use RAG to customize your chosen model by training it on proprietary data, improving its ability to handle organization-specific tasks or domain knowledge

We recommend starting with RAG because of its flexible and straightforward to implement. You can then fine-tune the model for deeper domain adaptation if needed. RAG dynamically retrieves relevant information at query time, making sure model responses stay accurate and context aware. In this approach, data is first processed and indexed in a vector database or similar retrieval system. When a user submits a query, Amazon Bedrock searches this indexed data to find relevant context, which is injected into the prompt. The model then generates a response based on both the original query and the retrieved insights without requiring additional training.

Amazon Bedrock Knowledge Bases automates the RAG pipeline—including data ingestion, retrieval, prompt augmentation, and citations—reducing the complexity of setting up custom integrations. By seamlessly integrating proprietary data, you can make sure that the models generate accurate, contextually rich, and continuously updated responses.

Bedrock Knowledge Bases supports various data types to tailor AI-generated responses to business-specific needs:

Unstructured data: Extract insights from text-heavy sources like documents, PDFs, and emails
Structured data: Enable natural language queries on databases, data lakes, and warehouses without moving or preprocessing data
Multimodal data: Process both text and visual elements in documents and images using Amazon Bedrock Data Automation
GraphRAG: Enhance knowledge retrieval with graph-based relationships, enabling AI to understand entity connections for more context-aware responses

With these capabilities, Amazon Bedrock reduces data silos, making it straightforward to enrich AI applications with both real-time and historical knowledge. Whether working with text, images, structured datasets, or interconnected knowledge graphs, Amazon Bedrock provides a fully managed, scalable solution without the need for complex infrastructure. To summarize, using RAG with Amazon Bedrock offers the following benefits:

Up-to-date information: Responses include the latest data from your knowledge bases
Accuracy: Reduces the risk of incorrect or irrelevant answers
No extra infrastructure: You can avoid setting up and managing your own vector databases or custom integrations

When your model is pulling from the most accurate and relevant data, you might find that its general behavior still needs some refinement perhaps in its tone, style, or understanding of industry-specific language. This is where you can further fine-tune the model to align it even more closely with your business needs.

Tailoring models to your business needs

Out-of-the-box FMs provide a strong starting point, but they often lack the precision, brand voice, or industry-specific expertise required for real-world applications. Maybe the language doesn’t align with your brand, or the model struggles with specialized terminology. You might have experimented with prompt engineering and RAG to enhance responses with additional context. Although these techniques help, they have limitations (for example, longer prompts can increase latency and cost), and models might still lack deep domain expertise needed for domain-specific tasks. To fully harness generative AI, businesses need a way to securely adapt models, making sure AI-generated responses are not only accurate but also relevant, reliable, and aligned with business goals.

Amazon Bedrock simplifies model customization, enabling businesses to fine-tune FMs with proprietary data without building models from scratch or managing complex infrastructure.

Rather than retraining an entire model, Amazon Bedrock provides a fully managed fine-tuning process that creates a private copy of the base FM. This makes sure your proprietary data remains confidential and isn’t used to train the original model. Amazon Bedrock offers two powerful techniques to help businesses refine models efficiently:

Fine-tuning: You can train an FM with labeled datasets to improve accuracy in industry-specific terminology, brand voice, and company workflows. This allows the model to generate more precise, context-aware responses without relying on complex prompts.
Continued pre-training: If you have unlabeled domain-specific data, you can use continued pre-training to further train an FM on specialized industry knowledge without manual labeling. This approach is especially useful for regulatory compliance, domain-specific jargon, or evolving business operations.

By combining fine-tuning for core domain expertise with RAG for real-time knowledge retrieval, businesses can create highly specialized AI models that stay accurate and adaptable, and make sure the style of responses align with business goals. To summarize, Amazon Bedrock offers the following benefits:

Privacy-preserved customization: Fine-tune models securely while making sure that your proprietary data remains private
Efficiency: Achieve high accuracy and domain relevance without the complexity of building models from scratch

As your project evolves, managing and optimizing prompts becomes critical, especially when dealing with different iterations or testing multiple prompt versions. The next step is refining your prompts to maximize model performance.

Managing and optimizing prompts

As your AI projects scale, managing multiple prompts efficiently becomes a growing challenge. Tracking versions, collaborating with teams, and testing variations can quickly become complex. Without a structured approach, prompt management can slow down innovation, increase costs, and make iteration cumbersome. Optimizing a prompt for one FM doesn’t always translate well to another. A prompt that performs well with one FM might produce inconsistent or suboptimal outputs with another, requiring significant rework. This makes switching between models time-consuming and inefficient, limiting your ability to experiment with different AI capabilities effectively. Without a centralized way to manage, test, and refine prompts, AI development becomes slower, more costly, and less adaptable to evolving business needs.

Amazon Bedrock simplifies prompt engineering with Amazon Bedrock Prompt Management, an integrated system that helps teams create, refine, version, and share prompts effortlessly. Instead of manually adjusting prompts for months, Amazon Bedrock accelerates experimentation and enhances response quality without additional code. Bedrock Prompt Management introduces the following capabilities:

Versioning and collaboration: Manage prompt iterations in a shared workspace, so teams can track changes and reuse optimized prompts.
Side-by-side testing: Compare up to two prompt variations simultaneously to analyze model behavior and identify the most effective format.
Automated prompt optimization: Fine-tune and rewrite prompts based on the selected FM to improve response quality. You can select a model, apply optimization, and generate a more accurate, contextually relevant prompt.

Bedrock Prompt Management offers the following benefits:

Efficiency: Quickly iterate and optimize prompts without writing additional code
Teamwork: Enhance collaboration with shared access and version control
Insightful testing: Identify which prompts perform best for your use case

After you’ve optimized your prompts for the best results, the next challenge is optimizing your application for cost and latency by choosing the most appropriate model within a family for a given task. This is where intelligent prompt routing can help.

Optimizing efficiency with intelligent model selection

Not all prompts require the same level of AI processing. Some are straightforward and need fast responses, whereas others require deeper reasoning and more computational power. Using high-performance models for every request increases costs and latency, even when a lighter, faster model could generate an equally effective response. At the same time, relying solely on smaller models might reduce accuracy for complex queries. Without an automated approach, business must manually determine which model to use for each request, leading to higher costs, inefficiencies, and slower development cycles.

Amazon Bedrock Intelligent Prompt Routing optimizes AI performance and cost by dynamically selecting the most appropriate FM for each request. Instead of manually choosing a model, Amazon Bedrock automates model selection within a model family, making sure that each prompt is routed to the best-performing model for its complexity. Bedrock Intelligent Prompt Routing offers the following capabilities:

Adaptive model routing: Automatically directs simple prompts to lightweight models and complex queries to more advanced models, providing the right balance between speed and efficiency
Performance balance: Makes sure that you use high-performance models only when necessary, reducing AI inference costs by up to 30%
Effortless integration: Automatically selects the right model within a family, simplifying deployment

By automating model selection, Amazon Bedrock removes the need for manual decision-making, reduces operational overhead, and makes sure AI applications run efficiently at scale. With Amazon Bedrock Intelligent Prompt Routing, each query is processed by the most efficient model, delivering speed, cost savings, and high-quality responses. The next step in optimizing AI efficiency is reducing redundant computations in frequently used prompts. Many AI applications require maintaining context across multiple interactions, which can lead to performance bottlenecks, increased costs, and unnecessary processing overhead.

Reducing redundant processing for faster responses

As your generative AI applications scale, efficiency becomes just as critical as accuracy. Applications that repeatedly use the same context—such as document Q&A systems (where users ask multiple questions about the same document) or coding assistants that maintain context about code files—often face performance bottlenecks and rising costs because of redundant processing. Each time a query includes long, static context, models reprocess unchanged information, leading to increased latency as models repeatedly analyze the same content and unnecessary token usage inflates compute expenses. To keep AI applications fast, cost-effective, and scalable, optimizing how prompts are reused and processed is essential.

Amazon Bedrock Prompt Caching enhances efficiency by storing frequently used portions of prompts—reducing redundant computations and improving response times. It offers the following benefits:

Faster processing: Skips unnecessary recomputation of cached prompt prefixes, boosting overall throughput
Lower latency: Reduces processing time for long, repetitive prompts, delivering a smoother user experience, and reducing latency by up to 85% for supported models
Cost-efficiency: Minimizes compute resource usage by avoiding repeated token processing, reducing costs by up to 90%

With prompt caching, AI applications respond faster, reduce operational costs, and scale efficiently while maintaining high performance. With Bedrock Prompt Caching providing faster responses and cost-efficiency, the next step is enabling AI applications to move beyond static prompt-response interactions. This is where agentic AI comes in, empowering applications to dynamically orchestrate multistep processes, automate decision-making, and drive intelligent workflows.

Automating multistep tasks with agentic AI

As AI applications grow more sophisticated, automating complex, multistep tasks become essential. You need a solution that can interact with internal systems, APIs, and databases to execute intricate workflows autonomously. The goal is to reduce manual intervention, improve efficiency, and create more dynamic, intelligent applications. Traditional AI models are reactive; they generate responses based on inputs but lack the ability to plan and execute multistep tasks. Agentic AI refers to AI systems that act with autonomy, breaking down complex tasks into logical steps, making decisions, and executing actions without constant human input. Unlike traditional models that only respond to prompts, agentic AI models have the following capabilities:

Autonomous planning and execution: Breaks complex tasks into smaller steps, makes decisions, and plans actions to complete the workflow
Chaining capabilities: Handles sequences of actions based on a single request, enabling the AI to manage intricate tasks that would otherwise require manual intervention or multiple interactions
Interaction with APIs and systems: Connects to your enterprise systems and automatically invokes necessary APIs or databases to fetch or update data

Amazon Bedrock Agents enables AI-powered task automation by using FMs to plan, orchestrate, and execute workflows. With a fully managed orchestration layer, Amazon Bedrock simplifies the process of deploying, scaling, and managing AI agents. Bedrock Agents offers the following benefits:

Task orchestration: Uses FMs’ reasoning capabilities to break down tasks, plan execution, and manage dependencies
API integration: Automatically calls APIs within enterprise systems to interact with business applications
Memory retention: Maintains context across interactions, allowing agents to remember previous steps, providing a seamless user experience

When a task requires multiple specialized agents, Amazon Bedrock supports multi-agent collaboration, making sure agents work together efficiently while alleviating manual orchestration overhead. This unlocks the following capabilities:

Supervisor-agent coordination: A supervisor agent delegates tasks to specialized subagents, providing optimal distribution of workloads
Efficient task execution: Supports parallel task execution, enabling faster processing and improved accuracy
Flexible collaboration modes: You can choose between the following modes:
- Fully orchestrated supervisor mode: A central agent manages the full workflow, providing seamless coordination
- Routing mode: Basic tasks bypass the supervisor and go directly to subagents, reducing unnecessary orchestration
Seamless integration: Works with enterprise APIs and internal knowledge bases, making it straightforward to automate business operations across multiple domains

By using multi-agent collaboration, you can increase task success rates, reduce execution time, and improve accuracy, making AI-driven automation more effective for real-world, complex workflows. To summarize, agentic AI offers the following benefits:

Automation: Reduces manual intervention in complex processes
Flexibility: Agents can adapt to changing requirements or gather additional information as needed
Transparency: You can use the trace capability to debug and optimize agent behavior

Although automating tasks with agents can streamline operations, handling sensitive information and enforcing privacy is paramount, especially when interacting with user data and internal systems. As your application grows more sophisticated, so do the security and compliance challenges.

Maintaining security, privacy, and responsible AI practices

As you integrate generative AI into your business, security, privacy, and compliance become critical concerns. AI-generated responses must be safe, reliable, and aligned with your organization’s policies to help violating brand guidelines or regulatory policies, and must not include inaccurate or misleading responses.

Amazon Bedrock Guardrails provides a comprehensive framework to enhance security, privacy, and accuracy in AI-generated outputs. With built-in safeguards, you can enforce policies, filter content, and improve trustworthiness in AI interactions. Bedrock Guardrails offers the following capabilities:

Content filtering: Block undesirable topics and harmful content in user inputs and model responses.
Privacy protection: Detect and redact sensitive information like personally identifiable information (PII) and confidential data to help prevent data leaks.
Custom policies: Define organization-specific rules to make sure AI-generated content aligns with internal policies and brand guidelines.
Hallucination detection: Identify and filter out responses not grounded in your data sources through the following capabilities:
- Contextual grounding checks: Make sure model responses are factually correct and relevant by validating them against enterprise data source. Detect hallucinations when outputs contain unverified or irrelevant information.
- Automated reasoning for accuracy: Moves beyond trust me to prove it AI outputs by applying mathematically sound logic and structured reasoning to verify factual correctness.

With security and privacy measures in place, your AI solution is not only powerful but also responsible. However, if you’ve already made significant investments in custom models, the next step is to integrate them seamlessly into Amazon Bedrock.

Using existing custom models with Amazon Bedrock Custom Model Import

Use Amazon Bedrock Custom Model Import if you’ve already invested in custom models developed outside of Amazon Bedrock and want to integrate them into your new generative AI solution without managing additional infrastructure.

Bedrock Custom Model Import includes the following capabilities:

Seamless integration: Import your custom models into Amazon Bedrock
Unified API access: Interact with models—both base and custom—through the same API
Operational efficiency: Let Amazon Bedrock handle the model lifecycle and infrastructure management

Bedrock Custom Model Import offers the following benefits:

Cost savings: Maximize the value of your existing models
Simplified management: Reduce overhead by consolidating model operations
Consistency: Maintain a unified development experience across models

By importing custom models, you can use your prior investments. To truly unlock the potential of your models and prompt structures, you can automate more complex workflows, combining multiple prompts and integrating with other AWS services.

Automating workflows with Amazon Bedrock Flows

You need to build complex workflows that involve multiple prompts and integrate with other AWS services or business logic, but you want to avoid extensive coding.

Amazon Bedrock Flows has the following capabilities:

Visual builder: Drag-and-drop components to create workflows
Workflow automation: Link prompts with AWS services and automate sequences
Testing and versioning: Test flows directly in the console and manage versions

Amazon Bedrock Flows offers the following benefits:

No-code solution: Build workflows without writing code
Speed: Accelerate development and deployment of complex applications
Collaboration: Share and manage workflows within your team

With workflows now automated and optimized, you’re nearly ready to deploy your generative AI-powered solution. The final stage is making sure that your generative AI solution can scale efficiently and maintain high performance as demand grows.

Monitoring and logging to close the loop on AI operations

As you prepare to move your generative AI application into production, it’s critical to implement robust logging and observability to monitor system health, verify compliance, and quickly troubleshoot issues. Amazon Bedrock offers built-in observability capabilities that integrate seamlessly with AWS monitoring tools, enabling teams to track performance, understand usage patterns, and maintain operational control

Model invocation logging: You can enable detailed logging of model invocations, capturing input prompts and output responses. These logs can be streamed to Amazon CloudWatch or Amazon Simple Storage Service (Amazon S3) for real-time monitoring or long-term analysis. Logging is configurable through the AWS Management Console or the CloudWatchConfig API.
CloudWatch metrics: Amazon Bedrock provides rich operational metrics out-of-the-box, including:
- Invocation count
- Token usage (input/output)
- Response latency
- Error rates (for example, invalid input and model failures)

These capabilities are essential for running generative AI solutions at scale with confidence. By using CloudWatch, you gain visibility across the full AI pipeline from input prompts to model behavior; making it straightforward to maintain uptime, performance, and compliance as your application grows.

Finalizing and scaling your generative AI solution

You’re ready to deploy your generative AI application and need to scale it efficiently while providing reliable performance. Whether you’re handling unpredictable workloads, enhancing resilience, or needing consistent throughput, you must choose the right scaling approach. Amazon Bedrock offers three flexible scaling options that you can use to tailor your infrastructure to your workload needs:

On-demand: Start with the flexibility of on-demand scaling, where you pay only for what you use. This option is ideal for early-stage deployments or applications with variable or unpredictable traffic. It offers the following benefits:
- No commitments.
- Pay only for tokens processed (input/output).
- Great for dynamic or fluctuating workloads.
Cross-Region inference: When your traffic grows or becomes unpredictable, you can use cross-Region inference to handle bursts by distributing compute across multiple AWS Regions, enhancing availability without additional cost. It offers the following benefits:
- Up to two times larger burst capacity.
- Improved resilience and availability.
- No additional charges, you have the same pricing as your primary Region.
Provisioned Throughput: For large, consistent workloads, Provisioned Throughput maintains a fixed level of performance. This option is perfect when you need predictable throughput, particularly for custom models. It offers the following benefits:
- Consistent performance for high-demand applications.
- Required for custom models.
- Flexible commitment terms (1 month or 6 months).

Conclusion

Building generative AI solutions is a multifaceted process that requires careful consideration at every stage. Amazon Bedrock simplifies this journey by providing a unified service that supports each phase, from model selection and customization to deployment and compliance. Amazon Bedrock offers a comprehensive suite of features that you can use to streamline and enhance your generative AI development process. By using its unified tools and APIs, you can significantly reduce complexity, enabling accelerated development and smoother workflows. Collaboration becomes more efficient because team members can work seamlessly across different stages, fostering a more cohesive and productive environment. Additionally, Amazon Bedrock integrates robust security and privacy measures, helping to ensure that your solutions meet industry and organization requirements. Finally, you can use its scalable infrastructure to bring your generative AI solutions to production faster while minimizing overhead. Amazon Bedrock stands out as a one-stop solution that you can use to build sophisticated, secure, and scalable generative AI applications. Its extensive capabilities alleviate the need for multiple vendors and tools, streamlining your workflow and enhancing productivity.

Explore Amazon Bedrock and discover how you can use its features to support your needs at every stage of generative AI development. To learn more, see the Amazon Bedrock User Guide.

About the authors

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services, driving AI-led transformation across North America’s FinTech sector. He partners with organizations to design and execute cloud and AI strategies that speed up innovation and deliver measurable business impact. His work has consistently translated into millions in value through enhanced efficiency and additional revenue streams. With deep expertise in AI/ML, Generative AI, and cloud-native architectures, Sajjan enables financial institutions to achieve scalable, data-driven outcomes. When not architecting the future of finance, he enjoys traveling and spending time with family. Connect with him on LinkedIn.

Axel Larsson is a Principal Solutions Architect at AWS based in the greater New York City area. He supports FinTech customers and is passionate about helping them transform their business through cloud and AI technology. Outside of work, he is an avid tinkerer and enjoys experimenting with home automation.

How Netsertive built a scalable AI assistant to extract meaningful insights from real-time data using Amazon Bedrock and Amazon Nova

June 13, 2025

by Nicholas Switzer Amazon AWS

This post was co-written with Herb Brittner from Netsertive.

Netsertive is a leading digital marketing solutions provider for multi-location brands and franchises, helping businesses maximize local advertising, improve engagement, and gain deep customer insights.

With a growing demand in providing more actionable insights from their customer call tracking data, Netsertive needed a solution that could unlock business intelligence from every call, making it easier for franchises to improve customer service and boost conversion rates. The team was looking for a single, flexible system that could do several things:

Understand phone calls – Automatically create summaries of what was discussed
Gauge customer feelings – Determine if the caller was happy, upset, or neutral
Identify important topics – Pull out keywords related to frequent services, questions, problems, and mentions of competitors
Improve agent performance – Offer advice and suggestions for coaching
Track performance over time – Generate reports on trends for individual locations, regions, and the entire country

Crucially, this new system needed to work smoothly with their existing Multi-Location Experience (MLX) platform. The MLX platform is specifically designed for businesses with many locations and helps them manage both national and local marketing. It allows them to run campaigns across various online channels, including search engines, social media, display ads, videos, connected TVs, and online reviews, as well as manage SEO, business listings, reviews, social media posting, and individual location web pages.

In this post, we show how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova, to bring their next generation of the platform to life.

Solution overview

Operating a comprehensive digital marketing solution, Netsertive handles campaign execution while providing key success metrics through their Insights Manager product. The platform features location-specific content management capabilities and robust lead capture functionality, collecting data from multiple sources, including paid campaigns, organic website traffic, and attribution pro forms. With CRM integration and call tracking features, MLX creates a seamless flow of customer data and marketing insights. This combination of managed services, automated tools, and analytics makes MLX a single source of truth for businesses seeking to optimize their digital marketing efforts while taking advantage of Netsertive’s expertise in campaign management. To address their desire to provide more actionable insights on the platform from customer call tracking data, Netsertive considered various solutions. After evaluating different tools and models, they decided to use Amazon Bedrock and the Amazon Nova Micro model. This choice was driven by the API-driven approach of Amazon Bedrock, its wide selection of large language models (LLMs), and the performance of the Amazon Nova Micro model specifically. They selected Amazon Nova Micro based on its ability to deliver fast response times at a low cost, while providing consistent and intelligent insights—key factors for Netsertive. With its generation speed of over 200 tokens per second and highly performant language understanding skills, this text-only model proved ideal for Netsertive. The following diagram shows how their MLX platform receives real-time phone calls and uses Amazon Nova Micro in Amazon Bedrock for processing real-time phone calls.

The real-time call processing flow consists of the following steps:

When a call comes in, it’s immediately routed to the Lead API. This process captures both the live call transcript and important metadata about the caller. This system continuously processes new calls as they arrive, facilitating real-time handling of incoming communications.
The captured transcript is forwarded to Amazon Bedrock for analysis. The system currently uses a standardized base prompt for all customers, and the architecture is designed to allow for customer-specific prompt customization as an added layer of context.
Amazon Nova Micro processes the transcript and returns a structured JSON response. This response includes multiple analysis components: sentiment analysis of the conversation, a concise call summary, identified key terms, overall call theme classification, and specific coaching suggestions for improvement.
All analysis results are systematically stored in an Amazon Aurora database with their associated key metrics. This makes sure the processed data is properly indexed and readily available for both immediate access and future analysis.

The aggregate report schedule flow consists of the following steps:

The aggregate analysis process automatically initiates on both weekly and monthly schedules. During each run, the system gathers call data that falls within the specified time period.
This aggregate analysis uses both Amazon Bedrock and Amazon Nova Micro, applying a specialized prompt designed specifically for trend analysis. This prompt differs from the real-time analysis to focus on identifying patterns and insights across multiple calls.

The processed aggregate data from both workflows is transformed into comprehensive reports displaying trend analysis and comparative metrics through the UI. This provides stakeholders with valuable insights into performance patterns and trends over time while allowing the user to dive deeper into specific metrics.

Results

The implementation of generative AI to create a real-time call data analysis solution has been a transformative journey for Netsertive. Their new Call Insights AI feature, using Amazon Nova Micro on Amazon Bedrock, only takes minutes to create actionable insights, compared to their previous manual call review processes, which took hours or even days for customers with high call volumes. Netsertive chose Amazon Bedrock and Amazon Nova Micro for their solution after a swift evaluation period of approximately 1 week of testing different tools and models. Their development approach was methodical and customer-focused. The Call Insights AI feature was added to their platform’s roadmap based on direct customer feedback and internal marketing expertise. The entire development process, from creating and testing their Amazon Nova Micro prompts to integrating Amazon Bedrock with their MLX platform, was completed within approximately 30 days before launching in beta. The transformation of real-time call data analysis isn’t just about processing more calls—it’s about creating a more comprehensive understanding of customer interactions. By implementing Amazon Bedrock and Amazon Nova Micro, Netsertive is able to better understand call purposes and value, enhance measurement capabilities, and progress towards more automated and efficient analysis systems. This evolution can not only streamline operations but also provide customers with more actionable insights about their digital marketing performance.

Conclusion

In this post, we shared how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova. This solution helped scale their MLX platform to provide their customers with instant, actionable insights, creating a more engaging and informative user experience. By using the advanced natural language processing capabilities of Amazon Bedrock and the high-performance, low-latency Amazon Nova Micro model, Netsertive was able to build a comprehensive call intelligence system that goes beyond just transcription and sentiment analysis.

The success of this project has demonstrated the transformative potential of generative AI in driving business intelligence and operational efficiency. To learn more about building powerful, generative AI assistants and applications using Amazon Bedrock and Amazon Nova, see Generative AI on AWS.

About the authors

Nicholas Switzer is an AI/ML Specialist Solutions Architect at Amazon Web Services. He joined AWS in 2022 and specializes in AI/ML, generative AI, IoT, and edge AI. He is based in the US and enjoys building intelligent products that improve everyday life.

Jane Ridge is Senior Solutions Architect at Amazon Web Services with over 20 years of technology experience. She joined AWS in 2020 and is based in the US. She is passionate around enabling growth of her customers through innovative solutions combined with her deep technical expertise in the AWS ecosystem. She is known for her ability to guide customers through all stages of their cloud journey and deliver impactful solutions.

Herb Brittner is the Vice President of Product & Engineering at Netsertive, where he leads the development of AI-driven digital marketing solutions for multi-location brands and franchises. With a strong background in product innovation and scalable engineering, he specializes in using machine learning and cloud technologies to drive business insights and customer engagement. Herb is passionate about building data-driven platforms that enhance marketing performance and operational efficiency.

Make videos accessible with automated audio descriptions using Amazon Nova

June 13, 2025

by Dylan Martin Amazon AWS

According to the World Health Organization, more than 2.2 billion people globally have vision impairment. For compliance with disability legislation, such as the Americans with Disabilities Act (ADA) in the United States, media in visual formats like television shows or movies are required to provide accessibility to visually impaired people. This often comes in the form of audio description tracks that narrate the visual elements of the film or show. According to the International Documentary Association, creating audio descriptions can cost $25 per minute (or more) when using third parties. For building audio descriptions internally, the effort for businesses in the media industry can be significant, requiring content creators, audio description writers, description narrators, audio engineers, delivery vendors and more according to the American Council of the Blind (ACB). This leads to the natural question, can you automate this process with the help of generative AI offerings in Amazon Web Services (AWS)?

Newly announced in December at re:Invent 2024, the Amazon Nova Foundation Models family is available through Amazon Bedrock and includes three multimodal foundational models (FMs):

Amazon Nova Lite (GA) – A low-cost multimodal model that’s lightning-fast for processing image, video, and text inputs
Amazon Nova Pro (GA) – A highly capable multimodal model with a balanced combination of accuracy, speed, and cost for a wide range of tasks
Amazon Nova Premier (GA) – Our most capable model for complex tasks and a teacher for model distillation

In this post, we demonstrate how you can use services like Amazon Nova, Amazon Rekognition, and Amazon Polly to automate the creation of accessible audio descriptions for video content. This approach can significantly reduce the time and cost required to make videos accessible for visually impaired audiences. However, this post doesn’t provide a complete, deployment-ready solution. We share pseudocode snippets and guidance in sequential order, in addition to detailed explanations and links to resources. For a complete script, you can use additional resources, such as Amazon Q Developer, to build a fully functional system. The automated workflow described in the post involves analyzing video content, generating text descriptions, and narrating them using AI voice generation. In summary, while powerful, this requires careful integration and testing to deploy effectively. By the end of this post, you’ll understand the key steps, but some additional work is needed to create a production-ready solution for your specific use case.

Solution overview

The following architecture diagram demonstrates the end-to-end workflow of the proposed solution. We will describe each component in-depth in the later sections of this post, but note that you can define the logic within a single script. You can then run your script on an Amazon Elastic Compute Cloude (Amazon EC2) instance or on your local computer. For this post, we assume that you will run the script on an Amazon SageMaker notebook.

Services used

The services shown in the architecture diagram include:

Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that provides scalable, durable, and highly available storage. In this example, we use Amazon S3 to store the video files (input) and scene description (text files) and audio description (MP3 files) output generated by the solution. The script starts by fetching the source video from an S3 bucket.
Amazon Rekognition – Amazon Rekognition is a computer vision service that can detect and extract video segments or scenes by identifying technical cues such as shot boundaries, black frames, and other visual elements. To yield higher accuracy for the generated video descriptions, you use Amazon Rekognition to segment the source video into smaller chunks before passing it to Amazon Nova. These video segments can be stored in a temporary directory on your compute machine.
Amazon Bedrock – Amazon Bedrock is a managed service that provides access to large, pre-trained AI models such as the Amazon Nova Pro model, which is used in this solution to analyze the content of each video segment and generate detailed scene descriptions. You can store these text descriptions in a text file (for example, video_analysis.txt).
Amazon Polly – Amazon Polly is a text-to-speech service that is used to convert the text descriptions generated by the Amazon Nova Pro model into high-quality audio, made available using an MP3 file.

Prerequisites

To follow along with the solution outlined in this post, you should have the following in place:

A video file. For this post, we use a public domain video, This is Coffee.
An AWS account with access to the following services:
- Amazon Rekognition
- Amazon Nova Pro
- Amazon S3
- Amazon Polly
- Configure your AWS Command Line Interface (AWS CLI) or environment with valid credentials (using aws configure or environment variables)
To write the script, you need access to an AWS Software Development Kit (AWS SDK) in the language of your choice. In this post, we assume that you will use the AWS SDK for Python (Boto3). Additional information on AWS SDK for Boto3 is available in the Quickstart for Boto3.

You can use AWS SDK to create, configure, and manage AWS services. For Boto3, you can include it at the top of your script using: import boto3

Additionally, you need a mechanism to split videos. If you’re using Python, we recommend the moviepy library.
import moviepy # pip install moviepy

Solution walkthrough

The solution includes the following basic steps, which you can use as a basic structure and customize or expand to fit your use case.

Define the requirements for the AWS environment, including defining the use of the Amazon Nova Pro model for its visual support and the AWS Region you’re working in. For optimal throughput, we recommend using inference profiles when configuring Amazon Bedrock to invoke the Amazon Nova Pro model. Initialize a client for Amazon Rekognition, which you use for its support of segmentation.

CLASS VideoAnalyzer:
	FUNCTION initialize():
 		Set AWS_REGION to "us-east-1"
 		Set MODEL_ID to "amazon.nova-pro-v1:0"
 		Set chunk_delay to 20 Initialize AWS clients (Bedrock and Rekognition)

Define a function for detecting segments in the video. Amazon Rekognition supports segmentation, which means users have the option to detect and extract different segments or scenes within a video. By using the Amazon Rekognition Segment API, you can perform the following:
1. Detect technical cues such as black frames, color bars, opening and end credits, and studio logos in a video.
2. Detect shot boundaries to identify the start, end, and duration of individual shots within the video.

The solution uses Amazon Rekognition to partition the video into multiple segments and perform Amazon Nova Pro-based inference on each segment. Finally, you can piece together each segment’s inference output to return a comprehensive audio description for the entire video.

FUNCTION get_segment_results(job_id):
 	TRY:
 	   Initialize empty segments list 
 	   WHILE more results exist:
 	         Get segment detection results 
                Add segments to list 
                IF no more results THEN break
          RETURN segments 
       CATCH any errors and return null 

FUNCTION extract_scene(video_path, start_time, end_time):
       TRY: 
           Load video file 
           Validate time range
           Create temporary directory 
           Extract video segment 
           Save segment to file 
           RETURN path to saved segment 
       CATCH any errors and return null

In the preceding image, there are two scenes: a screenshot of one scene on the left followed by the scene that immediately follows it on the right. With the Amazon Rekognition segmentation API, you can identify that the scene has changed—that the content that is displayed on screen is different—and therefore you need to generate a new scene description.

Create the segmentation job and:
- Upload the video file for which you want to create an audio description to Amazon S3.
- Start the job using that video.

Setting SegmentType=[‘SHOT’] identifies the start, end, and duration of a scene. Additionally, MinSegmentConfidence sets the minimum confidence Amazon Rekognition must have to return a detected segment, with 0 being lowest confidence and 100 being highest.

Use the analyze_chunk function. This function defines the main logic of the audio description solution. Some items to note about analyze_chunk:
- For this example, we sent a video scene to Amazon Nova Pro for an analysis of the contents using the prompt Describe what is happening in this video in detail. This prompt is relatively straightforward and experimentation or customization for your use case is encouraged. Amazon Nova Pro then returned the text description for our video scene.
- For longer videos with many scenes, you might encounter throttling. This is resolved by implementing a retry mechanism. For details on throttling and quotas for Amazon Bedrock, see Quotas for Amazon Bedrock.

FUNCTION analyze_chunk(chunk_path): 
     TRY: 
        Convert video chunk to base64 
        Create request body for Bedrock 
        Set max_retries and backoff_time 

        WHILE retry_count < max_retries:
          TRY:
             Send InvokeModel request to Bedrock
             RETURN analysis results 
          CATCH throttling: 
              Wait and retry with exponential backoff 
          CATCH other errors: 
              Return null 
     CATCH any errors:
         Return null

In effect, the raw scenes are converted into rich, descriptive text. Using this text, you can generate a complete scene-by-scene walkthrough of the video and send it to Amazon Polly for audio.

Use the following code to orchestrate the process:
1. Initiate the detection of the various segments by using Amazon Rekognition.
2. Each segment is processed through a flow of:
  1. Extraction.
  2. Analysis using Amazon Nova Pro.
  3. Compiling the analysis into a video_analysis.txt file.
The analyze_video function brings together all the components and produces a text file that contains the complete, scene-by-scene analysis of the video contents, with timestamps

FUNCTION analyze_video(video_path, bucket): 
     TRY: 
         Start segment detection 
         Wait for job completion 
         Get segments 
         FOR each segment: 
             Extract scene 
             Analyze chunk 
             Save analysis results 
         Write results to file 
      CATCH any errors

If you refer back to the previous screenshot, the output—without any additional refinement—will look similar to the following image.

“Segment 103.136-126.026 seconds:
[{'text': 'The video shows a close-up of a coffee cup with steam rising from it, followed by three cups of coffee on a table with milk and sugar jars. A person then picks up a bunch of coffee beans from a plant.'}]
Segment 126.059-133.566 seconds:
[{'text': "The video starts with a person's hand, covered in dirt and holding a branch with green leaves and berries. The person then picks up some berries. The video then shows a man standing in a field with trees and plants. He is holding a bunch of red fruits in his right hand and looking at them. He is wearing a shirt and has a mustache. He seems to be picking the fruits. The fruits are probably coffee beans. The area is surrounded by green plants and trees."}]”

The following screenshot is an example is a more extensive look at the video_analysis.txt for the coffee.mp4 video:

Send the contents of the text file to Amazon Polly. Amazon Polly adds a voice to the text file, completing the workflow of the audio description solution.

FUNCTION generate_audio(text_file, output_audio_file):
     TRY:
        Read analysis text
        Set max_retries and backoff_time

        WHILE retry_count < max_retries:
           TRY:
              Initialize Polly client
              Convert text to speech
              Save audio file
              RETURN success
           CATCH throttling:
              Wait with exponential backoff
              retry_count += 1
           CATCH other errors:
              retry_count += 1
              Continue or Break based on error type
     CATCH any errors:
         RETURN error

For a list of different voices that you can use in Amazon Polly, see Available voices in the Amazon Polly Developer Guide.

Your final output with Polly should sound something like this:

Clean up

It’s a best practice to delete the resources you provisioned for this solution. If you used an EC2 or SageMaker Notebook Instance, stop or terminate it. Remember to delete unused files from your S3 bucket (eg: video_analysis.txt and video_analysis.mp3).

Conclusion

Recapping the solution at a high level, in this post, you used:

Amazon S3 to store the original video, intermediate data, and the final audio description artifacts
Amazon Rekognition to partition the video file into time-stamped scenes
Computer vision capabilities from Amazon Nova Pro (available through Amazon Bedrock) to analyze the contents of each scene

We showed you how to use Amazon Polly to create an MP3 audio file from the final scene description text file, which is what will be consumed by the audience members. The solution outlined in this post demonstrates how to fully automate the process of creating audio descriptions for video content to improve accessibility. By using Amazon Rekognition for video segmentation, the Amazon Nova Pro model for scene analysis, and Amazon Polly for text-to-speech, you can generate a comprehensive audio description track that narrates the key visual elements of a video. This end-to-end automation can significantly reduce the time and cost required to make video content accessible for visually impaired audiences, helping businesses and organizations meet their accessibility goals. With the power of AWS AI services, this solution provides a scalable and efficient way to improve accessibility and inclusion for video-based media.

This solution isn’t limited to using it for TV shows and movies. Any visual media that requires accessibility can be a candidate! For more information about the new Amazon Nova model family and the amazing things these models can do, see Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance.

In addition to the steps described in this post, additional actions you might need to take include:

Removing a video segment analysis’s introductory text from Amazon Nova. When Amazon Nova returns a response, it might begin with something like “In this video…” or something similar. You probably want just the video description itself without this introductory text. If there is introductory text in your scene descriptions, then Amazon Polly will speak it aloud and impact the quality of your audio transcriptions. You can account for this in a few ways.
- For example, prior to sending it to Amazon Polly, you can modify the generated scene descriptions by programmatically removing that type of text from them.
- Alternatively, you can use prompt engineering to request that Amazon Bedrock return only the scene descriptions in a structured format or without any additional commentary.
- The third option is to define and use a tool when performing inference on Amazon Bedrock. This can be a more comprehensive technique of defining the format of the output that you want Amazon Bedrock to return. Using tools to shape model output, is known as function calling. For more information, see Use a tool to complete an Amazon Bedrock model response.
You should also be mindful of the architectural components of the solution. In a production environment, being mindful of any potential scaling, security, and storage elements is important because the architecture might begin to resemble something more complex than the basic solution architecture diagram that this post began with.

About the Authors

Dylan Martin is an AWS Solutions Architect, working primarily in the generative AI space helping AWS Technical Field teams build AI/ML workloads on AWS. He brings his experience as both a security solutions architect and software engineer. Outside of work he enjoys motorcycling, the French Riviera and studying languages.

Ankit Patel is an AWS Solutions Developer, part of the Prototyping And Customer Engineering (PACE) team. Ankit helps customers bring their innovative ideas to life by rapid prototyping; using the AWS platform to build, orchestrate, and manage custom applications.

Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

June 13, 2025

by Kazuki Fujii Amazon AWS

This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.

The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

This post is organized as follows:

Overview of Llama 3.3 Swallow
Architecture for Llama 3.3 Swallow training
Software stack and optimizations employed in Llama 3.3 Swallow training
Experiment management

We discuss topics relevant to machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. We welcome readers who understand model parallelism and optimization techniques, especially those interested in continuous pre-training and supervised fine-tuning approaches.

Overview of the Llama 3.3 Swallow

Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta’s Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).

The model is available in two variants on Hugging Face:

Llama 3.3 Swallow 70B Base v0.4 – The pretrained base model, which serves as the foundation for Japanese language understanding
Llama 3.3 Swallow 70B Instruct v0.4 – The instruction-tuned model, optimized for dialogue and task completion

Both variants are accessible through the tokyotech-llm organization on Hugging Face, providing researchers and developers with flexible options for different application needs.

Training methodology

The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data primarily consisted of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for the base model training with approximately 314 billion tokens. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (H100, 80 GB, 256 GPUs) for continual pre-training with 16 days and 6 hours.

Training Data	Number of Training Tokens
Japanese Swallow Corpus v2	210 billion
Japanese Wikipedia	5.3 billion
English Wikipedia	6.9 billion
English Cosmopedia	19.5 billion
English DCLM baseline	12.8 billion
Laboro ParaCorpus	1.4 billion
Code Swallow-Code	50.2 billion
Math Finemath-4+	7.85 billion

For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created through supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the previous Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.

Training Data	Number of Training Samples
Gemma-2-LMSYS-Chat-1M-Synth	240,000
Swallow-Magpie-Ultra-v0.1	42,000
Swallow-Gemma-Magpie-v0.1	99,000
Swallow-Code-v0.3-Instruct-style	380,000

Performance and benchmarks

The base model has demonstrated remarkable performance in Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI’s GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model’s enhanced ability to understand and generate Japanese text. The following graph illustrates the base model performance comparison across these different benchmarks (original image).

The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics (original image).

Licensing and usage

The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.

Training infrastructure architecture

The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store lifecycle scripts, which are used to configure the SageMaker HyperPod cluster.

Before deploying this infrastructure, it’s essential to make sure the AWS account has the appropriate service quotas. The deployment of SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quota against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.

The following diagram illustrates the high-level architecture of the training infrastructure.

Compute and network configuration

The compute infrastructure is based on SageMaker HyperPod using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single spine configuration to provide minimal latency between instances. All communication between GPUs is handled through NCCL over an Elastic Fabric Adapter (EFA), providing high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.

Storage architecture

The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, crucial for handling the massive datasets required for the 70-billion-parameter model.

The following diagram illustrates the storage hierarchy implementation.

The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:


aws fsx create-data-repository-association 
    --file-system-id ${FSX_ID} 
    --file-system-path "/hsmtest" 
    --data-repository-path s3://${BUCKET_NAME_DATA} 
    --s3 AutoImportPolicy='{Events=[NEW,CHANGED,DELETED]}',AutoExportPolicy={Events=[NEW,CHANGED,DELETED]} 
    --batch-import-meta-data-on-create 
    --region ${AWS_REGION}

Observability stack

The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for any anomalies through Grafana Dashboards. The following screenshot shows an example of a GPU health dashboard.

Software stack and training optimizations

The training environment is built on SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu base Amazon Machine Image (AMI) with essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.

Distributed training implementation

The training implementation uses Megatron-LM’s advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, along with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.

Advanced parallelism and communication

The team used a comprehensive 4D parallelism strategy of Megatron-LM that maximizes GPU utilization through careful optimization of communication patterns across multiple dimensions: data, tensor, and pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension—together enabling efficient training of massive models.

The implementation overlaps communication across data parallelism, tensor parallelism, and pipeline parallelism domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training (original image).

Megatron-LM enables fine-grained communication overlapping through multiple configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.

Checkpointing strategy

The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by tensor and pipeline parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, enables the system to:

Save checkpoints up to 10 times faster compared to synchronous approaches
Minimize training interruption by offloading I/O operations
Scale checkpoint performance with the total number of GPUs
Maintain consistency through coordinated distributed saves

The checkpointing system automatically saves model states to the FSx Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3 storage.

For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).

Experiment management

In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all possible 4D parallelism configurations. Based on proven algorithmic research, this tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.

The following screenshot shows an example of the memory consumption prediction tool interface (original image).

Training pipeline management

The success of the training process heavily relied on maintaining high-quality data pipelines. The team implemented rigorous data curation processes and robust cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains.For experiment planning, version control was critical. The team first fixed the versions of pre-training libraries and instruction tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to provide reproducibility. After the versions were locked, the team conducted short-duration training runs to:

Measure throughput with different numbers of GPU nodes
Search for optimal configurations among distributed training settings identified by the memory prediction library
Establish accurate training time estimates for scheduling

The following screenshot shows an example experiment schedule showing GPU node allocation, expected training duration, and key milestones across different training phases (original image).

To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:

# Preload data to Lustre filesystem
find <data/path> -type f -print0 | xargs -0 -n 1 -P 8 sudo lfs 
hsm_restore

Monitoring and performance management

The team implemented a comprehensive monitoring system focused on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a set of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack (original image).

The monitoring infrastructure excels at identifying both job failures and performance bottlenecks like stragglers. The following figure presents an example of straggler detection showing training throughput degradation.

Conclusion

The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs using cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.

The project’s success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy implementation has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model.Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tools, so other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, along with continued enhancement of Japanese language capabilities. The project’s success also paves the way for expanded model applications across various domains.

Resources and references

This section provides key resources and references for understanding and replicating the work described in this paper. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.

Documentation

The following resources provide detailed information about the technologies and frameworks used in this project:

Model resources

For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:

About the Authors

Kazuki Fujii graduated with a bachelor’s degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master’s student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model’s performance and infrastructure efficiency.

Daisuke Miyamato is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He is mainly supporting HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.

Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development initiatives and advanced computing projects.

Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

June 12, 2025

by Yashesh Shroff Amazon AWS

This post was co-written with Renato Nascimento, Felipe Viana, Andre Von Zuben from Articul8.

Generative AI is reshaping industries, offering new efficiencies, automation, and innovation. However, generative AI requires powerful, scalable, and resilient infrastructures that optimize large-scale model training, providing rapid iteration and efficient compute utilization with purpose-built infrastructure and automated cluster management.

In this post, we share how Articul8 is accelerating their training and deployment of domain-specific models (DSMs) by using Amazon SageMaker HyperPod and achieving over 95% cluster utilization and a 35% improvement in productivity.

What is SageMaker HyperPod?

SageMaker HyperPod is an advanced distributed training solution designed to accelerate the development of scalable, reliable, and secure generative AI model development. Articul8 uses SageMaker HyperPod to efficiently train large language models (LLMs) on diverse, representative data and uses its observability and resiliency features to keep the training environment stable over the long duration of training jobs. SageMaker HyperPod provides the following features:

Fault-tolerant compute clusters with automated faulty node replacement during model training
Efficient cluster utilization through observability and performance monitoring
Seamless model experimentation with streamlined infrastructure orchestration using Slurm and Amazon Elastic Kubernetes Service (Amazon EKS)

Who is Articul8?

Articul8 was established to address the gaps in enterprise generative AI adoption by developing autonomous, production-ready products. For instance, they found that most general-purpose LLMs often fall short in delivering the accuracy, efficiency, and domain-specific knowledge needed for real-world business challenges. They are pioneering a set of DSMs that offer twofold better accuracy and completeness, compared to general-purpose models, at a fraction of the cost. (See their recent blog post for more details.)

The company’s proprietary ModelMesh technology serves as an autonomous layer that decides, selects, executes, and evaluates the right models at runtime. Think of it as a reasoning system that determines what to run, when to run it, and in what sequence, based on the task and context. It evaluates responses at every step to refine its decision-making, enabling more reliable and interpretable AI solutions while dramatically improving performance.

Articul8’s ModelMesh supports:

LLMs for general tasks
Domain-specific models optimized for industry-specific applications
Non-LLMs for specialized reasoning tasks or established domain-specific tasks (for example, scientific simulation)

Articul8’s domain-specific models are setting new industry standards across supply chain, energy, and semiconductor sectors. The A8-SupplyChain model, built for complex workflows, achieves 92% accuracy and threefold performance gains over general-purpose LLMs in sequential reasoning. In energy, A8-Energy models were developed with EPRI and NVIDIA as part of the Open Power AI Consortium, enabling advanced grid optimization, predictive maintenance, and equipment reliability. The A8-Semicon model has set a new benchmark, outperforming top open-source (DeepSeek-R1, Meta Llama 3.3/4, Qwen 2.5) and proprietary models (GPT-4o, Anthropic’s Claude) by twofold in Verilog code accuracy, all while running at 50–100 times smaller model sizes for real-time AI deployment.

Articul8 develops some of their domain-specific models using Meta’s Llama family as a flexible, open-weight foundation for expert-level reasoning. Through a rigorous fine-tuning pipeline with reasoning trajectories and curated benchmarks, general Llama models are transformed into domain specialists. To tailor models for areas like hardware description languages, Articul8 applies Reinforcement Learning with Verifiable Rewards (RLVR), using automated reward pipelines to specialize the model’s policy. In one case, a dataset of 50,000 documents was automatically processed into 1.2 million images, 360,000 tables, and 250,000 summaries, clustered into a knowledge graph of over 11 million entities. These structured insights fuel A8-DSMs across research, product design, development, and operations.

How SageMaker HyperPod accelerated the development of Articul8’s DSMs

Cost and time to train DSMs is critical for success for Articul8 in a rapidly evolving ecosystem. Training high-performance DSMs requires extensive experimentation, rapid iteration, and scalable compute infrastructure. With SageMaker HyperPod, Articul8 was able to:

Rapidly iterate on DSM training – SageMaker HyperPod resiliency features enabled Articul8 to train and fine-tune its DSMs in a fraction of the time required by traditional infrastructure
Optimize model training performance – By using the automated failure recovery feature in SageMaker HyperPod, Articul8 provided stable and resilient training processes
Reduce AI deployment time by four times and lower total cost of ownership by five times – The orchestration capabilities of SageMaker HyperPod alleviated the manual overhead of cluster management, allowing Articul8’s research teams to focus on model optimization rather than infrastructure upkeep

These advantages contributed to record-setting benchmark results by Articul8, proving that domain-specific models deliver superior real-world performance compared to general-purpose models.

Distributed training challenges and the role of SageMaker HyperPod

Distributed training across hundreds of nodes faces several critical challenges beyond basic resource constraints. Managing massive training clusters requires robust infrastructure orchestration and careful resource allocation for operational efficiency. SageMaker HyperPod offers both managed Slurm and Amazon EKS orchestration experience that streamlines cluster creation, infrastructure resilience, job submission, and observability. The following details focus on the Slurm implementation for reference:

Cluster setup – Although setting up a cluster is a one-time effort, the process is streamlined with a setup script that walks the administrator through each step of cluster creation. This post shows how this can be done in discrete steps.
Resiliency – Fault tolerance becomes paramount when operating at scale. SageMaker HyperPod handles node failures and network interruptions by replacing faulty nodes automatically. You can add the flag --auto-resume=1 with the Slurm srun command, and the distributed training job will recover from the last checkpoint.
Job submission – SageMaker HyperPod managed Slurm orchestration is a powerful way for data scientists to submit and manage distributed training jobs. Refer to the following example in the AWS-samples distributed training repo for reference. For instance, a distributed training job can be submitted with a Slurm sbatch command: sbatch 1.distributed-training-llama2.sbatch. You can use squeue and scancel to view and cancel jobs, respectively.
Observability – SageMaker HyperPod uses Amazon CloudWatch and open source managed Prometheus and Grafana services for monitoring and logging. Cluster administrators can view the health of the infrastructure (network, storage, compute) and utilization.

Solution overview

The SageMaker HyperPod platform enables Articul8 to efficiently manage high-performance compute clusters without requiring a dedicated infrastructure team. The service automatically monitors cluster health and replaces faulty nodes, making the deployment process frictionless for researchers.

To enhance their experimental capabilities, Articul8 integrated SageMaker HyperPod with Amazon Managed Grafana, providing real-time observability of GPU resources through a single-pane-of-glass dashboard. They also used SageMaker HyperPod lifecycle scripts to customize their cluster environment and install required libraries and packages. This comprehensive setup empowers Articul8 to conduct rapid experimentation while maintaining high performance and reliability—they reduced their customers’ AI deployment time by four times and lowered their total cost of ownership by five times.

The following diagram illustrates the observability architecture.

The platform’s efficiency in managing computational resources with minimum downtime has been particularly valuable for Articul8’s research and development efforts, empowering them to quickly iterate on their generative AI solutions while maintaining enterprise-grade performance standards. The following sections describe the setup and results in detail.

For the setup for this post, we begin with the AWS published workshop for SageMaker HyperPod, and adjust it to suit our workload.

Prerequisites

The following two AWS CloudFormation templates address the prerequisites of the solution setup.

For SageMaker HyperPod

This CloudFormation stack addresses the prerequisites for SageMaker HyperPod:

VPC and two subnets – A public subnet and a private subnet are created in an Availability Zone (provided as a parameter). The virtual private cloud (VPC) contains two CIDR blocks with 10.0.0.0/16 (for the public subnet) and 10.1.0.0/16 (for the private subnet). An internet gateway and NAT gateway are deployed in the public subnet.
Amazon FSx for Lustre file system – An Amazon FSx for Lustre volume is created in the specified Availability Zone, with a default of 1.2 TB storage, which can be overridden by a parameter. For this case study, we increased the storage size to 7.2 TB.
Amazon S3 bucket – The stack deploys endpoints for Amazon Simple Storage Service (Amazon S3) to store lifecycle scripts.
IAM role – An AWS Identity and Access Management (IAM) role is also created to help execute SageMaker HyperPod cluster operations.
Security group – The script creates a security group to enable EFA communication for multi-node parallel batch jobs.

For cluster observability

To get visibility into cluster operations and make sure workloads are running as expected, an optional CloudFormation stack has been used for this case study. This stack includes:

Node exporter – Supports visualization of CPU load averages, memory and disk usage, network traffic, file system, and disk I/O metrics
NVIDIA DCGM – Supports visualization of GPU utilization, temperatures, power usage, and memory usage
EFA metrics – Supports visualization of EFA network and error metrics, EFA RDMA performance, and so on.
FSx for Lustre – Supports visualization of file system read/write operations, free capacity, and metadata operations

Observability can be configured through YAML scripts to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are also set up on the cluster nodes.

Using Amazon Managed Grafana with SageMaker HyperPod helps you create dashboards to monitor GPU clusters and make sure they operate efficiently with minimum downtime. In addition, dashboards have become a critical tool to give you a holistic view of how specialized workloads consume different resources of the cluster, helping developers optimize their implementation.

Cluster setup

The cluster is set up with the following components (results might vary based on customer use case and deployment setup):

Head node and compute nodes – For this case study, we use a head node and SageMaker HyperPod compute nodes. The head node has an ml.m5.12xlarge instance, and the compute queue consists of ml.p4de.24xlarge instances.
Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
Local storage – Each node has 8 TB local NVME volume attached for local storage.
Scheduler – Slurm is used as an orchestrator. Slurm is an open source and highly scalable cluster management tool and job scheduling system for high-performance computing (HPC) clusters.
Accounting – As part of cluster configuration, a local MariaDB is deployed that keeps track of job runtime information.

Results

During this project, Articul8 was able to confirm the expected performance of A100 with the added benefit of creating a cluster using Slurm and providing observability metrics to monitor the health of various components (storage, GPU nodes, fiber). The primary validation was on the ease of use and rapid ramp-up of data science experiments. Furthermore, they were able to demonstrate near linear scaling with distributed training, achieving a 3.78 times reduction in time to train for Meta Llama-2 13B with 4x nodes. Having the flexibility to run multiple experiments, without losing development time from infrastructure overhead was an important accomplishment for the Articul8 data science team.

Clean up

If you run the cluster as part of the workshop, you can follow the cleanup steps to delete the CloudFormation resources after deleting the cluster.

Conclusion

This post demonstrated how Articul8 AI used SageMaker HyperPod to overcome the scalability and efficiency challenges of training multiple high-performing DSMs across key industries. By alleviating infrastructure complexity, SageMaker HyperPod empowered Articul8 to focus on building AI systems with measurable business outcomes. From semiconductor and energy to supply chain, Articul8’s DSMs are proving that the future of enterprise AI is not general—it’s purpose-built. Key takeaways include:

DSMs significantly outperform general-purpose LLMs in critical domains
SageMaker HyperPod accelerated the development of Articul8’s A8-Semicon, A8-SupplyChain, and Energy DSM models
Articul8 reduced AI deployment time by four times and lowered total cost of ownership by five times using the scalable, automated training infrastructure of SageMaker HyperPod

Learn more about SageMaker HyperPod by following this workshop. Reach out to your account team on how you can use this service to accelerate your own training workloads.

About the Authors

Yashesh A. Shroff, PhD. is a Sr. GTM Specialist in the GenAI Frameworks organization, responsible for scaling customer foundational model training and inference on AWS using self-managed or specialized services to meet cost and performance requirements. He holds a PhD in Computer Science from UC Berkeley and an MBA from Columbia Graduate School of Business.

Amit Bhatnagar is a Sr Technical Account Manager with AWS, in the Enterprise Support organization, with a focus on generative AI startups. He is responsible for helping key AWS customers with their strategic initiatives and operational excellence in the cloud. When he is not chasing technology, Amit loves to cook vegan delicacies and hit the road with his family to chase the horizon.

Renato Nascimento is the Head of Technology at Articul8, where he leads the development and execution of the company’s technology strategy. With a focus on innovation and scalability, he ensures the seamless integration of cutting-edge solutions into Articul8’s products, enabling industry-leading performance and enterprise adoption.

Felipe Viana is the Head of Applied Research at Articul8, where he leads the design, development, and deployment of innovative generative AI technologies, including domain-specific models, new model architectures, and multi-agent autonomous systems.

Andre Von Zuben is the Head of Architecture at Articul8, where he is responsible for designing and implementing scalable generative AI platform elements, novel generative AI model architectures, and distributed model training and deployment pipelines.

How VideoAmp uses Amazon Bedrock to power their media analytics interface

June 12, 2025

by Suzanne Willard, Makoto Uchida Amazon AWS

This post was co-written with Suzanne Willard and Makoto Uchida from VideoAmp.

In this post, we illustrate how VideoAmp, a media measurement company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to develop a prototype of the VideoAmp Natural Language (NL) Analytics Chatbot to uncover meaningful insights at scale within media analytics data using Amazon Bedrock. The AI-powered analytics solution involved the following components:

A natural language to SQL pipeline, with a conversational interface, that works with complex queries and media analytics data from VideoAmp
An automated testing and evaluation tool for the pipeline

VideoAmp background

VideoAmp is a tech-first measurement company that empowers media agencies, brands, and publishers to precisely measure and optimize TV, streaming, and digital media. With a comprehensive suite of measurement, planning, and optimization solutions, VideoAmp offers clients a clear, actionable view of audiences and attribution across environments, enabling them to make smarter media decisions that help them drive better business outcomes. VideoAmp has seen incredible adoption for its measurement and currency solutions with 880% YoY growth, 98% coverage of the TV publisher landscape, 11 agency groups, and more than 1,000 advertisers. VideoAmp is headquartered in Los Angeles and New York with offices across the United States. To learn more, visit www.videoamp.com.

VideoAmp’s AI journey

VideoAmp has embraced AI to enhance its measurement and optimization capabilities. The company has integrated machine learning (ML) algorithms into its infrastructure to analyze vast amounts of viewership data across traditional TV, streaming, and digital services. This AI-driven approach allows VideoAmp to provide more accurate audience insights, improve cross-environment measurement, and optimize advertising campaigns in real time. By using AI, VideoAmp has been able to offer advertisers and media owners more precise targeting, better attribution models, and increased return on investment for their advertising spend. The company’s AI journey has positioned it as a leader in the evolving landscape of data-driven advertising and media measurement.

To take their innovations a step further, VideoAmp is building a brand-new analytics solution powered by generative AI, which will provide their customers with accessible business insights. Their goal for a beta product is to create a conversational AI assistant powered by large language models (LLMs) that allows VideoAmp’s data analysts and non-technical users such as content researchers and publishers to perform data analytics using natural language queries.

Use case overview

VideoAmp is undergoing a transformative journey by integrating generative AI into its analytics. The company aims to revolutionize how customers, including publishers, media agencies, and brands, interact with and derive insights from VideoAmp’s vast repository of data through a conversational AI assistant interface.

Presently, analysis by data scientists and analysts is done manually, requires technical SQL knowledge, and can be time-consuming for complex and high-dimensional datasets. Acknowledging the necessity for streamlined and accessible processes, VideoAmp worked with the GenAIIC to develop an AI assistant capable of comprehending natural language queries, generating and executing SQL queries on VideoAmp’s data warehouse, and delivering natural language summaries of retrieved information. The assistant allows non-technical users to surface data-driven insights, and it reduces research and analysis time for both technical and non-technical users.

Key success criteria for the project included:

The ability to convert natural language questions into SQL statements, connect to VideoAmp’s provided database, execute statements on VideoAmp performance metrics data, and create a natural language summary of results
A UI to ask natural language questions and view assistant output, which includes generated SQL queries, reasoning for the SQL statements, retrieved data, and natural language data summaries
Conversational support for the user to iteratively refine and filter asked questions
Low latency and cost-effectiveness
An automated evaluation pipeline to assess the quality and accuracy of the assistant

The team overcame a few challenges during the development process:

Adapting LLMs to understand the domain aspects of VideoAmp’s dataset – The dataset included highly industry-specific fields and metrics, and required complex queries to effectively filter and analyze. The queries often involved multiple specialized metric calculations, filters selecting from over 30 values, and extensive grouping and ordering.
Developing an automated evaluation pipeline – The pipeline is able to correctly identify if generated outputs are equivalent to ground truth data, even if they have different column aliasing, ordering, and metric calculations.

Solution overview

The GenAIIC team worked with VideoAmp to create an AI assistant that used Anthropic’s Claude 3 LLMs through Amazon Bedrock. Amazon Bedrock was chosen for this project because it provides access to high-quality foundation models (FMs), including Anthropic’s Claude 3 series, through a unified API. This allowed the team to quickly integrate the most suitable models for different components of the solution, such as SQL generation and data summarization.

Additional features in Amazon Bedrock, including Amazon Bedrock Prompt Management, native support for Retrieval Augmented Generation (RAG) and structured data retrieval through Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and fine-tuning, enable VideoAmp to quickly expand the analytics solution and take it to production. Amazon Bedrock also offers robust security and adheres to compliance certifications, allowing VideoAmp to confidently expand their AI analytics solution while maintaining data privacy and adhering to industry standards.

The solution is connected to a data warehouse. It supports a variety of database connections, such as Snowflake, SingleStore, PostgreSQL, Excel and CSV files, and more. The following diagram illustrates the high-level workflow of the solution.

The workflow consists of the following steps:

The user navigates to the frontend application and asks a question in natural language.
A Question Rewriter LLM component uses previous conversational context to augment the question with additional details if applicable. This allows follow-up questions and refinements to previous questions.
A Text-to-SQL LLM component creates a SQL query that corresponds to the user question.
The SQL query is executed in the data warehouse.
A Data-to-Text LLM component summarizes the retrieved data for the user.

The rewritten question, generated SQL, reasoning, and retrieved data are returned at each step.

AI assistant workflow details

In this section, we discuss the components of the AI assistant workflow in more detail.

Rewriter

After the user asks the question, the current question and the previous questions the user asked in the current session are sent to the Question Rewriter component, which uses Anthropic’s Claude 3 Sonnet model. If deemed necessary, the LLM uses context from the previous questions to augment the current user question to make it a standalone question with context included. This enables multi-turn conversational support for the user, allowing for natural interactions with the assistant.

For example, if a user first asked, “For the week of 09/04/2023 – 09/10/2023, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”, followed by, “Can I have the same data for one year later”, the rewriter would rewrite the latter question as “For the week of 09/03/2024 – 09/09/2024, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”

Text-to-SQL

The rewritten user question is sent to the Text-to-SQL component, which also uses Anthropic’s Claude 3 Sonnet model. The Text-to-SQL component uses information about the database in its prompt to generate a SQL query corresponding to the user question. It also generates an explanation of the query.

The text-to-SQL prompt addressed several challenges, such as industry-specific language in user questions, complex metrics, and several rules and defaults for filtering. The prompt was developed through several iterations, based on feedback and guidance from the VideoAmp team, and manual and automated evaluation.

The prompt consisted of four overarching sections: context, SQL instructions, task, and examples. During the development phase, database schema and domain- or task-specific knowledge were found to be critical, so one major part of the prompt was designed to incorporate them in the context. To make this solution reusable and scalable, a modularized design of the prompt/input system is employed, making it generic so it can be applied to other use cases and domains. The solution can support Q&A with multiple databases by dynamically switching/changing the corresponding context with an orchestrator if needed.

The context section contains the following details:

Database schema
Sample categories for relevant data fields such as television networks to aid the LLM in understanding what fields to use for identifiers in the question
Industry term definitions
How to calculate different types of metrics or aggregations
Default values or fields should be selected if not specified
Other domain- or task-specific knowledge

The SQL instructions contain the following details:

Dynamic insertion of today’s date as a reference for terms, such as “last 3 quarters”
Instructions on usage of sub-queries
Instructions on when to retrieve additional informational columns not specified in the user question
Known SQL syntax and database errors to avoid and potential fixes

In the task section, the LLM is given a detailed step-by-step process to formulate SQL queries based on the context. A step-by-step process is required for the LLM to correctly think through and assimilate the required context and rules. Without the step-by-step process, the team found that the LLM wouldn’t adhere to all instructions provided in the previous sections.

In the examples section, the LLM is given several examples of user questions, corresponding SQL statements, and explanations.

In addition to iterating on the prompt content, different content organization patterns were tested due to long context. The final prompt was organized with markdown and XML.

SQL execution

After the Text-to-SQL component outputs a query, the query is executed against VideoAmp’s data warehouse using database connector code. For this use case, only read queries for analytics are executed to protect the database from unexpected operations like updates or deletes. The credentials for the database are securely stored and accessed using AWS Secrets Manager and AWS Key Management Service (AWS KMS).

Data-to-Text

The data retrieved by the SQL query is sent to the Data-to-Text component, along with the rewritten user question. The Data-to-Text component, which uses Anthropic’s Claude 3 Haiku model, produces a concise summary of the retrieved data and answers the user question.

The final outputs are displayed on the frontend application as shown in the following screenshots (protected data is hidden).

Evaluation framework workflow details

The GenAIIC team developed a sophisticated automated evaluation pipeline for VideoAmp’s NL Analytics Chatbot, which directly informed prompt optimization and solution improvements and was a critical component in providing high-quality results.

The evaluation framework comprises of two categories:

SQL query evaluation – Generated SQL queries are evaluated for overall closeness to the ground truth SQL query. A key feature of the SQL evaluation component was the ability to account for column aliasing and ordering differences when comparing statements and determine equivalency.
Retrieved data evaluation – The retrieved data is compared to ground truth data to determine an exact match, after a few processing steps to account for column, formatting, and system differences.

The evaluation pipeline also produces detailed reports of the results and discrepancies between generated data and ground truth data.

Dataset

The dataset used for the prototype solution was hosted in a data warehouse and consisted of performance metrics data such as viewership, ratings, and rankings for television networks and programs. The field names were industry-specific, so a data dictionary was included in the text-to-SQL prompt as part of the schema. The credentials for the database are securely stored and accessed using Secrets Manager and AWS KMS.

Results

A set of test questions were evaluated by the GenAIIC and VideoAmp teams, focusing on three metrics:

Accuracy – Different accuracy metrics were analyzed, but exact matches between retrieved data and ground truth data were prioritized
Latency – LLM generation latency, excluding the time taken to query the database
Cost – Average cost per user question

Both the evaluation pipeline and human review reported high accuracies on the dataset, whereas costs and latencies remained low. Overall, the results were well-aligned with VideoAmp expectations. VideoAmp anticipates this solution will make it simple for users to handle complex data queries with confidence through intuitive natural language interactions, reducing the time to business insights.

Conclusion

In this post, we shared how the GenAIIC team worked with VideoAmp to build a prototype of the VideoAmp NL Analytics Chatbot, an end-to-end generative AI data analytics interface using Amazon Bedrock and Anthropic’s Claude 3 LLMs. The solution is equipped with a variety of state-of-the-art LLM-based techniques, such as question rewriting, text-to-SQL query generation, and summarization of data in natural language. It also includes an automated evaluation module for evaluating the correctness of generated SQL statements and retrieved data. The solution achieved high accuracy on VideoAmp’s evaluation samples. Users can interact with the solution through an intuitive AI assistant interface with conversational capabilities.

VideoAmp will soon be launching their new generative AI-powered analytics interface, which enables customers to analyze data and gain business insights through natural language conversation. Their successful work with the GenAIIC team will allow VideoAmp to use generative AI technology to swiftly deliver valuable insights for both technical and non-technical customers.

This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases. The GenAIIC is a group of science and strategy experts with comprehensive expertise spanning the generative AI journey, helping you prioritize use cases, build a roadmap, and move solutions into production. If you’re interested in working with the GenAIIC, reach out to them today.

About the authors

Suzanne Willard is the VP of Engineering at VideoAmp where she founded and leads the GenAI program, establishing the strategic vision and execution roadmap. With over 20 years experience she is driving innovation in AI technologies, creating transformative solutions that align with business objectives and set the company apart in the market.

Makoto Uchida is a senior architect at VideoAmp in the AI domain, acting as area technical lead of AI portfolio, responsible for defining and driving AI product and technical strategy in the content and ads measurement platform PaaS product. Previously, he was a software engineering lead in generative and predictive AI Platform at a major hyperscaler public Cloud service. He has also engaged with multiple startups, laying the foundation of Data/ML/AI infrastructures.

Shreya Mohanty is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.

Long Chen is a Sr. Applied Scientist at AWS Generative AI Innovation Center. He holds a Ph.D. in Applied Physics from University of Michigan – Ann Arbor. With more than a decade of experience for research and development, he works on innovative solutions in various domains using generative AI and other machine learning techniques, ensuring the success of AWS customers. His interest includes generative models, multi-modal systems and graph learning.

Amaran Asokkumar is a Deep Learning Architect at AWS, specializing in infrastructure, automation, and AI. He leads the design of GenAI-enabled solutions across industry segments. Amaran is passionate about all things AI and helping customers accelerate their GenAI exploration and transformation efforts.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Scaling up image segmentation across data and tasks

June 12, 2025

by Pei Wang Amazon AWS

Scaling up image segmentation across data and tasks

Novel architecture that fuses learnable queries and conditional queries improves a segmentation models ability to transfer across tasks.

Computer vision

Pei Wang

Stefano Soatto

June 12, 12:25 PMJune 12, 12:25 PM

The first draft of this blog post was generated by Amazon Nova Pro, based on detailed instructions from Amazon Science editors and multiple examples of prior posts.

In a paper we’re presenting at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR), we introduce a new approach to image segmentation that scales across diverse datasets and tasks. Traditional segmentation models, while effective on isolated tasks, often struggle as the number of new tasks or unfamiliar scenarios grows. Our proposed method, which uses a model we call a mixed-query transformer (MQ-former), aims to enable joint training and evaluation across multiple tasks and datasets.

Scalable segmentation

Image segmentation is a computer vision task that involves partitioning an image into distinct regions or segments. Each segment corresponds to a different object or part of the scene. There are several types of segmentation tasks, including foreground/background segmentation (distinguishing objects at different distances), semantic segmentation (labeling each pixel as belonging to a particular object class), and instance segmentation (identifying each pixel as belonging to a particular instance of an object class).

An example of instance segmentation, given the text prompt “cardinal”.

Scalability means that a segmentation model can effectively improve with an increase in the size of its training dataset, in the diversity of the tasks it performs, or both. Most prior research has focused on one or the other data or task diversity. We address both at once.

A tale of two queries

In our paper, we show that one issue preventing effective scalability in segmentation models is the design of object queries. An object query is a way of representing a hypothesis about objects in a scene a hypothesis that can be tested against images.

There are two main types of object queries. The first, which we refer to as learnable queries, are learned vectors that interact with image features and encode information about location and object class. Learnable queries tend to perform well on semantic segmentation as the they do not contain object-specific priors.

The second type of object query, which we refer to as a conditional query, is akin to two-stage object detection: region proposals are generated by a Transformer encoder, and then high-confidence proposals are fed into the Transformer decoder as queries to generate the final prediction. Conditional queries are closely aligned with the object classes and excel at object detection and instance segmentation on semantically well-defined objects.

Our approach is to combine both types of queries, which improves the models ability to transfer across tasks. Our MQ-Former model represents inputs using both learnable queries and conditional queries, and every layer of the decoder has a cross-attention mechanism, so that the processing of the learnable queries can factor in information from the conditional-query processing, and vice versa.

Architectural schematics for learnable queries, conditional queries, and mixed queries. Solid triangles represent instance segmentation ground truth, and striped triangles represent semantic-segmentation ground truth.

Leveraging synthetic data

Mixed queries aid scalability across segmentation tasks, but the other aspect of scalability in segmentation models is dataset size. One of the key challenges in scaling up segmentation models is the scarcity of high-quality, annotated data. To overcome this limitation, we propose leveraging synthetic data.

Examples of synthetic data. At left are two examples of synthetic masks, at right two examples of synthetic captions.

While segmentation data is scarce, object recognition data is plentiful. Object recognition datasets typically include bounding boxes, or rectangles that identify the image regions in which labeled objects can be found.

Asking a trained segmentation model to segment only the object within a bounding box significantly improves performance; we are thus able to use weaker segmentation models to convert object recognition datasets into segmentation datasets that can be used to train stronger segmentation models.

Bounding boxes can also focus automatic captioning models on regions of interest in an image, to provide the type of object classifications necessary to train semantic-segmentation and instance segmentation models.

Experimental results

We evaluated our approach using 15 datasets covering a range of segmentation tasks and found that, with MQ-Former, scaling up both the volume of training data and the diversity of tasks consistently enhances the models segmentation capabilities.

For example, on the SeginW benchmark, which includes 25 datasets used for open-vocabulary in-the-wild segmentation evaluation, scaling the data and tasks from 100,000 samples to 600,000 boosted performance 16%, as measured by average precision of object masking. Incorporating synthetic data improved performance by another 14%, establishing a new state of the art.

Research areas: Computer vision

Tags: Image segmentation, Data representation

Adobe enhances developer productivity using Amazon Bedrock Knowledge Bases

June 11, 2025

by Kamran Razi Amazon AWS

Adobe Inc. excels in providing a comprehensive suite of creative tools that empower artists, designers, and developers across various digital disciplines. Their product landscape is the backbone of countless creative projects worldwide, ranging from web design and photo editing to vector graphics and video production.

Adobe’s internal developers use a vast array of wiki pages, software guidelines, and troubleshooting guides. Recognizing the challenge developers faced in efficiently finding the right information for troubleshooting, software upgrades, and more, Adobe’s Developer Platform team sought to build a centralized system. This led to the initiative Unified Support, designed to help thousands of the company’s internal developers get immediate answers to questions from a centralized place and reduce time and cost spent on developer support. For instance, a developer setting up a continuous integration and delivery (CI/CD) pipeline in a new AWS Region or running a pipeline on a dev branch can quickly access Adobe-specific guidelines and best practices through this centralized system.

The initial prototype for Adobe’s Unified Support provided valuable insights and confirmed the potential of the approach. This early phase highlighted key areas requiring further development to operate effectively at Adobe’s scale, including addressing scalability needs, simplifying resource onboarding, improving content synchronization mechanisms, and optimizing infrastructure efficiency. Building on these learnings, improving retrieval precision emerged as the next critical step.

To address these challenges, Adobe partnered with the AWS Generative AI Innovation Center, using Amazon Bedrock Knowledge Bases and the Vector Engine for Amazon OpenSearch Serverless. This solution dramatically improved their developer support system, resulting in a 20% increase in retrieval accuracy. Metadata filtering empowers developers to fine-tune their search, helping them surface more relevant answers across complex, multi-domain knowledge sources. This improvement not only enhanced the developer experience but also contributed to reduced support costs.

In this post, we discuss the details of this solution and how Adobe enhances their developer productivity.

Solution overview

Our project aimed to address two key objectives:

Document retrieval engine enhancement – We developed a robust system to improve search result accuracy for Adobe developers. This involved creating a pipeline for data ingestion, preprocessing, metadata extraction, and indexing in a vector database. We evaluated retrieval performance against Adobe’s ground truth data to produce high-quality, domain-specific results.
Scalable, automated deployment – To support Unified Support across Adobe, we designed a reusable blueprint for deployment. This solution accommodates large-scale data ingestion of various types and offers flexible configurations, including embedding model selection and chunk size adjustment.

Using Amazon Bedrock Knowledge Bases, we created a customized, fully managed solution that improved the retrieval effectiveness. Key achievements include a 20% increase in accuracy metrics for document retrieval, seamless document ingestion and change synchronization, and enhanced scalability to support thousands of Adobe developers. This solution provides a foundation for improved developer support and scalable deployment across Adobe’s teams. The following diagram illustrates the solution architecture.

Let’s take a closer look at our solution:

Amazon Bedrock Knowledge Bases index – The backbone of our system is Amazon Bedrock Knowledge Bases. Data is indexed through the following stages:
- Data ingestion – We start by pulling data from Amazon Simple Storage Service (Amazon S3) buckets. This could be anything from resolutions to past issues or wiki pages.
- Chunking – Amazon Bedrock Knowledge Bases breaks data down into smaller pieces, or chunks, defining the specific units of information that can be retrieved. This chunking process is configurable, allowing for optimization based on the specific needs of the business.
- Vectorization – Each chunk is passed through an embedding model (in this case, Amazon Titan V2 on Amazon Bedrock) creating a 1,024-dimension numerical vector. This vector represents the semantic meaning of the chunk, allowing for similarity searches
- Storage – These vectors are stored in the Amazon OpenSearch Serverless vector database, creating a searchable repository of information.
Runtime – When a user poses a question, our system competes the following steps:
- Query vectorization – With the Amazon Bedrock Knowledge Bases Retrieve API, the user’s question is automatically embedded using the same embedding model used for the chunks during data ingestion.
- Similarity search and retrieval – The system retrieves the most relevant chunks in the vector database based on similarity scores to the query.
- Ranking and presentation – The corresponding documents are ranked based on the sematic similarity of their modest relevant chunks to the query, and the top-ranked information is presented to the user.

Multi-tenancy through metadata filtering

As developers, we often find ourselves seeking help across various domains. Whether it’s tackling CI/CD issues, setting up project environments, or adopting new libraries, the landscape of developer challenges is vast and varied. Sometimes, our questions even span multiple domains, making it crucial to have a system for retrieving relevant information. Metadata filtering empowers developers to retrieve not just semantically relevant information, but a well-defined subset of that information based on specific criteria. This powerful tool enables you to apply filters to your retrievals, helping developers narrow the search results to a limited set of documents based on the filter, thereby improving the relevancy of the search.

To use this feature, metadata files are provided alongside the source data files in an S3 bucket. To enable metadata-based filtering, each source data file needs to be accompanied by a corresponding metadata file. These metadata files used the same base name as the source file, with a .metadata.json suffix. Each metadata file included relevant attributes—such as domain, year, or type—to support multi-tenancy and fine-grained filtering in OpenSearch Service. The following code shows what an example metadata file looks like:

{
  "metadataAttributes": 
      {
        "domain": "project A",
        "year": 2016,
        "type": "wiki"
       }
 }

Retrieve API

The Retrieve API allows querying a knowledge base to retrieve relevant information. You can use it as follows:

Send a POST request to /knowledgebases/knowledgeBaseId/retrieve.
Include a JSON body with the following:
1. retrievalQuery – Contains the text query.
2. retrievalConfiguration – Specifies search parameters, such as number of results and filters.
3. nextToken – For pagination (optional).

The following is an example request syntax:

POST /knowledgebases/knowledgeBaseId/retrieve HTTP/1.1
Content-type: application/json
{
   "nextToken": "string",
   "retrievalConfiguration": { 
      "vectorSearchConfiguration": { 
         "filter": { ... },
         "numberOfResults": number,
         "overrideSearchType": "string"
      }
   },
   "retrievalQuery": { 
      "text": "string"
   }
}

Additionally, you can set up the retriever with ease using the langchain-aws package:

from langchain_aws import AmazonKnowledgeBasesRetriever
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR-ID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)
retriever.get_relevant_documents(query="What is the meaning of life?")

This approach enables semantic querying of the knowledge base to retrieve relevant documents based on the provided query, simplifying the implementation of search.

Experimentation

To deliver the most accurate and efficient knowledge retrieval system, the Adobe and AWS teams put the solution to the test. The team conducted a series of rigorous experiments to fine-tune the system and find the optimal settings.

Before we dive into our findings, let’s discuss the metrics and evaluation process we used to measure success. We used the open source model evaluation framework Ragas to evaluate the retrieval system across two metrics: document relevance and mean reciprocal rank (MRR). Although Ragas comes with many metrics for evaluating model performance out of the box, we needed to implement these metrics by extending the Ragas framework with custom code.

Document relevance – Document relevance offers a qualitative approach to assessing retrieval accuracy. This metric uses a large language model (LLM) as an impartial judge to compare retrieved chunks against user queries. It evaluates how effectively the retrieved information addresses the developer’s question, providing a score between 1–10.
Mean reciprocal rank – On the quantitative side, we have the MRR metric. MRR evaluates how well a system ranks the first relevant item for a query. For each query, find the rank k of the highest-ranked relevant document. The score for that query is 1/k. MRR is the average of these 1/k scores over the entire set of queries. A higher score (closer to 1) signifies that the first relevant result is typically ranked high.

These metrics provide complementary insights: document relevance offers a content-based assessment, and MRR provides a ranking-based evaluation. Together, they offer a comprehensive view of the retrieval system’s effectiveness in finding and prioritizing relevant information.In our recent experiments, we explored various data chunking strategies to optimize the performance of retrieval. We tested several approaches, including fixed-size chunking as well as more advanced semantic chunking and hierarchical chunking.Semantic chunking focuses on preserving the contextual relationships within the data by segmenting it based on semantic meaning. This approach aims to improve the relevance and coherence of retrieved results.Hierarchical chunking organizes data into a hierarchical parent-child structure, allowing for more granular and efficient retrieval based on the inherent relationships within your data.

For more information on how to set up different chunking strategies, refer to Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.

We tested the following chunking methods with Amazon Bedrock Knowledge Bases:

Fixed-size short chunking – 400-token chunks with a 20% overlap (shown as the blue variant in the following figure)
Fixed-size long chunking – 1,000-token chunks with a 20% overlap
Hierarchical chunking – Parent chunks of 1,500 tokens and child chunks of 300 tokens, with a 60-token overlap
Semantic chunking – 400-token chunks with a 95% similarity percentile threshold

For reference, a paragraph of approximately 1,000 characters typically translates to around 200 tokens. To assess performance, we measured document relevance and MRR across different context sizes, ranging from 1–5. This comparison aims to provide insights into the most effective chunking strategy for organizing and retrieving information for this use case.The following figures illustrate the MRR and document relevance metrics, respectively.

As a result of these experiments, we found that MRR is a more sensitive metric for evaluating the impact of chunking strategies, particularly when varying the number of retrieved chunks (top-k from 1 to 5). Among the approaches tested, the fixed-size 400-token strategy—shown in blue—proved to be the simplest and most effective, consistently yielding the highest accuracy across different retrieval sizes.

Conclusion

In the journey to design Adobe’s developer Unified Support search and retrieval system, we’ve successfully harnessed the power of Amazon Bedrock Knowledge Bases to create a robust, scalable, and efficient solution. By configuring fixed-size chunking and using the Amazon Titan V2 embedding model, we achieved a remarkable 20% increase in accuracy metrics for document retrieval compared to Adobe’s existing solution, by running evaluations on the customer’s testing system and provided dataset.The integration of metadata filtering emerged as a game changing feature, allowing for seamless navigation across diverse domains and enabling customized retrieval. This capability proved invaluable for Adobe, given the complexity and breadth of their information landscape. Our comprehensive comparison of retrieval accuracy for different configurations of the Amazon Bedrock Knowledge Bases index has yielded valuable insights. The metrics we developed provide an objective framework for assessing the quality of retrieved context, which is crucial for applications demanding high-precision information retrieval. As we look to the future, this customized, fully managed solution lays a solid foundation for continuous improvement in developer support at Adobe, offering enhanced scalability and seamless support infrastructure in tandem with evolving developer needs.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center. To learn more about Amazon Bedrock Knowledge Bases, see Retrieve data and generate AI responses with knowledge bases.

About the Authors

Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. With over a decade of experience in software development, he specializes in building AI-driven solutions, including AI agents. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Nay Doummar is an Engineering Manager on the Unified Support team at Adobe, where she’s been since 2012. Over the years, she has contributed to projects in infrastructure, CI/CD, identity management, containers, and AI. She started on the CloudOps team, which was responsible for migrating Adobe’s infrastructure to the AWS Cloud, marking the beginning of her long-term collaboration with AWS. In 2020, she helped build a support chatbot to simplify infrastructure-related assistance, sparking her passion for user support. In 2024, she joined a project to Unify Support for the Developer Platform, aiming to streamline support and boost productivity.

Varsha Chandan Bellara is a Software Development Engineer at Adobe, specializing in AI-driven solutions to boost developer productivity. She leads the development of an AI assistant for the Unified Support initiative, using Amazon Bedrock, implementing RAG to provide accurate, context-aware responses for technical support and issue resolution. With expertise in cloud-based technologies, Varsha combines her passion for containers and serverless architectures with advanced AI to create scalable, efficient solutions that streamline developer workflows.

Jan Michael Ong is a Senior Software Engineer at Adobe, where he supports the developer community and engineering teams through tooling and automation. Currently, he is part of the Developer Experience team at Adobe, working on AI projects and automation contributing to Adobe’s internal Developer Platform.

Justin Johns is a Deep Learning Architect at Amazon Web Services who is passionate about innovating with generative AI and delivering cutting-edge solutions for customers. With over 5 years of software development experience, he specializes in building cloud-based solutions powered by generative AI.

Gaurav Dhamija is a Principal Solutions Architect at Amazon Web Services, where he helps customers design and build scalable, reliable, and secure applications on AWS. He is passionate about developer experience, containers, and serverless technologies, and works closely with engineering teams to modernize application architectures. Gaurav also specializes in generative AI, using AWS generative AI services to drive innovation and enhance productivity across a wide range of use cases.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Amazon Nova Lite enables Bito to offer a free tier option for its AI-powered code reviews

June 11, 2025

by Eshan Bhatnagar Amazon AWS

This post is co-written by Amar Goel, co-founder and CEO of Bito.

Meticulous code review is a critical step in the software development process, one that helps delivery high-quality code that’s ready for enterprise use. However, it can be a time-consuming process at scale, when experts must review thousands of lines of code, looking for bugs or other issues. Traditional reviews can be subjective and inconsistent, because humans are human. As generative AI becomes more and more integrated into the software development process, a significant number of new AI-powered code review tools have entered the marketplace, helping software development groups raise quality and ship clean code faster—while boosting productivity and job satisfaction among developers.

Bito is an innovative startup that creates AI agents for a broad range of software developers. It emerged as a pioneer in its field in 2023, when it launched its first AI-powered developer agents, which brought large language models (LLMs) to code-writing environments. Bito’s executive team sees developers and AI as key elements of the future and works to bring them together in powerful new ways with its expanding portfolio of AI-powered agents. Its flagship product, AI Code Review Agent, speeds pull request (PR) time-to-merge up to 89%—accelerating development times and allowing developers to focus on writing code and innovating. Intended to assist, not replace, review by senior software engineers, Bito’s AI Code Review Agent focuses on summarizing, organizing, then suggesting line-level fixes with code base context, freeing engineers to focus on the more strategic aspects of code review, such as evaluating business logic.

With more than 100,000 active developers using its products globally, Bito has proven successful at using generative AI to streamline code review—delivering impressive results and business value to its customers, which include Gainsight, Privado, PubMatic, On-Board Data Systems (OBDS), and leading software enterprises. Bito’s customers have found that AI Code Review Agent provides the following benefits:

Accelerates PR merges by 89%
Reduces regressions by 34%
Delivers 87% of a PR’s feedback necessary for review
Has a 2.33 signal-to-noise ratio
Works in over 50 programming languages

Although these results are compelling, Bito needed to overcome inherent wariness among developers evaluating a flood of new AI-powered tools. To accelerate and expand adoption of AI Code Review Agent, Bito leaders wanted to launch a free tier option, which would let these prospective users experience the capabilities and value that AI Code Review Agent offered—and encourage them to upgrade to the full, for-pay Teams Plan. Its Free Plan would offer AI-generated PR summaries to provide an overview of changes, and its Teams Plan would provide more advanced features, such as downstream impact analysis and one-click acceptance of line-level code suggestions.

In this post, we share how Bito is able to offer a free tier option for its AI-powered code reviews using Amazon Nova.

Choosing a cost-effective model for a free tier offering

To offer a free tier option for AI Code Review Agent, Bito needed a foundation model (FM) that would provide the right level of performance and results at a reasonable cost. Offering code review for free to its potential customers would not be free for Bito, of course, because it would be paying inference costs. To identify a model for its Free Plan, Bito carried out a 2-week evaluation process across a broad range of models, including the high-performing FMs on Amazon Bedrock, as well as OpenAI GPT-4o mini. The Amazon Nova models—fast, cost-effective models that were recently introduced on Amazon Bedrock—were particularly interesting to the team.

At the end of its evaluation, Bito determined that Amazon Nova Lite delivered the right mix of performance and cost-effectiveness for its use cases. Its speed provided fast creation of code review summaries. However, cost—a key consideration for Bito’s Free Plan—proved to be the deciding factor. Ultimately, Amazon Nova Lite met Bito’s criteria for speed, cost, and quality. The combination of Amazon Nova Lite and Amazon Bedrock also make it possible for Bito to offer the reliability and security that their customers needed when entrusting Bito with their code. After all, careful control of code is one of Bito’s core promises to its customers. It doesn’t store code or use it for model training. And its products are SOC 2 Type 2-certified to provide data security, processing integrity, privacy, and confidentiality.

Implementing the right models for different tiers of offerings

Bito has now adopted Amazon Bedrock as its standardized platform to explore, add, and run models. Bito uses Amazon Nova Lite as the primary model for its Free Plan, and Anthropic’s Claude 3.7 Sonnet powers its for-pay Teams Plan, all accessed and integrated through the unified Amazon Bedrock API and controls. Amazon Bedrock provides seamless shifting from Amazon Nova Lite to Anthropic’s Sonnet when customers upgrade, with minimal code changes. Bito leaders are quick to point out that Amazon Nova Lite doesn’t just power its Free Plan—it inspired it. Without the very low cost of Amazon Nova Lite, they wouldn’t have been able to offer a free tier of AI Code Review Agent, which they viewed as a strategic move that would enable it to expand its enterprise customer base. This strategy quickly generated results, attracting three times more prospective customers to its Free Plan than anticipated. At the end of the 14-day trial period, a significant number of users convert to the full AI Code Review Agent to access its full array of capabilities.

Encouraged by the success with AI Code Review Agent, Bito is now using Amazon Nova Lite to power the chat capabilities of its offering for Bito Wingman, its latest AI agentic technology—a full-featured developer assistant in the integrated development environment (IDE) that combines code generation, error handling, architectural advice, and more. Again, the combination of quality and low cost of Amazon Nova Lite made it the right choice for Bito.

Conclusion

In this post, we shared how Bito—an innovative startup that offers a growing portfolio of AI-powered developer agents—chose Amazon Nova Lite to power its free tier offering of AI Code Review Agent, its flagship product. Its AI-powered agents are designed specifically to make developers’ lives easier and their work more impactful:

Amazon Nova Lite enabled Bito to meet one of its core business challenges—attracting enterprise customers. By introducing a free tier, Bito attracted three times more prospective new customers to its generative AI-driven flagship product—AI Code Review Agent.
Amazon Nova Lite outperformed other models during rigorous internal testing, providing the right level of performance at the very low cost Bito needed to launch a free tier of AI Code Review Agent.
Amazon Bedrock empowers Bito to seamlessly switch between models as needed for each tier of AI Code Review Agent—Amazon Nova Lite for its Free Plan and Anthropic’s Claude 3.7 Sonnet for its for-pay Teams Plan. Amazon Bedrock also provided security and privacy, critical considerations for Bito customers.
Bito shows how innovative organizations can use the combination of quality, cost-effectiveness, and speed in Amazon Nova Lite to deliver value to their customers—and to their business.

“Our challenge is to push the capabilities of AI to deliver new value to developers, but at a reasonable cost,” shares Amar Goel, co-founder and CEO of Bito. “Amazon Nova Lite gives us the very fast, low-cost model we needed to power the free offering of our AI Code Review Agent—and attract new customers.”

Get started with Amazon Nova on the Amazon Bedrock console. Learn more Amazon Nova Lite at the Amazon Nova product page.

About the authors

Eshan Bhatnagar is the Director of Product Management for Amazon AGI at Amazon Web Services.

Amar Goel is Co-Founder and CEO of Bito. A serial entrepreneur, Amar previously founded PubMatic (went public in 2020), and formerly worked at Microsoft, McKinsey, and was a software engineer at Netscape, the original browser company. Amar attended Harvard University. He is excited about using GenAI to power the next generation of how software gets built!