June 2025 – Page 9

Build generative AI solutions with Amazon Bedrock

Generative AI is revolutionizing how businesses operate, interact with customers, and innovate. If you’re embarking on the journey to build a generative AI-powered solution, you might wonder how to navigate the complexities involved from selecting the right models to managing prompts and enforcing data privacy.

In this post, we show you how to build generative AI applications on Amazon Web Services (AWS) using the capabilities of Amazon Bedrock, highlighting how Amazon Bedrock can be used at each step of your generative AI journey. This guide is valuable for both experienced AI engineers and newcomers to the generative AI space, helping you use Amazon Bedrock to its fullest potential.

Amazon Bedrock is a fully managed service that provides a unified API to access a wide range of high-performing foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, Mistral AI, AI21 Labs, Stability AI, and Amazon. It offers a robust set of tools and features designed to help you build generative AI applications efficiently while adhering to best practices in security, privacy, and responsible AI.

Calling an LLM with an API

You want to integrate a generative AI feature into your application through a straightforward, single-turn interaction with a large language model (LLM). Perhaps you need to generate text, answer a question, or provide a summary based on user input. Amazon Bedrock simplifies generative AI application development and scaling through a unified API for accessing diverse, leading FMs. With support for Amazon models and leading AI providers, you have the freedom to experiment without being locked into a single model or provider. With the rapid pace of development in AI, you can seamlessly switch models for optimized performance with no application rewrite required.

Beyond direct model access, Amazon Bedrock expands your options with the Amazon Bedrock Marketplace. This marketplace gives you access to over 100 specialized FMs; you can discover, test, and integrate new capabilities all through fully managed endpoints. Whether you need the latest innovation in text generation, image synthesis, or domain-specific AI, Amazon Bedrock provides the flexibility to adapt and scale your solution with ease.

With one API, you stay agile and can effortlessly switch between models, upgrade to the latest versions, and future-proof your generative AI applications with minimal code changes. To summarize, Amazon Bedrock offers the following benefits:

Simplicity: No need to manage infrastructure or deal with multiple APIs
Flexibility: Experiment with different models to find the best fit
Scalability: Scale your application without worrying about underlying resources

To get started, use the Chat or Text playground to experiment with different FMs, and use the Converse API to integrate FMs into your application.

After you’ve integrated a basic LLM feature, the next step is optimizing the performance and making sure you’re using the right model for your requirements. This brings us to the importance of evaluating and comparing models.

Choosing the right model for your use case

Selecting the right FM for your use case is crucial, but with so many options available, how do you know which one will give you the best performance for your application? Whether it’s for generating more relevant responses, summarizing information, or handling nuanced queries, choosing the best model is key to providing optimal performance.

You can use Amazon Bedrock model evaluation to rigorously test different FMs to find the one that delivers the best results for your use case. Whether you’re in the early stages of development or preparing for launch, selecting the right model can make a significant difference in the effectiveness of your generative AI solutions.

The model evaluation process consists of the following components:

Automatic and human evaluation: Begin by experimenting with different models using automated evaluation metrics like accuracy, robustness, or toxicity. You can also bring in human evaluators to measure more subjective factors, such as friendliness, style, or how well the model aligns with your brand voice.
Custom datasets and metrics: Evaluate the performance of models using your own datasets or pre-built options. Customize the metrics that matter most for your project, making sure the selected model aligns with your business or operational goals.
Iterative feedback: Throughout the development process, run evaluations iteratively, allowing for faster refinement. This helps you compare models side by side, so you can make a data-driven decision when selecting the FM that fits your use case.

Imagine you’re building a customer support AI assistant for an ecommerce service. You can model evaluation to test multiple FMs with real customer queries, evaluating which model provides the most accurate, friendly, and contextually appropriate responses. By comparing models side by side, you can choose the model that will deliver the best possible user experience for your customers. After you’ve evaluated and selected the ideal model, the next step is making sure it aligns with your business needs. Off-the-shelf models might perform well, but for a truly tailored experience, you need more customization. This leads to the next important step in your generative AI journey: personalizing models to reflect your business context. You need to make sure the model generates the most accurate and contextually relevant responses. Even the best FMs will not have access to the latest or domain-specific information critical to your business. To solve this, the model needs to use your proprietary data sources, making sure its outputs reflect the most up-to-date and relevant information. This is where you can use Retrieval Augmented Generation (RAG) to enrich the model’s responses by incorporating your organization’s unique knowledge base.

Enriching model responses with your proprietary data

A publicly available LLM might perform well on general knowledge tasks, but struggle with outdated information or lack context from your organization’s proprietary data. You need a way to provide the model with the most relevant, up-to-date insights to provide accuracy and contextual depth. There are two key approaches that you can use to enrich model responses:

RAG: Use RAG to dynamically retrieve relevant information at query time, enriching model responses without requiring retraining
Fine-tuning: Use RAG to customize your chosen model by training it on proprietary data, improving its ability to handle organization-specific tasks or domain knowledge

We recommend starting with RAG because of its flexible and straightforward to implement. You can then fine-tune the model for deeper domain adaptation if needed. RAG dynamically retrieves relevant information at query time, making sure model responses stay accurate and context aware. In this approach, data is first processed and indexed in a vector database or similar retrieval system. When a user submits a query, Amazon Bedrock searches this indexed data to find relevant context, which is injected into the prompt. The model then generates a response based on both the original query and the retrieved insights without requiring additional training.

Amazon Bedrock Knowledge Bases automates the RAG pipeline—including data ingestion, retrieval, prompt augmentation, and citations—reducing the complexity of setting up custom integrations. By seamlessly integrating proprietary data, you can make sure that the models generate accurate, contextually rich, and continuously updated responses.

Bedrock Knowledge Bases supports various data types to tailor AI-generated responses to business-specific needs:

Unstructured data: Extract insights from text-heavy sources like documents, PDFs, and emails
Structured data: Enable natural language queries on databases, data lakes, and warehouses without moving or preprocessing data
Multimodal data: Process both text and visual elements in documents and images using Amazon Bedrock Data Automation
GraphRAG: Enhance knowledge retrieval with graph-based relationships, enabling AI to understand entity connections for more context-aware responses

With these capabilities, Amazon Bedrock reduces data silos, making it straightforward to enrich AI applications with both real-time and historical knowledge. Whether working with text, images, structured datasets, or interconnected knowledge graphs, Amazon Bedrock provides a fully managed, scalable solution without the need for complex infrastructure. To summarize, using RAG with Amazon Bedrock offers the following benefits:

Up-to-date information: Responses include the latest data from your knowledge bases
Accuracy: Reduces the risk of incorrect or irrelevant answers
No extra infrastructure: You can avoid setting up and managing your own vector databases or custom integrations

When your model is pulling from the most accurate and relevant data, you might find that its general behavior still needs some refinement perhaps in its tone, style, or understanding of industry-specific language. This is where you can further fine-tune the model to align it even more closely with your business needs.

Tailoring models to your business needs

Out-of-the-box FMs provide a strong starting point, but they often lack the precision, brand voice, or industry-specific expertise required for real-world applications. Maybe the language doesn’t align with your brand, or the model struggles with specialized terminology. You might have experimented with prompt engineering and RAG to enhance responses with additional context. Although these techniques help, they have limitations (for example, longer prompts can increase latency and cost), and models might still lack deep domain expertise needed for domain-specific tasks. To fully harness generative AI, businesses need a way to securely adapt models, making sure AI-generated responses are not only accurate but also relevant, reliable, and aligned with business goals.

Amazon Bedrock simplifies model customization, enabling businesses to fine-tune FMs with proprietary data without building models from scratch or managing complex infrastructure.

Rather than retraining an entire model, Amazon Bedrock provides a fully managed fine-tuning process that creates a private copy of the base FM. This makes sure your proprietary data remains confidential and isn’t used to train the original model. Amazon Bedrock offers two powerful techniques to help businesses refine models efficiently:

Fine-tuning: You can train an FM with labeled datasets to improve accuracy in industry-specific terminology, brand voice, and company workflows. This allows the model to generate more precise, context-aware responses without relying on complex prompts.
Continued pre-training: If you have unlabeled domain-specific data, you can use continued pre-training to further train an FM on specialized industry knowledge without manual labeling. This approach is especially useful for regulatory compliance, domain-specific jargon, or evolving business operations.

By combining fine-tuning for core domain expertise with RAG for real-time knowledge retrieval, businesses can create highly specialized AI models that stay accurate and adaptable, and make sure the style of responses align with business goals. To summarize, Amazon Bedrock offers the following benefits:

Privacy-preserved customization: Fine-tune models securely while making sure that your proprietary data remains private
Efficiency: Achieve high accuracy and domain relevance without the complexity of building models from scratch

As your project evolves, managing and optimizing prompts becomes critical, especially when dealing with different iterations or testing multiple prompt versions. The next step is refining your prompts to maximize model performance.

Managing and optimizing prompts

As your AI projects scale, managing multiple prompts efficiently becomes a growing challenge. Tracking versions, collaborating with teams, and testing variations can quickly become complex. Without a structured approach, prompt management can slow down innovation, increase costs, and make iteration cumbersome. Optimizing a prompt for one FM doesn’t always translate well to another. A prompt that performs well with one FM might produce inconsistent or suboptimal outputs with another, requiring significant rework. This makes switching between models time-consuming and inefficient, limiting your ability to experiment with different AI capabilities effectively. Without a centralized way to manage, test, and refine prompts, AI development becomes slower, more costly, and less adaptable to evolving business needs.

Amazon Bedrock simplifies prompt engineering with Amazon Bedrock Prompt Management, an integrated system that helps teams create, refine, version, and share prompts effortlessly. Instead of manually adjusting prompts for months, Amazon Bedrock accelerates experimentation and enhances response quality without additional code. Bedrock Prompt Management introduces the following capabilities:

Versioning and collaboration: Manage prompt iterations in a shared workspace, so teams can track changes and reuse optimized prompts.
Side-by-side testing: Compare up to two prompt variations simultaneously to analyze model behavior and identify the most effective format.
Automated prompt optimization: Fine-tune and rewrite prompts based on the selected FM to improve response quality. You can select a model, apply optimization, and generate a more accurate, contextually relevant prompt.

Bedrock Prompt Management offers the following benefits:

Efficiency: Quickly iterate and optimize prompts without writing additional code
Teamwork: Enhance collaboration with shared access and version control
Insightful testing: Identify which prompts perform best for your use case

After you’ve optimized your prompts for the best results, the next challenge is optimizing your application for cost and latency by choosing the most appropriate model within a family for a given task. This is where intelligent prompt routing can help.

Optimizing efficiency with intelligent model selection

Not all prompts require the same level of AI processing. Some are straightforward and need fast responses, whereas others require deeper reasoning and more computational power. Using high-performance models for every request increases costs and latency, even when a lighter, faster model could generate an equally effective response. At the same time, relying solely on smaller models might reduce accuracy for complex queries. Without an automated approach, business must manually determine which model to use for each request, leading to higher costs, inefficiencies, and slower development cycles.

Amazon Bedrock Intelligent Prompt Routing optimizes AI performance and cost by dynamically selecting the most appropriate FM for each request. Instead of manually choosing a model, Amazon Bedrock automates model selection within a model family, making sure that each prompt is routed to the best-performing model for its complexity. Bedrock Intelligent Prompt Routing offers the following capabilities:

Adaptive model routing: Automatically directs simple prompts to lightweight models and complex queries to more advanced models, providing the right balance between speed and efficiency
Performance balance: Makes sure that you use high-performance models only when necessary, reducing AI inference costs by up to 30%
Effortless integration: Automatically selects the right model within a family, simplifying deployment

By automating model selection, Amazon Bedrock removes the need for manual decision-making, reduces operational overhead, and makes sure AI applications run efficiently at scale. With Amazon Bedrock Intelligent Prompt Routing, each query is processed by the most efficient model, delivering speed, cost savings, and high-quality responses. The next step in optimizing AI efficiency is reducing redundant computations in frequently used prompts. Many AI applications require maintaining context across multiple interactions, which can lead to performance bottlenecks, increased costs, and unnecessary processing overhead.

Reducing redundant processing for faster responses

As your generative AI applications scale, efficiency becomes just as critical as accuracy. Applications that repeatedly use the same context—such as document Q&A systems (where users ask multiple questions about the same document) or coding assistants that maintain context about code files—often face performance bottlenecks and rising costs because of redundant processing. Each time a query includes long, static context, models reprocess unchanged information, leading to increased latency as models repeatedly analyze the same content and unnecessary token usage inflates compute expenses. To keep AI applications fast, cost-effective, and scalable, optimizing how prompts are reused and processed is essential.

Amazon Bedrock Prompt Caching enhances efficiency by storing frequently used portions of prompts—reducing redundant computations and improving response times. It offers the following benefits:

Faster processing: Skips unnecessary recomputation of cached prompt prefixes, boosting overall throughput
Lower latency: Reduces processing time for long, repetitive prompts, delivering a smoother user experience, and reducing latency by up to 85% for supported models
Cost-efficiency: Minimizes compute resource usage by avoiding repeated token processing, reducing costs by up to 90%

With prompt caching, AI applications respond faster, reduce operational costs, and scale efficiently while maintaining high performance. With Bedrock Prompt Caching providing faster responses and cost-efficiency, the next step is enabling AI applications to move beyond static prompt-response interactions. This is where agentic AI comes in, empowering applications to dynamically orchestrate multistep processes, automate decision-making, and drive intelligent workflows.

Automating multistep tasks with agentic AI

As AI applications grow more sophisticated, automating complex, multistep tasks become essential. You need a solution that can interact with internal systems, APIs, and databases to execute intricate workflows autonomously. The goal is to reduce manual intervention, improve efficiency, and create more dynamic, intelligent applications. Traditional AI models are reactive; they generate responses based on inputs but lack the ability to plan and execute multistep tasks. Agentic AI refers to AI systems that act with autonomy, breaking down complex tasks into logical steps, making decisions, and executing actions without constant human input. Unlike traditional models that only respond to prompts, agentic AI models have the following capabilities:

Autonomous planning and execution: Breaks complex tasks into smaller steps, makes decisions, and plans actions to complete the workflow
Chaining capabilities: Handles sequences of actions based on a single request, enabling the AI to manage intricate tasks that would otherwise require manual intervention or multiple interactions
Interaction with APIs and systems: Connects to your enterprise systems and automatically invokes necessary APIs or databases to fetch or update data

Amazon Bedrock Agents enables AI-powered task automation by using FMs to plan, orchestrate, and execute workflows. With a fully managed orchestration layer, Amazon Bedrock simplifies the process of deploying, scaling, and managing AI agents. Bedrock Agents offers the following benefits:

Task orchestration: Uses FMs’ reasoning capabilities to break down tasks, plan execution, and manage dependencies
API integration: Automatically calls APIs within enterprise systems to interact with business applications
Memory retention: Maintains context across interactions, allowing agents to remember previous steps, providing a seamless user experience

When a task requires multiple specialized agents, Amazon Bedrock supports multi-agent collaboration, making sure agents work together efficiently while alleviating manual orchestration overhead. This unlocks the following capabilities:

Supervisor-agent coordination: A supervisor agent delegates tasks to specialized subagents, providing optimal distribution of workloads
Efficient task execution: Supports parallel task execution, enabling faster processing and improved accuracy
Flexible collaboration modes: You can choose between the following modes:
- Fully orchestrated supervisor mode: A central agent manages the full workflow, providing seamless coordination
- Routing mode: Basic tasks bypass the supervisor and go directly to subagents, reducing unnecessary orchestration
Seamless integration: Works with enterprise APIs and internal knowledge bases, making it straightforward to automate business operations across multiple domains

By using multi-agent collaboration, you can increase task success rates, reduce execution time, and improve accuracy, making AI-driven automation more effective for real-world, complex workflows. To summarize, agentic AI offers the following benefits:

Automation: Reduces manual intervention in complex processes
Flexibility: Agents can adapt to changing requirements or gather additional information as needed
Transparency: You can use the trace capability to debug and optimize agent behavior

Although automating tasks with agents can streamline operations, handling sensitive information and enforcing privacy is paramount, especially when interacting with user data and internal systems. As your application grows more sophisticated, so do the security and compliance challenges.

Maintaining security, privacy, and responsible AI practices

As you integrate generative AI into your business, security, privacy, and compliance become critical concerns. AI-generated responses must be safe, reliable, and aligned with your organization’s policies to help violating brand guidelines or regulatory policies, and must not include inaccurate or misleading responses.

Amazon Bedrock Guardrails provides a comprehensive framework to enhance security, privacy, and accuracy in AI-generated outputs. With built-in safeguards, you can enforce policies, filter content, and improve trustworthiness in AI interactions. Bedrock Guardrails offers the following capabilities:

Content filtering: Block undesirable topics and harmful content in user inputs and model responses.
Privacy protection: Detect and redact sensitive information like personally identifiable information (PII) and confidential data to help prevent data leaks.
Custom policies: Define organization-specific rules to make sure AI-generated content aligns with internal policies and brand guidelines.
Hallucination detection: Identify and filter out responses not grounded in your data sources through the following capabilities:
- Contextual grounding checks: Make sure model responses are factually correct and relevant by validating them against enterprise data source. Detect hallucinations when outputs contain unverified or irrelevant information.
- Automated reasoning for accuracy: Moves beyond trust me to prove it AI outputs by applying mathematically sound logic and structured reasoning to verify factual correctness.

With security and privacy measures in place, your AI solution is not only powerful but also responsible. However, if you’ve already made significant investments in custom models, the next step is to integrate them seamlessly into Amazon Bedrock.

Using existing custom models with Amazon Bedrock Custom Model Import

Use Amazon Bedrock Custom Model Import if you’ve already invested in custom models developed outside of Amazon Bedrock and want to integrate them into your new generative AI solution without managing additional infrastructure.

Bedrock Custom Model Import includes the following capabilities:

Seamless integration: Import your custom models into Amazon Bedrock
Unified API access: Interact with models—both base and custom—through the same API
Operational efficiency: Let Amazon Bedrock handle the model lifecycle and infrastructure management

Bedrock Custom Model Import offers the following benefits:

Cost savings: Maximize the value of your existing models
Simplified management: Reduce overhead by consolidating model operations
Consistency: Maintain a unified development experience across models

By importing custom models, you can use your prior investments. To truly unlock the potential of your models and prompt structures, you can automate more complex workflows, combining multiple prompts and integrating with other AWS services.

Automating workflows with Amazon Bedrock Flows

You need to build complex workflows that involve multiple prompts and integrate with other AWS services or business logic, but you want to avoid extensive coding.

Amazon Bedrock Flows has the following capabilities:

Visual builder: Drag-and-drop components to create workflows
Workflow automation: Link prompts with AWS services and automate sequences
Testing and versioning: Test flows directly in the console and manage versions

Amazon Bedrock Flows offers the following benefits:

No-code solution: Build workflows without writing code
Speed: Accelerate development and deployment of complex applications
Collaboration: Share and manage workflows within your team

With workflows now automated and optimized, you’re nearly ready to deploy your generative AI-powered solution. The final stage is making sure that your generative AI solution can scale efficiently and maintain high performance as demand grows.

Monitoring and logging to close the loop on AI operations

As you prepare to move your generative AI application into production, it’s critical to implement robust logging and observability to monitor system health, verify compliance, and quickly troubleshoot issues. Amazon Bedrock offers built-in observability capabilities that integrate seamlessly with AWS monitoring tools, enabling teams to track performance, understand usage patterns, and maintain operational control

Model invocation logging: You can enable detailed logging of model invocations, capturing input prompts and output responses. These logs can be streamed to Amazon CloudWatch or Amazon Simple Storage Service (Amazon S3) for real-time monitoring or long-term analysis. Logging is configurable through the AWS Management Console or the CloudWatchConfig API.
CloudWatch metrics: Amazon Bedrock provides rich operational metrics out-of-the-box, including:
- Invocation count
- Token usage (input/output)
- Response latency
- Error rates (for example, invalid input and model failures)

These capabilities are essential for running generative AI solutions at scale with confidence. By using CloudWatch, you gain visibility across the full AI pipeline from input prompts to model behavior; making it straightforward to maintain uptime, performance, and compliance as your application grows.

Finalizing and scaling your generative AI solution

You’re ready to deploy your generative AI application and need to scale it efficiently while providing reliable performance. Whether you’re handling unpredictable workloads, enhancing resilience, or needing consistent throughput, you must choose the right scaling approach. Amazon Bedrock offers three flexible scaling options that you can use to tailor your infrastructure to your workload needs:

On-demand: Start with the flexibility of on-demand scaling, where you pay only for what you use. This option is ideal for early-stage deployments or applications with variable or unpredictable traffic. It offers the following benefits:
- No commitments.
- Pay only for tokens processed (input/output).
- Great for dynamic or fluctuating workloads.
Cross-Region inference: When your traffic grows or becomes unpredictable, you can use cross-Region inference to handle bursts by distributing compute across multiple AWS Regions, enhancing availability without additional cost. It offers the following benefits:
- Up to two times larger burst capacity.
- Improved resilience and availability.
- No additional charges, you have the same pricing as your primary Region.
Provisioned Throughput: For large, consistent workloads, Provisioned Throughput maintains a fixed level of performance. This option is perfect when you need predictable throughput, particularly for custom models. It offers the following benefits:
- Consistent performance for high-demand applications.
- Required for custom models.
- Flexible commitment terms (1 month or 6 months).

Conclusion

Building generative AI solutions is a multifaceted process that requires careful consideration at every stage. Amazon Bedrock simplifies this journey by providing a unified service that supports each phase, from model selection and customization to deployment and compliance. Amazon Bedrock offers a comprehensive suite of features that you can use to streamline and enhance your generative AI development process. By using its unified tools and APIs, you can significantly reduce complexity, enabling accelerated development and smoother workflows. Collaboration becomes more efficient because team members can work seamlessly across different stages, fostering a more cohesive and productive environment. Additionally, Amazon Bedrock integrates robust security and privacy measures, helping to ensure that your solutions meet industry and organization requirements. Finally, you can use its scalable infrastructure to bring your generative AI solutions to production faster while minimizing overhead. Amazon Bedrock stands out as a one-stop solution that you can use to build sophisticated, secure, and scalable generative AI applications. Its extensive capabilities alleviate the need for multiple vendors and tools, streamlining your workflow and enhancing productivity.

Explore Amazon Bedrock and discover how you can use its features to support your needs at every stage of generative AI development. To learn more, see the Amazon Bedrock User Guide.

About the authors

Venkata Santosh Sajjan Alla is a Senior Solutions Architect at AWS Financial Services, driving AI-led transformation across North America’s FinTech sector. He partners with organizations to design and execute cloud and AI strategies that speed up innovation and deliver measurable business impact. His work has consistently translated into millions in value through enhanced efficiency and additional revenue streams. With deep expertise in AI/ML, Generative AI, and cloud-native architectures, Sajjan enables financial institutions to achieve scalable, data-driven outcomes. When not architecting the future of finance, he enjoys traveling and spending time with family. Connect with him on LinkedIn.

Axel Larsson is a Principal Solutions Architect at AWS based in the greater New York City area. He supports FinTech customers and is passionate about helping them transform their business through cloud and AI technology. Outside of work, he is an avid tinkerer and enjoys experimenting with home automation.

How Netsertive built a scalable AI assistant to extract meaningful insights from real-time data using Amazon Bedrock and Amazon Nova

This post was co-written with Herb Brittner from Netsertive.

Netsertive is a leading digital marketing solutions provider for multi-location brands and franchises, helping businesses maximize local advertising, improve engagement, and gain deep customer insights.

With a growing demand in providing more actionable insights from their customer call tracking data, Netsertive needed a solution that could unlock business intelligence from every call, making it easier for franchises to improve customer service and boost conversion rates. The team was looking for a single, flexible system that could do several things:

Understand phone calls – Automatically create summaries of what was discussed
Gauge customer feelings – Determine if the caller was happy, upset, or neutral
Identify important topics – Pull out keywords related to frequent services, questions, problems, and mentions of competitors
Improve agent performance – Offer advice and suggestions for coaching
Track performance over time – Generate reports on trends for individual locations, regions, and the entire country

Crucially, this new system needed to work smoothly with their existing Multi-Location Experience (MLX) platform. The MLX platform is specifically designed for businesses with many locations and helps them manage both national and local marketing. It allows them to run campaigns across various online channels, including search engines, social media, display ads, videos, connected TVs, and online reviews, as well as manage SEO, business listings, reviews, social media posting, and individual location web pages.

In this post, we show how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova, to bring their next generation of the platform to life.

Solution overview

Operating a comprehensive digital marketing solution, Netsertive handles campaign execution while providing key success metrics through their Insights Manager product. The platform features location-specific content management capabilities and robust lead capture functionality, collecting data from multiple sources, including paid campaigns, organic website traffic, and attribution pro forms. With CRM integration and call tracking features, MLX creates a seamless flow of customer data and marketing insights. This combination of managed services, automated tools, and analytics makes MLX a single source of truth for businesses seeking to optimize their digital marketing efforts while taking advantage of Netsertive’s expertise in campaign management. To address their desire to provide more actionable insights on the platform from customer call tracking data, Netsertive considered various solutions. After evaluating different tools and models, they decided to use Amazon Bedrock and the Amazon Nova Micro model. This choice was driven by the API-driven approach of Amazon Bedrock, its wide selection of large language models (LLMs), and the performance of the Amazon Nova Micro model specifically. They selected Amazon Nova Micro based on its ability to deliver fast response times at a low cost, while providing consistent and intelligent insights—key factors for Netsertive. With its generation speed of over 200 tokens per second and highly performant language understanding skills, this text-only model proved ideal for Netsertive. The following diagram shows how their MLX platform receives real-time phone calls and uses Amazon Nova Micro in Amazon Bedrock for processing real-time phone calls.

The real-time call processing flow consists of the following steps:

When a call comes in, it’s immediately routed to the Lead API. This process captures both the live call transcript and important metadata about the caller. This system continuously processes new calls as they arrive, facilitating real-time handling of incoming communications.
The captured transcript is forwarded to Amazon Bedrock for analysis. The system currently uses a standardized base prompt for all customers, and the architecture is designed to allow for customer-specific prompt customization as an added layer of context.
Amazon Nova Micro processes the transcript and returns a structured JSON response. This response includes multiple analysis components: sentiment analysis of the conversation, a concise call summary, identified key terms, overall call theme classification, and specific coaching suggestions for improvement.
All analysis results are systematically stored in an Amazon Aurora database with their associated key metrics. This makes sure the processed data is properly indexed and readily available for both immediate access and future analysis.

The aggregate report schedule flow consists of the following steps:

The aggregate analysis process automatically initiates on both weekly and monthly schedules. During each run, the system gathers call data that falls within the specified time period.
This aggregate analysis uses both Amazon Bedrock and Amazon Nova Micro, applying a specialized prompt designed specifically for trend analysis. This prompt differs from the real-time analysis to focus on identifying patterns and insights across multiple calls.

The processed aggregate data from both workflows is transformed into comprehensive reports displaying trend analysis and comparative metrics through the UI. This provides stakeholders with valuable insights into performance patterns and trends over time while allowing the user to dive deeper into specific metrics.

Results

The implementation of generative AI to create a real-time call data analysis solution has been a transformative journey for Netsertive. Their new Call Insights AI feature, using Amazon Nova Micro on Amazon Bedrock, only takes minutes to create actionable insights, compared to their previous manual call review processes, which took hours or even days for customers with high call volumes. Netsertive chose Amazon Bedrock and Amazon Nova Micro for their solution after a swift evaluation period of approximately 1 week of testing different tools and models. Their development approach was methodical and customer-focused. The Call Insights AI feature was added to their platform’s roadmap based on direct customer feedback and internal marketing expertise. The entire development process, from creating and testing their Amazon Nova Micro prompts to integrating Amazon Bedrock with their MLX platform, was completed within approximately 30 days before launching in beta. The transformation of real-time call data analysis isn’t just about processing more calls—it’s about creating a more comprehensive understanding of customer interactions. By implementing Amazon Bedrock and Amazon Nova Micro, Netsertive is able to better understand call purposes and value, enhance measurement capabilities, and progress towards more automated and efficient analysis systems. This evolution can not only streamline operations but also provide customers with more actionable insights about their digital marketing performance.

Conclusion

In this post, we shared how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova. This solution helped scale their MLX platform to provide their customers with instant, actionable insights, creating a more engaging and informative user experience. By using the advanced natural language processing capabilities of Amazon Bedrock and the high-performance, low-latency Amazon Nova Micro model, Netsertive was able to build a comprehensive call intelligence system that goes beyond just transcription and sentiment analysis.

The success of this project has demonstrated the transformative potential of generative AI in driving business intelligence and operational efficiency. To learn more about building powerful, generative AI assistants and applications using Amazon Bedrock and Amazon Nova, see Generative AI on AWS.

About the authors

Nicholas Switzer is an AI/ML Specialist Solutions Architect at Amazon Web Services. He joined AWS in 2022 and specializes in AI/ML, generative AI, IoT, and edge AI. He is based in the US and enjoys building intelligent products that improve everyday life.

Jane Ridge is Senior Solutions Architect at Amazon Web Services with over 20 years of technology experience. She joined AWS in 2020 and is based in the US. She is passionate around enabling growth of her customers through innovative solutions combined with her deep technical expertise in the AWS ecosystem. She is known for her ability to guide customers through all stages of their cloud journey and deliver impactful solutions.

Herb Brittner is the Vice President of Product & Engineering at Netsertive, where he leads the development of AI-driven digital marketing solutions for multi-location brands and franchises. With a strong background in product innovation and scalable engineering, he specializes in using machine learning and cloud technologies to drive business insights and customer engagement. Herb is passionate about building data-driven platforms that enhance marketing performance and operational efficiency.

Make videos accessible with automated audio descriptions using Amazon Nova

According to the World Health Organization, more than 2.2 billion people globally have vision impairment. For compliance with disability legislation, such as the Americans with Disabilities Act (ADA) in the United States, media in visual formats like television shows or movies are required to provide accessibility to visually impaired people. This often comes in the form of audio description tracks that narrate the visual elements of the film or show. According to the International Documentary Association, creating audio descriptions can cost $25 per minute (or more) when using third parties. For building audio descriptions internally, the effort for businesses in the media industry can be significant, requiring content creators, audio description writers, description narrators, audio engineers, delivery vendors and more according to the American Council of the Blind (ACB). This leads to the natural question, can you automate this process with the help of generative AI offerings in Amazon Web Services (AWS)?

Newly announced in December at re:Invent 2024, the Amazon Nova Foundation Models family is available through Amazon Bedrock and includes three multimodal foundational models (FMs):

Amazon Nova Lite (GA) – A low-cost multimodal model that’s lightning-fast for processing image, video, and text inputs
Amazon Nova Pro (GA) – A highly capable multimodal model with a balanced combination of accuracy, speed, and cost for a wide range of tasks
Amazon Nova Premier (GA) – Our most capable model for complex tasks and a teacher for model distillation

In this post, we demonstrate how you can use services like Amazon Nova, Amazon Rekognition, and Amazon Polly to automate the creation of accessible audio descriptions for video content. This approach can significantly reduce the time and cost required to make videos accessible for visually impaired audiences. However, this post doesn’t provide a complete, deployment-ready solution. We share pseudocode snippets and guidance in sequential order, in addition to detailed explanations and links to resources. For a complete script, you can use additional resources, such as Amazon Q Developer, to build a fully functional system. The automated workflow described in the post involves analyzing video content, generating text descriptions, and narrating them using AI voice generation. In summary, while powerful, this requires careful integration and testing to deploy effectively. By the end of this post, you’ll understand the key steps, but some additional work is needed to create a production-ready solution for your specific use case.

Solution overview

The following architecture diagram demonstrates the end-to-end workflow of the proposed solution. We will describe each component in-depth in the later sections of this post, but note that you can define the logic within a single script. You can then run your script on an Amazon Elastic Compute Cloude (Amazon EC2) instance or on your local computer. For this post, we assume that you will run the script on an Amazon SageMaker notebook.

Services used

The services shown in the architecture diagram include:

Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that provides scalable, durable, and highly available storage. In this example, we use Amazon S3 to store the video files (input) and scene description (text files) and audio description (MP3 files) output generated by the solution. The script starts by fetching the source video from an S3 bucket.
Amazon Rekognition – Amazon Rekognition is a computer vision service that can detect and extract video segments or scenes by identifying technical cues such as shot boundaries, black frames, and other visual elements. To yield higher accuracy for the generated video descriptions, you use Amazon Rekognition to segment the source video into smaller chunks before passing it to Amazon Nova. These video segments can be stored in a temporary directory on your compute machine.
Amazon Bedrock – Amazon Bedrock is a managed service that provides access to large, pre-trained AI models such as the Amazon Nova Pro model, which is used in this solution to analyze the content of each video segment and generate detailed scene descriptions. You can store these text descriptions in a text file (for example, video_analysis.txt).
Amazon Polly – Amazon Polly is a text-to-speech service that is used to convert the text descriptions generated by the Amazon Nova Pro model into high-quality audio, made available using an MP3 file.

Prerequisites

To follow along with the solution outlined in this post, you should have the following in place:

A video file. For this post, we use a public domain video, This is Coffee.
An AWS account with access to the following services:
- Amazon Rekognition
- Amazon Nova Pro
- Amazon S3
- Amazon Polly
- Configure your AWS Command Line Interface (AWS CLI) or environment with valid credentials (using aws configure or environment variables)
To write the script, you need access to an AWS Software Development Kit (AWS SDK) in the language of your choice. In this post, we assume that you will use the AWS SDK for Python (Boto3). Additional information on AWS SDK for Boto3 is available in the Quickstart for Boto3.

You can use AWS SDK to create, configure, and manage AWS services. For Boto3, you can include it at the top of your script using: import boto3

Additionally, you need a mechanism to split videos. If you’re using Python, we recommend the moviepy library.
import moviepy # pip install moviepy

Solution walkthrough

The solution includes the following basic steps, which you can use as a basic structure and customize or expand to fit your use case.

Define the requirements for the AWS environment, including defining the use of the Amazon Nova Pro model for its visual support and the AWS Region you’re working in. For optimal throughput, we recommend using inference profiles when configuring Amazon Bedrock to invoke the Amazon Nova Pro model. Initialize a client for Amazon Rekognition, which you use for its support of segmentation.

CLASS VideoAnalyzer:
	FUNCTION initialize():
 		Set AWS_REGION to "us-east-1"
 		Set MODEL_ID to "amazon.nova-pro-v1:0"
 		Set chunk_delay to 20 Initialize AWS clients (Bedrock and Rekognition)

Define a function for detecting segments in the video. Amazon Rekognition supports segmentation, which means users have the option to detect and extract different segments or scenes within a video. By using the Amazon Rekognition Segment API, you can perform the following:
1. Detect technical cues such as black frames, color bars, opening and end credits, and studio logos in a video.
2. Detect shot boundaries to identify the start, end, and duration of individual shots within the video.

The solution uses Amazon Rekognition to partition the video into multiple segments and perform Amazon Nova Pro-based inference on each segment. Finally, you can piece together each segment’s inference output to return a comprehensive audio description for the entire video.

FUNCTION get_segment_results(job_id):
 	TRY:
 	   Initialize empty segments list 
 	   WHILE more results exist:
 	         Get segment detection results 
                Add segments to list 
                IF no more results THEN break
          RETURN segments 
       CATCH any errors and return null 

FUNCTION extract_scene(video_path, start_time, end_time):
       TRY: 
           Load video file 
           Validate time range
           Create temporary directory 
           Extract video segment 
           Save segment to file 
           RETURN path to saved segment 
       CATCH any errors and return null

In the preceding image, there are two scenes: a screenshot of one scene on the left followed by the scene that immediately follows it on the right. With the Amazon Rekognition segmentation API, you can identify that the scene has changed—that the content that is displayed on screen is different—and therefore you need to generate a new scene description.

Create the segmentation job and:
- Upload the video file for which you want to create an audio description to Amazon S3.
- Start the job using that video.

Setting SegmentType=[‘SHOT’] identifies the start, end, and duration of a scene. Additionally, MinSegmentConfidence sets the minimum confidence Amazon Rekognition must have to return a detected segment, with 0 being lowest confidence and 100 being highest.

Use the analyze_chunk function. This function defines the main logic of the audio description solution. Some items to note about analyze_chunk:
- For this example, we sent a video scene to Amazon Nova Pro for an analysis of the contents using the prompt Describe what is happening in this video in detail. This prompt is relatively straightforward and experimentation or customization for your use case is encouraged. Amazon Nova Pro then returned the text description for our video scene.
- For longer videos with many scenes, you might encounter throttling. This is resolved by implementing a retry mechanism. For details on throttling and quotas for Amazon Bedrock, see Quotas for Amazon Bedrock.

FUNCTION analyze_chunk(chunk_path): 
     TRY: 
        Convert video chunk to base64 
        Create request body for Bedrock 
        Set max_retries and backoff_time 

        WHILE retry_count < max_retries:
          TRY:
             Send InvokeModel request to Bedrock
             RETURN analysis results 
          CATCH throttling: 
              Wait and retry with exponential backoff 
          CATCH other errors: 
              Return null 
     CATCH any errors:
         Return null

In effect, the raw scenes are converted into rich, descriptive text. Using this text, you can generate a complete scene-by-scene walkthrough of the video and send it to Amazon Polly for audio.

Use the following code to orchestrate the process:
1. Initiate the detection of the various segments by using Amazon Rekognition.
2. Each segment is processed through a flow of:
  1. Extraction.
  2. Analysis using Amazon Nova Pro.
  3. Compiling the analysis into a video_analysis.txt file.
The analyze_video function brings together all the components and produces a text file that contains the complete, scene-by-scene analysis of the video contents, with timestamps

FUNCTION analyze_video(video_path, bucket): 
     TRY: 
         Start segment detection 
         Wait for job completion 
         Get segments 
         FOR each segment: 
             Extract scene 
             Analyze chunk 
             Save analysis results 
         Write results to file 
      CATCH any errors

If you refer back to the previous screenshot, the output—without any additional refinement—will look similar to the following image.

“Segment 103.136-126.026 seconds:
[{'text': 'The video shows a close-up of a coffee cup with steam rising from it, followed by three cups of coffee on a table with milk and sugar jars. A person then picks up a bunch of coffee beans from a plant.'}]
Segment 126.059-133.566 seconds:
[{'text': "The video starts with a person's hand, covered in dirt and holding a branch with green leaves and berries. The person then picks up some berries. The video then shows a man standing in a field with trees and plants. He is holding a bunch of red fruits in his right hand and looking at them. He is wearing a shirt and has a mustache. He seems to be picking the fruits. The fruits are probably coffee beans. The area is surrounded by green plants and trees."}]”

The following screenshot is an example is a more extensive look at the video_analysis.txt for the coffee.mp4 video:

Send the contents of the text file to Amazon Polly. Amazon Polly adds a voice to the text file, completing the workflow of the audio description solution.

FUNCTION generate_audio(text_file, output_audio_file):
     TRY:
        Read analysis text
        Set max_retries and backoff_time

        WHILE retry_count < max_retries:
           TRY:
              Initialize Polly client
              Convert text to speech
              Save audio file
              RETURN success
           CATCH throttling:
              Wait with exponential backoff
              retry_count += 1
           CATCH other errors:
              retry_count += 1
              Continue or Break based on error type
     CATCH any errors:
         RETURN error

For a list of different voices that you can use in Amazon Polly, see Available voices in the Amazon Polly Developer Guide.

Your final output with Polly should sound something like this:

Clean up

It’s a best practice to delete the resources you provisioned for this solution. If you used an EC2 or SageMaker Notebook Instance, stop or terminate it. Remember to delete unused files from your S3 bucket (eg: video_analysis.txt and video_analysis.mp3).

Conclusion

Recapping the solution at a high level, in this post, you used:

Amazon S3 to store the original video, intermediate data, and the final audio description artifacts
Amazon Rekognition to partition the video file into time-stamped scenes
Computer vision capabilities from Amazon Nova Pro (available through Amazon Bedrock) to analyze the contents of each scene

We showed you how to use Amazon Polly to create an MP3 audio file from the final scene description text file, which is what will be consumed by the audience members. The solution outlined in this post demonstrates how to fully automate the process of creating audio descriptions for video content to improve accessibility. By using Amazon Rekognition for video segmentation, the Amazon Nova Pro model for scene analysis, and Amazon Polly for text-to-speech, you can generate a comprehensive audio description track that narrates the key visual elements of a video. This end-to-end automation can significantly reduce the time and cost required to make video content accessible for visually impaired audiences, helping businesses and organizations meet their accessibility goals. With the power of AWS AI services, this solution provides a scalable and efficient way to improve accessibility and inclusion for video-based media.

This solution isn’t limited to using it for TV shows and movies. Any visual media that requires accessibility can be a candidate! For more information about the new Amazon Nova model family and the amazing things these models can do, see Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance.

In addition to the steps described in this post, additional actions you might need to take include:

Removing a video segment analysis’s introductory text from Amazon Nova. When Amazon Nova returns a response, it might begin with something like “In this video…” or something similar. You probably want just the video description itself without this introductory text. If there is introductory text in your scene descriptions, then Amazon Polly will speak it aloud and impact the quality of your audio transcriptions. You can account for this in a few ways.
- For example, prior to sending it to Amazon Polly, you can modify the generated scene descriptions by programmatically removing that type of text from them.
- Alternatively, you can use prompt engineering to request that Amazon Bedrock return only the scene descriptions in a structured format or without any additional commentary.
- The third option is to define and use a tool when performing inference on Amazon Bedrock. This can be a more comprehensive technique of defining the format of the output that you want Amazon Bedrock to return. Using tools to shape model output, is known as function calling. For more information, see Use a tool to complete an Amazon Bedrock model response.
You should also be mindful of the architectural components of the solution. In a production environment, being mindful of any potential scaling, security, and storage elements is important because the architecture might begin to resemble something more complex than the basic solution architecture diagram that this post began with.

About the Authors

Dylan Martin is an AWS Solutions Architect, working primarily in the generative AI space helping AWS Technical Field teams build AI/ML workloads on AWS. He brings his experience as both a security solutions architect and software engineer. Outside of work he enjoys motorcycling, the French Riviera and studying languages.

Ankit Patel is an AWS Solutions Developer, part of the Prototyping And Customer Engineering (PACE) team. Ankit helps customers bring their innovative ideas to life by rapid prototyping; using the AWS platform to build, orchestrate, and manage custom applications.

Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.

The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

This post is organized as follows:

Overview of Llama 3.3 Swallow
Architecture for Llama 3.3 Swallow training
Software stack and optimizations employed in Llama 3.3 Swallow training
Experiment management

We discuss topics relevant to machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. We welcome readers who understand model parallelism and optimization techniques, especially those interested in continuous pre-training and supervised fine-tuning approaches.

Overview of the Llama 3.3 Swallow

Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta’s Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).

The model is available in two variants on Hugging Face:

Llama 3.3 Swallow 70B Base v0.4 – The pretrained base model, which serves as the foundation for Japanese language understanding
Llama 3.3 Swallow 70B Instruct v0.4 – The instruction-tuned model, optimized for dialogue and task completion

Both variants are accessible through the tokyotech-llm organization on Hugging Face, providing researchers and developers with flexible options for different application needs.

Training methodology

The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data primarily consisted of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for the base model training with approximately 314 billion tokens. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (H100, 80 GB, 256 GPUs) for continual pre-training with 16 days and 6 hours.

Training Data	Number of Training Tokens
Japanese Swallow Corpus v2	210 billion
Japanese Wikipedia	5.3 billion
English Wikipedia	6.9 billion
English Cosmopedia	19.5 billion
English DCLM baseline	12.8 billion
Laboro ParaCorpus	1.4 billion
Code Swallow-Code	50.2 billion
Math Finemath-4+	7.85 billion

For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created through supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the previous Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.

Training Data	Number of Training Samples
Gemma-2-LMSYS-Chat-1M-Synth	240,000
Swallow-Magpie-Ultra-v0.1	42,000
Swallow-Gemma-Magpie-v0.1	99,000
Swallow-Code-v0.3-Instruct-style	380,000

Performance and benchmarks

The base model has demonstrated remarkable performance in Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI’s GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model’s enhanced ability to understand and generate Japanese text. The following graph illustrates the base model performance comparison across these different benchmarks (original image).

The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics (original image).

Licensing and usage

The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.

Training infrastructure architecture

The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store lifecycle scripts, which are used to configure the SageMaker HyperPod cluster.

Before deploying this infrastructure, it’s essential to make sure the AWS account has the appropriate service quotas. The deployment of SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quota against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.

The following diagram illustrates the high-level architecture of the training infrastructure.

Compute and network configuration

The compute infrastructure is based on SageMaker HyperPod using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single spine configuration to provide minimal latency between instances. All communication between GPUs is handled through NCCL over an Elastic Fabric Adapter (EFA), providing high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.

Storage architecture

The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, crucial for handling the massive datasets required for the 70-billion-parameter model.

The following diagram illustrates the storage hierarchy implementation.

The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:


aws fsx create-data-repository-association 
    --file-system-id ${FSX_ID} 
    --file-system-path "/hsmtest" 
    --data-repository-path s3://${BUCKET_NAME_DATA} 
    --s3 AutoImportPolicy='{Events=[NEW,CHANGED,DELETED]}',AutoExportPolicy={Events=[NEW,CHANGED,DELETED]} 
    --batch-import-meta-data-on-create 
    --region ${AWS_REGION}

Observability stack

The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for any anomalies through Grafana Dashboards. The following screenshot shows an example of a GPU health dashboard.

Software stack and training optimizations

The training environment is built on SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu base Amazon Machine Image (AMI) with essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.

Distributed training implementation

The training implementation uses Megatron-LM’s advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, along with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.

Advanced parallelism and communication

The team used a comprehensive 4D parallelism strategy of Megatron-LM that maximizes GPU utilization through careful optimization of communication patterns across multiple dimensions: data, tensor, and pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension—together enabling efficient training of massive models.

The implementation overlaps communication across data parallelism, tensor parallelism, and pipeline parallelism domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training (original image).

Megatron-LM enables fine-grained communication overlapping through multiple configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.

Checkpointing strategy

The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by tensor and pipeline parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, enables the system to:

Save checkpoints up to 10 times faster compared to synchronous approaches
Minimize training interruption by offloading I/O operations
Scale checkpoint performance with the total number of GPUs
Maintain consistency through coordinated distributed saves

The checkpointing system automatically saves model states to the FSx Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3 storage.

For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).

Experiment management

In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all possible 4D parallelism configurations. Based on proven algorithmic research, this tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.

The following screenshot shows an example of the memory consumption prediction tool interface (original image).

Training pipeline management

The success of the training process heavily relied on maintaining high-quality data pipelines. The team implemented rigorous data curation processes and robust cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains.For experiment planning, version control was critical. The team first fixed the versions of pre-training libraries and instruction tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to provide reproducibility. After the versions were locked, the team conducted short-duration training runs to:

Measure throughput with different numbers of GPU nodes
Search for optimal configurations among distributed training settings identified by the memory prediction library
Establish accurate training time estimates for scheduling

The following screenshot shows an example experiment schedule showing GPU node allocation, expected training duration, and key milestones across different training phases (original image).

To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:

# Preload data to Lustre filesystem
find <data/path> -type f -print0 | xargs -0 -n 1 -P 8 sudo lfs 
hsm_restore

Monitoring and performance management

The team implemented a comprehensive monitoring system focused on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a set of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack (original image).

The monitoring infrastructure excels at identifying both job failures and performance bottlenecks like stragglers. The following figure presents an example of straggler detection showing training throughput degradation.

Conclusion

The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs using cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.

The project’s success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy implementation has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model.Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tools, so other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, along with continued enhancement of Japanese language capabilities. The project’s success also paves the way for expanded model applications across various domains.

Resources and references

This section provides key resources and references for understanding and replicating the work described in this paper. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.

Documentation

The following resources provide detailed information about the technologies and frameworks used in this project:

Model resources

For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:

About the Authors

Kazuki Fujii graduated with a bachelor’s degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master’s student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model’s performance and infrastructure efficiency.

Daisuke Miyamato is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He is mainly supporting HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.

Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development initiatives and advanced computing projects.

Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Get an audio overview of Search results in Labs, then click through to learn more.

Today, we’re launching a new Search experiment in Labs – Audio Overviews, which uses our latest Gemini models to generate quick, conversational audio overviews for certa…Read More

Behind “ANCESTRA”: combining Veo with live-action filmmaking

We partnered with Darren Aronofsky, Eliza McNitt and a team of more than 200 people to make a film using Veo and live-action filmmaking.Read More

Behind “ANCESTRA:” combining Veo with live-action filmmaking

We partnered with Darren Aronofsky, Eliza McNitt and a team of more than 200 to make ANCESTRA.Read More

NVIDIA and Deutsche Telekom Partner to Advance Germany’s Sovereign AI

Industrial AI isn’t slowing down. Germany is ready.

Following London Tech Week and GTC Paris at VivaTech, NVIDIA founder and CEO Jensen Huang’s European tour continued with a stop in Germany to discuss with Chancellor Friedrich Merz new partnerships poised to bring breakthrough innovations on the world’s first industrial AI cloud.

This AI factory, to be located in Germany and operated by Deutsche Telekom, will enable Europe’s industrial leaders to accelerate manufacturing applications including design, engineering, simulation, digital twins and robotics.

“In the era of AI, every manufacturer needs two factories: one for making things, and one for creating the intelligence that powers them,” said Jensen Huang, founder and CEO of NVIDIA. “By building Europe’s first industrial AI infrastructure, we’re enabling the region’s leading industrial companies to advance simulation-first, AI-driven manufacturing.”

“Europe’s technological future needs a sprint, not a stroll,” said Timotheus Höttges, CEO of Deutsche Telekom AG. “We must seize the opportunities of artificial intelligence now, revolutionize our industry and secure a leading position in the global technology competition. Our economic success depends on quick decisions and collaborative innovations.”

This AI infrastructure — Germany’s single largest AI deployment — is an important leap for the nation in establishing its own sovereign AI infrastructure and providing a launchpad to accelerate AI development and adoption across industries. In its first phase, it’ll feature 10,000 NVIDIA Blackwell GPUs — spanning NVIDIA DGX GB200 systems and NVIDIA RTX PRO Servers — as well as NVIDIA networking and AI software.

NEURA Robotics’ training center for cognitive robots.

NEURA Robotics, a Germany-based global pioneer in physical AI and cognitive robotics, will use the computing resources to power its state-of-the-art training centers for cognitive robots — a tangible example of how physical AI can evolve through powerful, connected infrastructure.

At this work’s core is the Neuraverse, a seamlessly networked robot ecosystem that allows robots to learn from each other across a wide range of industrial and domestic applications. This platform creates an app-store-like hub for robotic intelligence — for tasks like welding and ironing — enabling continuous development and deployment of robotic skills in real-world environments.

“Physical AI is the electricity of the future — it will power every machine on the planet,” said David Reger, founder and CEO of NEURA Robotics. “Through this initiative, we’re helping build the sovereign infrastructure Europe needs to lead in intelligent robotics and stay in control of its future.”

Critical to Germany’s competitiveness is AI technology development, including the expansion of data center capacity, according to a Deloitte study. This is strategically important because demand for data center capacity is expected to triple over the next five years to 5 gigawatts.

Driving Germany’s Industrial Ecosystem

Deutsche Telekom will operate the AI factory and provide AI cloud computing resources to Europe’s industrial ecosystem.

Customers will be able to run NVIDIA CUDA-X libraries, as well as NVIDIA RTX- and Omniverse-accelerated workloads from leading software providers such as Siemens, Ansys, Cadence and Rescale.

Many more stand to benefit. From the country’s robust small- and medium-sized businesses, known as the Mittelstand, to academia, research and major enterprises — the AI factory offers strategic technology leaps.

A Speedboat Toward AI Gigafactories

The industrial AI cloud will accelerate AI development and adoption from European manufacturers, driving simulation-first, AI-driven manufacturing practices and helping prepare for the country’s transition to AI gigafactories, the next step in Germany’s sovereign AI infrastructure journey.

The AI gigafactory initiative is a 100,000 GPU-powered program backed by the European Union, Germany and partners.

Poised to go online in 2027, it’ll provide state-of-the-art AI infrastructure that gives enterprises, startups, researchers and universities access to accelerated computing through the establishment and expansion of high-performance computing centers.

As of March, there are about 900 Germany-based members of the NVIDIA Inception program for cutting-edge startups, all of which will be eligible to access the AI resources.

NVIDIA offers learning courses through its Deep Learning Institute to promote education and certification in AI across the globe, and those resources are broadly available across Germany’s computing ecosystem to offer upskilling opportunities.

Additional European telcos are building AI infrastructure for regional enterprises to build and deploy agentic AI applications.

Learn more about the latest AI advancements by watching Huang’s GTC Paris keynote in replay.

Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks

We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired from speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al, 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination…Apple Machine Learning Research

Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

This post was co-written with Renato Nascimento, Felipe Viana, Andre Von Zuben from Articul8.

Generative AI is reshaping industries, offering new efficiencies, automation, and innovation. However, generative AI requires powerful, scalable, and resilient infrastructures that optimize large-scale model training, providing rapid iteration and efficient compute utilization with purpose-built infrastructure and automated cluster management.

In this post, we share how Articul8 is accelerating their training and deployment of domain-specific models (DSMs) by using Amazon SageMaker HyperPod and achieving over 95% cluster utilization and a 35% improvement in productivity.

What is SageMaker HyperPod?

SageMaker HyperPod is an advanced distributed training solution designed to accelerate the development of scalable, reliable, and secure generative AI model development. Articul8 uses SageMaker HyperPod to efficiently train large language models (LLMs) on diverse, representative data and uses its observability and resiliency features to keep the training environment stable over the long duration of training jobs. SageMaker HyperPod provides the following features:

Fault-tolerant compute clusters with automated faulty node replacement during model training
Efficient cluster utilization through observability and performance monitoring
Seamless model experimentation with streamlined infrastructure orchestration using Slurm and Amazon Elastic Kubernetes Service (Amazon EKS)

Who is Articul8?

Articul8 was established to address the gaps in enterprise generative AI adoption by developing autonomous, production-ready products. For instance, they found that most general-purpose LLMs often fall short in delivering the accuracy, efficiency, and domain-specific knowledge needed for real-world business challenges. They are pioneering a set of DSMs that offer twofold better accuracy and completeness, compared to general-purpose models, at a fraction of the cost. (See their recent blog post for more details.)

The company’s proprietary ModelMesh technology serves as an autonomous layer that decides, selects, executes, and evaluates the right models at runtime. Think of it as a reasoning system that determines what to run, when to run it, and in what sequence, based on the task and context. It evaluates responses at every step to refine its decision-making, enabling more reliable and interpretable AI solutions while dramatically improving performance.

Articul8’s ModelMesh supports:

LLMs for general tasks
Domain-specific models optimized for industry-specific applications
Non-LLMs for specialized reasoning tasks or established domain-specific tasks (for example, scientific simulation)

Articul8’s domain-specific models are setting new industry standards across supply chain, energy, and semiconductor sectors. The A8-SupplyChain model, built for complex workflows, achieves 92% accuracy and threefold performance gains over general-purpose LLMs in sequential reasoning. In energy, A8-Energy models were developed with EPRI and NVIDIA as part of the Open Power AI Consortium, enabling advanced grid optimization, predictive maintenance, and equipment reliability. The A8-Semicon model has set a new benchmark, outperforming top open-source (DeepSeek-R1, Meta Llama 3.3/4, Qwen 2.5) and proprietary models (GPT-4o, Anthropic’s Claude) by twofold in Verilog code accuracy, all while running at 50–100 times smaller model sizes for real-time AI deployment.

Articul8 develops some of their domain-specific models using Meta’s Llama family as a flexible, open-weight foundation for expert-level reasoning. Through a rigorous fine-tuning pipeline with reasoning trajectories and curated benchmarks, general Llama models are transformed into domain specialists. To tailor models for areas like hardware description languages, Articul8 applies Reinforcement Learning with Verifiable Rewards (RLVR), using automated reward pipelines to specialize the model’s policy. In one case, a dataset of 50,000 documents was automatically processed into 1.2 million images, 360,000 tables, and 250,000 summaries, clustered into a knowledge graph of over 11 million entities. These structured insights fuel A8-DSMs across research, product design, development, and operations.

How SageMaker HyperPod accelerated the development of Articul8’s DSMs

Cost and time to train DSMs is critical for success for Articul8 in a rapidly evolving ecosystem. Training high-performance DSMs requires extensive experimentation, rapid iteration, and scalable compute infrastructure. With SageMaker HyperPod, Articul8 was able to:

Rapidly iterate on DSM training – SageMaker HyperPod resiliency features enabled Articul8 to train and fine-tune its DSMs in a fraction of the time required by traditional infrastructure
Optimize model training performance – By using the automated failure recovery feature in SageMaker HyperPod, Articul8 provided stable and resilient training processes
Reduce AI deployment time by four times and lower total cost of ownership by five times – The orchestration capabilities of SageMaker HyperPod alleviated the manual overhead of cluster management, allowing Articul8’s research teams to focus on model optimization rather than infrastructure upkeep

These advantages contributed to record-setting benchmark results by Articul8, proving that domain-specific models deliver superior real-world performance compared to general-purpose models.

Distributed training challenges and the role of SageMaker HyperPod

Distributed training across hundreds of nodes faces several critical challenges beyond basic resource constraints. Managing massive training clusters requires robust infrastructure orchestration and careful resource allocation for operational efficiency. SageMaker HyperPod offers both managed Slurm and Amazon EKS orchestration experience that streamlines cluster creation, infrastructure resilience, job submission, and observability. The following details focus on the Slurm implementation for reference:

Cluster setup – Although setting up a cluster is a one-time effort, the process is streamlined with a setup script that walks the administrator through each step of cluster creation. This post shows how this can be done in discrete steps.
Resiliency – Fault tolerance becomes paramount when operating at scale. SageMaker HyperPod handles node failures and network interruptions by replacing faulty nodes automatically. You can add the flag --auto-resume=1 with the Slurm srun command, and the distributed training job will recover from the last checkpoint.
Job submission – SageMaker HyperPod managed Slurm orchestration is a powerful way for data scientists to submit and manage distributed training jobs. Refer to the following example in the AWS-samples distributed training repo for reference. For instance, a distributed training job can be submitted with a Slurm sbatch command: sbatch 1.distributed-training-llama2.sbatch. You can use squeue and scancel to view and cancel jobs, respectively.
Observability – SageMaker HyperPod uses Amazon CloudWatch and open source managed Prometheus and Grafana services for monitoring and logging. Cluster administrators can view the health of the infrastructure (network, storage, compute) and utilization.

Solution overview

The SageMaker HyperPod platform enables Articul8 to efficiently manage high-performance compute clusters without requiring a dedicated infrastructure team. The service automatically monitors cluster health and replaces faulty nodes, making the deployment process frictionless for researchers.

To enhance their experimental capabilities, Articul8 integrated SageMaker HyperPod with Amazon Managed Grafana, providing real-time observability of GPU resources through a single-pane-of-glass dashboard. They also used SageMaker HyperPod lifecycle scripts to customize their cluster environment and install required libraries and packages. This comprehensive setup empowers Articul8 to conduct rapid experimentation while maintaining high performance and reliability—they reduced their customers’ AI deployment time by four times and lowered their total cost of ownership by five times.

The following diagram illustrates the observability architecture.

The platform’s efficiency in managing computational resources with minimum downtime has been particularly valuable for Articul8’s research and development efforts, empowering them to quickly iterate on their generative AI solutions while maintaining enterprise-grade performance standards. The following sections describe the setup and results in detail.

For the setup for this post, we begin with the AWS published workshop for SageMaker HyperPod, and adjust it to suit our workload.

Prerequisites

The following two AWS CloudFormation templates address the prerequisites of the solution setup.

For SageMaker HyperPod

This CloudFormation stack addresses the prerequisites for SageMaker HyperPod:

VPC and two subnets – A public subnet and a private subnet are created in an Availability Zone (provided as a parameter). The virtual private cloud (VPC) contains two CIDR blocks with 10.0.0.0/16 (for the public subnet) and 10.1.0.0/16 (for the private subnet). An internet gateway and NAT gateway are deployed in the public subnet.
Amazon FSx for Lustre file system – An Amazon FSx for Lustre volume is created in the specified Availability Zone, with a default of 1.2 TB storage, which can be overridden by a parameter. For this case study, we increased the storage size to 7.2 TB.
Amazon S3 bucket – The stack deploys endpoints for Amazon Simple Storage Service (Amazon S3) to store lifecycle scripts.
IAM role – An AWS Identity and Access Management (IAM) role is also created to help execute SageMaker HyperPod cluster operations.
Security group – The script creates a security group to enable EFA communication for multi-node parallel batch jobs.

For cluster observability

To get visibility into cluster operations and make sure workloads are running as expected, an optional CloudFormation stack has been used for this case study. This stack includes:

Node exporter – Supports visualization of CPU load averages, memory and disk usage, network traffic, file system, and disk I/O metrics
NVIDIA DCGM – Supports visualization of GPU utilization, temperatures, power usage, and memory usage
EFA metrics – Supports visualization of EFA network and error metrics, EFA RDMA performance, and so on.
FSx for Lustre – Supports visualization of file system read/write operations, free capacity, and metadata operations

Observability can be configured through YAML scripts to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are also set up on the cluster nodes.

Using Amazon Managed Grafana with SageMaker HyperPod helps you create dashboards to monitor GPU clusters and make sure they operate efficiently with minimum downtime. In addition, dashboards have become a critical tool to give you a holistic view of how specialized workloads consume different resources of the cluster, helping developers optimize their implementation.

Cluster setup

The cluster is set up with the following components (results might vary based on customer use case and deployment setup):

Head node and compute nodes – For this case study, we use a head node and SageMaker HyperPod compute nodes. The head node has an ml.m5.12xlarge instance, and the compute queue consists of ml.p4de.24xlarge instances.
Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
Local storage – Each node has 8 TB local NVME volume attached for local storage.
Scheduler – Slurm is used as an orchestrator. Slurm is an open source and highly scalable cluster management tool and job scheduling system for high-performance computing (HPC) clusters.
Accounting – As part of cluster configuration, a local MariaDB is deployed that keeps track of job runtime information.

Results

During this project, Articul8 was able to confirm the expected performance of A100 with the added benefit of creating a cluster using Slurm and providing observability metrics to monitor the health of various components (storage, GPU nodes, fiber). The primary validation was on the ease of use and rapid ramp-up of data science experiments. Furthermore, they were able to demonstrate near linear scaling with distributed training, achieving a 3.78 times reduction in time to train for Meta Llama-2 13B with 4x nodes. Having the flexibility to run multiple experiments, without losing development time from infrastructure overhead was an important accomplishment for the Articul8 data science team.

Clean up

If you run the cluster as part of the workshop, you can follow the cleanup steps to delete the CloudFormation resources after deleting the cluster.

Conclusion

This post demonstrated how Articul8 AI used SageMaker HyperPod to overcome the scalability and efficiency challenges of training multiple high-performing DSMs across key industries. By alleviating infrastructure complexity, SageMaker HyperPod empowered Articul8 to focus on building AI systems with measurable business outcomes. From semiconductor and energy to supply chain, Articul8’s DSMs are proving that the future of enterprise AI is not general—it’s purpose-built. Key takeaways include:

DSMs significantly outperform general-purpose LLMs in critical domains
SageMaker HyperPod accelerated the development of Articul8’s A8-Semicon, A8-SupplyChain, and Energy DSM models
Articul8 reduced AI deployment time by four times and lowered total cost of ownership by five times using the scalable, automated training infrastructure of SageMaker HyperPod

Learn more about SageMaker HyperPod by following this workshop. Reach out to your account team on how you can use this service to accelerate your own training workloads.

About the Authors

Yashesh A. Shroff, PhD. is a Sr. GTM Specialist in the GenAI Frameworks organization, responsible for scaling customer foundational model training and inference on AWS using self-managed or specialized services to meet cost and performance requirements. He holds a PhD in Computer Science from UC Berkeley and an MBA from Columbia Graduate School of Business.

Amit Bhatnagar is a Sr Technical Account Manager with AWS, in the Enterprise Support organization, with a focus on generative AI startups. He is responsible for helping key AWS customers with their strategic initiatives and operational excellence in the cloud. When he is not chasing technology, Amit loves to cook vegan delicacies and hit the road with his family to chase the horizon.

Renato Nascimento is the Head of Technology at Articul8, where he leads the development and execution of the company’s technology strategy. With a focus on innovation and scalability, he ensures the seamless integration of cutting-edge solutions into Articul8’s products, enabling industry-leading performance and enterprise adoption.

Felipe Viana is the Head of Applied Research at Articul8, where he leads the design, development, and deployment of innovative generative AI technologies, including domain-specific models, new model architectures, and multi-agent autonomous systems.

Andre Von Zuben is the Head of Architecture at Articul8, where he is responsible for designing and implementing scalable generative AI platform elements, novel generative AI model architectures, and distributed model training and deployment pipelines.

Calling an LLM with an API

Choosing the right model for your use case

Enriching model responses with your proprietary data

Tailoring models to your business needs

Managing and optimizing prompts

Optimizing efficiency with intelligent model selection

Reducing redundant processing for faster responses

Automating multistep tasks with agentic AI

Maintaining security, privacy, and responsible AI practices

Using existing custom models with Amazon Bedrock Custom Model Import

Automating workflows with Amazon Bedrock Flows

Monitoring and logging to close the loop on AI operations

Finalizing and scaling your generative AI solution

Conclusion

About the authors

Solution overview

Results

Conclusion

About the authors

Solution overview

Services used

Prerequisites

Solution walkthrough

Clean up

Conclusion

About the Authors

Overview of the Llama 3.3 Swallow

Training methodology

Performance and benchmarks

Licensing and usage

Training infrastructure architecture

Compute and network configuration

Storage architecture

Observability stack

Software stack and training optimizations

Distributed training implementation

Advanced parallelism and communication

Checkpointing strategy

Experiment management

Training pipeline management

Monitoring and performance management

Conclusion

Resources and references

Documentation

Model resources

About the Authors

Driving Germany’s Industrial Ecosystem

A Speedboat Toward AI Gigafactories

What is SageMaker HyperPod?

Who is Articul8?

How SageMaker HyperPod accelerated the development of Articul8’s DSMs

Distributed training challenges and the role of SageMaker HyperPod

Solution overview

Prerequisites

For SageMaker HyperPod

For cluster observability

Cluster setup

Results

Clean up

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.