How Netsertive built a scalable AI assistant to extract meaningful insights from real-time data using Amazon Bedrock and Amazon Nova

This post was co-written with Herb Brittner from Netsertive.

Netsertive is a leading digital marketing solutions provider for multi-location brands and franchises, helping businesses maximize local advertising, improve engagement, and gain deep customer insights.

With growing demand for more actionable insights from customer call tracking data, Netsertive needed a solution that could unlock business intelligence from every call, making it easier for franchises to improve customer service and boost conversion rates. The team was looking for a single, flexible system that could do several things:

  • Understand phone calls – Automatically create summaries of what was discussed
  • Gauge customer feelings – Determine if the caller was happy, upset, or neutral
  • Identify important topics – Pull out keywords related to frequent services, questions, problems, and mentions of competitors
  • Improve agent performance – Offer advice and suggestions for coaching
  • Track performance over time – Generate reports on trends for individual locations, regions, and the entire country

Crucially, this new system needed to work smoothly with their existing Multi-Location Experience (MLX) platform. The MLX platform is specifically designed for businesses with many locations and helps them manage both national and local marketing. It allows them to run campaigns across various online channels, including search engines, social media, display ads, videos, connected TVs, and online reviews, as well as manage SEO, business listings, reviews, social media posting, and individual location web pages.

In this post, we show how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova, to bring their next generation of the platform to life.

Solution overview

Operating a comprehensive digital marketing solution, Netsertive handles campaign execution while providing key success metrics through their Insights Manager product. The platform features location-specific content management capabilities and robust lead capture functionality, collecting data from multiple sources, including paid campaigns, organic website traffic, and attribution pro forms. With CRM integration and call tracking features, MLX creates a seamless flow of customer data and marketing insights. This combination of managed services, automated tools, and analytics makes MLX a single source of truth for businesses seeking to optimize their digital marketing efforts while taking advantage of Netsertive’s expertise in campaign management.

To address their desire to provide more actionable insights on the platform from customer call tracking data, Netsertive considered various solutions. After evaluating different tools and models, they decided to use Amazon Bedrock and the Amazon Nova Micro model. This choice was driven by the API-driven approach of Amazon Bedrock, its wide selection of large language models (LLMs), and the performance of the Amazon Nova Micro model specifically. They selected Amazon Nova Micro based on its ability to deliver fast response times at a low cost, while providing consistent and intelligent insights—key factors for Netsertive. With its generation speed of over 200 tokens per second and highly performant language understanding skills, this text-only model proved ideal for Netsertive.

The following diagram shows how the MLX platform receives real-time phone calls and uses Amazon Nova Micro in Amazon Bedrock to process them.

AWS architecture for Netsertive showcasing EKS, Aurora, Bedrock integration with insights management and call reporting workflow

The real-time call processing flow consists of the following steps:

  1. When a call comes in, it’s immediately routed to the Lead API. This process captures both the live call transcript and important metadata about the caller. This system continuously processes new calls as they arrive, facilitating real-time handling of incoming communications.
  2. The captured transcript is forwarded to Amazon Bedrock for analysis. The system currently uses a standardized base prompt for all customers, and the architecture is designed to allow for customer-specific prompt customization as an added layer of context.
  3. Amazon Nova Micro processes the transcript and returns a structured JSON response. This response includes multiple analysis components: sentiment analysis of the conversation, a concise call summary, identified key terms, overall call theme classification, and specific coaching suggestions for improvement.
  4. All analysis results are systematically stored in an Amazon Aurora database with their associated key metrics. This makes sure the processed data is properly indexed and readily available for both immediate access and future analysis.
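
To make this flow concrete, the following minimal sketch shows how steps 2 and 3 might look in Python with the Amazon Bedrock Converse API and Amazon Nova Micro. This is not Netsertive’s production code: the base prompt, the JSON field names, and the inference parameters are placeholder assumptions for illustration.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical base prompt; Netsertive's actual prompt is not published
ANALYSIS_PROMPT = (
    "Analyze the following call transcript. Return only a JSON object with the keys "
    "sentiment, summary, key_terms, call_theme, and coaching_suggestions."
)

def analyze_transcript(transcript: str) -> dict:
    response = bedrock.converse(
        modelId="amazon.nova-micro-v1:0",
        system=[{"text": ANALYSIS_PROMPT}],
        messages=[{"role": "user", "content": [{"text": transcript}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    # Nova Micro returns the structured analysis as text; parse it into a dict
    return json.loads(response["output"]["message"]["content"][0]["text"])

The parsed dictionary can then be written to the Aurora database together with the call metadata, as described in step 4.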

The aggregate report schedule flow consists of the following steps:

  1. The aggregate analysis process automatically initiates on both weekly and monthly schedules. During each run, the system gathers call data that falls within the specified time period.
  2. This aggregate analysis uses both Amazon Bedrock and Amazon Nova Micro, applying a specialized prompt designed specifically for trend analysis. This prompt differs from the real-time analysis to focus on identifying patterns and insights across multiple calls.

The processed aggregate data from both workflows is transformed into comprehensive reports displaying trend analysis and comparative metrics through the UI. This provides stakeholders with valuable insights into performance patterns and trends over time while allowing the user to dive deeper into specific metrics.
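
As a rough illustration of the aggregate flow, the sketch below reuses the Bedrock client from the previous sketch with a different, trend-focused prompt applied to a batch of previously stored call summaries. The prompt wording and the way summaries are retrieved from Aurora are assumptions, not Netsertive’s implementation.

# Hypothetical trend-analysis prompt for the weekly or monthly aggregate run
TREND_PROMPT = (
    "You are analyzing summaries of many customer calls for one location over a period. "
    "Identify recurring themes, sentiment trends, common questions, competitor mentions, "
    "and coaching opportunities. Return only a JSON object."
)

def analyze_period(call_summaries: list) -> dict:
    combined = "\n\n".join(call_summaries)  # summaries previously stored in Aurora
    response = bedrock.converse(
        modelId="amazon.nova-micro-v1:0",
        system=[{"text": TREND_PROMPT}],
        messages=[{"role": "user", "content": [{"text": combined}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])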

Results

The implementation of generative AI to create a real-time call data analysis solution has been a transformative journey for Netsertive. Their new Call Insights AI feature, using Amazon Nova Micro on Amazon Bedrock, only takes minutes to create actionable insights, compared to their previous manual call review processes, which took hours or even days for customers with high call volumes.

Netsertive chose Amazon Bedrock and Amazon Nova Micro for their solution after a swift evaluation period of approximately 1 week of testing different tools and models. Their development approach was methodical and customer-focused. The Call Insights AI feature was added to their platform’s roadmap based on direct customer feedback and internal marketing expertise. The entire development process, from creating and testing their Amazon Nova Micro prompts to integrating Amazon Bedrock with their MLX platform, was completed within approximately 30 days before launching in beta.

The transformation of real-time call data analysis isn’t just about processing more calls—it’s about creating a more comprehensive understanding of customer interactions. By implementing Amazon Bedrock and Amazon Nova Micro, Netsertive is able to better understand call purposes and value, enhance measurement capabilities, and progress towards more automated and efficient analysis systems. This evolution can not only streamline operations but also provide customers with more actionable insights about their digital marketing performance.

Conclusion

In this post, we shared how Netsertive introduced a generative AI-powered assistant into MLX, using Amazon Bedrock and Amazon Nova. This solution helped scale their MLX platform to provide their customers with instant, actionable insights, creating a more engaging and informative user experience. By using the advanced natural language processing capabilities of Amazon Bedrock and the high-performance, low-latency Amazon Nova Micro model, Netsertive was able to build a comprehensive call intelligence system that goes beyond just transcription and sentiment analysis.

The success of this project has demonstrated the transformative potential of generative AI in driving business intelligence and operational efficiency. To learn more about building powerful, generative AI assistants and applications using Amazon Bedrock and Amazon Nova, see Generative AI on AWS.


About the authors

Nicholas Switzer is an AI/ML Specialist Solutions Architect at Amazon Web Services. He joined AWS in 2022 and specializes in AI/ML, generative AI, IoT, and edge AI. He is based in the US and enjoys building intelligent products that improve everyday life.

Jane Ridge is a Senior Solutions Architect at Amazon Web Services with over 20 years of technology experience. She joined AWS in 2020 and is based in the US. She is passionate about enabling her customers’ growth through innovative solutions, combined with her deep technical expertise in the AWS ecosystem. She is known for her ability to guide customers through all stages of their cloud journey and deliver impactful solutions.

Herb Brittner is the Vice President of Product & Engineering at Netsertive, where he leads the development of AI-driven digital marketing solutions for multi-location brands and franchises. With a strong background in product innovation and scalable engineering, he specializes in using machine learning and cloud technologies to drive business insights and customer engagement. Herb is passionate about building data-driven platforms that enhance marketing performance and operational efficiency.

Read More

Make videos accessible with automated audio descriptions using Amazon Nova

According to the World Health Organization, more than 2.2 billion people globally have vision impairment. For compliance with disability legislation, such as the Americans with Disabilities Act (ADA) in the United States, visual media like television shows and movies are required to be accessible to visually impaired people. This often comes in the form of audio description tracks that narrate the visual elements of the film or show. According to the International Documentary Association, creating audio descriptions can cost $25 per minute (or more) when using third parties. For building audio descriptions internally, the effort for businesses in the media industry can be significant, requiring content creators, audio description writers, description narrators, audio engineers, delivery vendors, and more, according to the American Council of the Blind (ACB). This leads to a natural question: can you automate this process with the help of generative AI offerings from Amazon Web Services (AWS)?

Newly announced in December at re:Invent 2024, the Amazon Nova foundation model (FM) family is available through Amazon Bedrock and includes three multimodal models:

  • Amazon Nova Lite (GA) – A low-cost multimodal model that’s lightning-fast for processing image, video, and text inputs
  • Amazon Nova Pro (GA) – A highly capable multimodal model with a balanced combination of accuracy, speed, and cost for a wide range of tasks
  • Amazon Nova Premier (GA) – Our most capable model for complex tasks and a teacher for model distillation

In this post, we demonstrate how you can use services like Amazon Nova, Amazon Rekognition, and Amazon Polly to automate the creation of accessible audio descriptions for video content. This approach can significantly reduce the time and cost required to make videos accessible for visually impaired audiences. However, this post doesn’t provide a complete, deployment-ready solution. We share pseudocode snippets and guidance in sequential order, in addition to detailed explanations and links to resources. For a complete script, you can use additional resources, such as Amazon Q Developer, to build a fully functional system.

The automated workflow described in this post involves analyzing video content, generating text descriptions, and narrating them using AI voice generation. While powerful, this workflow requires careful integration and testing to deploy effectively. By the end of this post, you’ll understand the key steps, but some additional work is needed to create a production-ready solution for your specific use case.

Solution overview

The following architecture diagram demonstrates the end-to-end workflow of the proposed solution. We describe each component in depth in later sections of this post, but note that you can define the logic within a single script. You can then run your script on an Amazon Elastic Compute Cloud (Amazon EC2) instance or on your local computer. For this post, we assume that you will run the script on an Amazon SageMaker notebook.

End-to-end AWS workflow demonstrating video content analysis using AI services to generate text descriptions and audio narration

Services used

The services shown in the architecture diagram include:

  1. Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that provides scalable, durable, and highly available storage. In this example, we use Amazon S3 to store the input video files and the scene description (text files) and audio description (MP3 files) outputs generated by the solution. The script starts by fetching the source video from an S3 bucket.
  2. Amazon Rekognition – Amazon Rekognition is a computer vision service that can detect and extract video segments or scenes by identifying technical cues such as shot boundaries, black frames, and other visual elements. To yield higher accuracy for the generated video descriptions, you use Amazon Rekognition to segment the source video into smaller chunks before passing it to Amazon Nova. These video segments can be stored in a temporary directory on your compute machine.
  3. Amazon Bedrock – Amazon Bedrock is a managed service that provides access to large, pre-trained AI models such as the Amazon Nova Pro model, which is used in this solution to analyze the content of each video segment and generate detailed scene descriptions. You can store these text descriptions in a text file (for example, video_analysis.txt).
  4. Amazon Polly – Amazon Polly is a text-to-speech service that is used to convert the text descriptions generated by the Amazon Nova Pro model into high-quality audio, made available using an MP3 file.

Prerequisites

To follow along with the solution outlined in this post, you should have the following in place:

You can use the AWS SDK for Python (Boto3) to create, configure, and manage AWS services. Include it at the top of your script using: import boto3

Additionally, you need a mechanism to split videos. If you’re using Python, we recommend the moviepy library.
import moviepy # pip install moviepy

Solution walkthrough

The solution includes the following steps, which you can use as a basic structure and customize or expand to fit your use case.

  1. Define the requirements for the AWS environment, including defining the use of the Amazon Nova Pro model for its visual support and the AWS Region you’re working in. For optimal throughput, we recommend using inference profiles when configuring Amazon Bedrock to invoke the Amazon Nova Pro model. Initialize a client for Amazon Rekognition, which you use for its support of segmentation.
CLASS VideoAnalyzer:
    FUNCTION initialize():
        Set AWS_REGION to "us-east-1"
        Set MODEL_ID to "amazon.nova-pro-v1:0"
        Set chunk_delay to 20
        Initialize AWS clients (Bedrock and Rekognition)
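
A minimal Python version of this initialization might look like the following; the Region, model ID, and chunk delay mirror the pseudocode and should be adapted to your environment.

import boto3

class VideoAnalyzer:
    def __init__(self):
        self.aws_region = "us-east-1"
        self.model_id = "amazon.nova-pro-v1:0"
        self.chunk_delay = 20  # seconds to pause between chunk analyses to stay under quotas
        # Clients for model inference and video segmentation
        self.bedrock = boto3.client("bedrock-runtime", region_name=self.aws_region)
        self.rekognition = boto3.client("rekognition", region_name=self.aws_region)
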
  2. Define a function for detecting segments in the video. Amazon Rekognition supports segmentation, which means users have the option to detect and extract different segments or scenes within a video. By using the Amazon Rekognition Segment API, you can perform the following:
    1. Detect technical cues such as black frames, color bars, opening and end credits, and studio logos in a video.
    2. Detect shot boundaries to identify the start, end, and duration of individual shots within the video.

The solution uses Amazon Rekognition to partition the video into multiple segments and perform Amazon Nova Pro-based inference on each segment. Finally, you can piece together each segment’s inference output to return a comprehensive audio description for the entire video.

FUNCTION get_segment_results(job_id):
    TRY:
        Initialize empty segments list
        WHILE more results exist:
            Get segment detection results
            Add segments to list
            IF no more results THEN break
        RETURN segments
    CATCH any errors and return null

FUNCTION extract_scene(video_path, start_time, end_time):
    TRY:
        Load video file
        Validate time range
        Create temporary directory
        Extract video segment
        Save segment to file
        RETURN path to saved segment
    CATCH any errors and return null
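
The pseudocode above could be implemented roughly as follows. The Amazon Rekognition pagination calls are standard Boto3; the scene extraction assumes the moviepy 1.x API and writes segments to a temporary directory, so adjust both for your environment.

import os
import tempfile
import boto3
from moviepy.editor import VideoFileClip  # moviepy 1.x import path

rekognition = boto3.client("rekognition")

def get_segment_results(job_id):
    try:
        segments = []
        next_token = None
        while True:
            kwargs = {"JobId": job_id, "MaxResults": 100}
            if next_token:
                kwargs["NextToken"] = next_token
            response = rekognition.get_segment_detection(**kwargs)
            segments.extend(response.get("Segments", []))
            next_token = response.get("NextToken")
            if not next_token:
                break
        return segments
    except Exception as error:
        print(f"Failed to get segment results: {error}")
        return None

def extract_scene(video_path, start_time, end_time):
    try:
        clip = VideoFileClip(video_path)
        end_time = min(end_time, clip.duration)  # validate the time range
        tmp_dir = tempfile.mkdtemp()
        segment_path = os.path.join(tmp_dir, f"segment_{start_time:.3f}_{end_time:.3f}.mp4")
        clip.subclip(start_time, end_time).write_videofile(
            segment_path, codec="libx264", audio_codec="aac", logger=None
        )
        clip.close()
        return segment_path
    except Exception as error:
        print(f"Failed to extract scene: {error}")
        return None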

Three coffee cups on checkered tablecloth and close-up of coffee grounds in cup

In the preceding image, there are two scenes: a screenshot of one scene on the left followed by the scene that immediately follows it on the right. With the Amazon Rekognition segmentation API, you can identify that the scene has changed—that the content that is displayed on screen is different—and therefore you need to generate a new scene description.

  3. Create the segmentation job and:
    • Upload the video file for which you want to create an audio description to Amazon S3.
    • Start the job using that video.

Setting SegmentTypes=['SHOT'] identifies the start, end, and duration of a scene. Additionally, MinSegmentConfidence sets the minimum confidence Amazon Rekognition must have to return a detected segment, with 0 being lowest confidence and 100 being highest.
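
A hedged sketch of starting the shot-detection job is shown below, reusing the Rekognition client from the previous sketch; the bucket, object key, and confidence threshold are placeholders.

def start_shot_detection(bucket, video_key, min_confidence=80.0):
    # Start an asynchronous shot-boundary detection job on the video stored in Amazon S3
    response = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": video_key}},
        SegmentTypes=["SHOT"],
        Filters={"ShotFilter": {"MinSegmentConfidence": min_confidence}},
    )
    return response["JobId"]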

  4. Use the analyze_chunk function. This function defines the main logic of the audio description solution. Some items to note about analyze_chunk:
    • For this example, we sent a video scene to Amazon Nova Pro for an analysis of the contents using the prompt Describe what is happening in this video in detail. This prompt is relatively straightforward, and experimentation or customization for your use case is encouraged. Amazon Nova Pro then returned the text description for our video scene.
    • For longer videos with many scenes, you might encounter throttling. This can be resolved by implementing a retry mechanism. For details on throttling and quotas for Amazon Bedrock, see Quotas for Amazon Bedrock.
FUNCTION analyze_chunk(chunk_path):
    TRY:
        Convert video chunk to base64
        Create request body for Bedrock
        Set max_retries and backoff_time

        WHILE retry_count < max_retries:
            TRY:
                Send InvokeModel request to Bedrock
                RETURN analysis results
            CATCH throttling:
                Wait and retry with exponential backoff
            CATCH other errors:
                Return null
    CATCH any errors:
        Return null
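
The following Python sketch mirrors the analyze_chunk pseudocode. It assumes the Amazon Nova messages-v1 request schema for InvokeModel and a simple exponential backoff; check the Amazon Nova user guide for the current request and response fields before relying on it.

import base64
import json
import time
import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

def analyze_chunk(chunk_path, prompt="Describe what is happening in this video in detail."):
    try:
        with open(chunk_path, "rb") as f:
            video_b64 = base64.b64encode(f.read()).decode("utf-8")

        body = json.dumps({
            "schemaVersion": "messages-v1",  # assumed Nova InvokeModel schema
            "messages": [{
                "role": "user",
                "content": [
                    {"video": {"format": "mp4", "source": {"bytes": video_b64}}},
                    {"text": prompt},
                ],
            }],
            "inferenceConfig": {"max_new_tokens": 512, "temperature": 0.2},
        })

        max_retries, backoff_time = 5, 10
        for attempt in range(max_retries):
            try:
                response = bedrock.invoke_model(modelId="amazon.nova-pro-v1:0", body=body)
                result = json.loads(response["body"].read())
                return result["output"]["message"]["content"][0]["text"]
            except ClientError as error:
                if error.response["Error"]["Code"] == "ThrottlingException":
                    time.sleep(backoff_time * (2 ** attempt))  # exponential backoff
                else:
                    return None
        return None
    except Exception:
        return None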

In effect, the raw scenes are converted into rich, descriptive text. Using this text, you can generate a complete scene-by-scene walkthrough of the video and send it to Amazon Polly for audio.

  5. Use the following code to orchestrate the process:
    1. Initiate the detection of the various segments by using Amazon Rekognition.
    2. Each segment is processed through a flow of:
      1. Extraction.
      2. Analysis using Amazon Nova Pro.
      3. Compiling the analysis into a video_analysis.txt file.
  6. The analyze_video function brings together all the components and produces a text file that contains the complete, scene-by-scene analysis of the video contents, with timestamps.
FUNCTION analyze_video(video_path, bucket):
    TRY:
        Start segment detection
        Wait for job completion
        Get segments
        FOR each segment:
            Extract scene
            Analyze chunk
            Save analysis results
        Write results to file
    CATCH any errors
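
Tying the pieces together, a compact version of analyze_video might look like the following sketch. It reuses the helper functions and clients from the earlier sketches and polls the Amazon Rekognition job until it completes; the output file name matches the one used later in this post.

def analyze_video(video_path, bucket, video_key, output_file="video_analysis.txt"):
    try:
        job_id = start_shot_detection(bucket, video_key)

        # Poll until the segment detection job finishes
        while True:
            status = rekognition.get_segment_detection(JobId=job_id)["JobStatus"]
            if status in ("SUCCEEDED", "FAILED"):
                break
            time.sleep(10)
        if status == "FAILED":
            return None

        segments = get_segment_results(job_id)
        with open(output_file, "w") as out:
            for segment in segments:
                start = segment["StartTimestampMillis"] / 1000
                end = segment["EndTimestampMillis"] / 1000
                chunk_path = extract_scene(video_path, start, end)
                if not chunk_path:
                    continue
                analysis = analyze_chunk(chunk_path)
                out.write(f"Segment {start}-{end} seconds:\n{analysis}\n\n")
                time.sleep(20)  # pause between chunks to reduce throttling
        return output_file
    except Exception as error:
        print(f"Video analysis failed: {error}")
        return None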

If you refer back to the previous screenshot, the output—without any additional refinement—will look similar to the following image.

Three coffee cups on checkered tablecloth and close-up of coffee grounds in cup

“Segment 103.136-126.026 seconds:
[{'text': 'The video shows a close-up of a coffee cup with steam rising from it, followed by three cups of coffee on a table with milk and sugar jars. A person then picks up a bunch of coffee beans from a plant.'}]
Segment 126.059-133.566 seconds:
[{'text': "The video starts with a person's hand, covered in dirt and holding a branch with green leaves and berries. The person then picks up some berries. The video then shows a man standing in a field with trees and plants. He is holding a bunch of red fruits in his right hand and looking at them. He is wearing a shirt and has a mustache. He seems to be picking the fruits. The fruits are probably coffee beans. The area is surrounded by green plants and trees."}]” 

The following screenshot is a more extensive look at the video_analysis.txt file for the coffee.mp4 video:

Detailed video analysis text file displaying 12 chronological segments with timestamps, describing a day's journey from waking up to coffee cultivation and brewing.

  7. Send the contents of the text file to Amazon Polly. Amazon Polly adds a voice to the text file, completing the workflow of the audio description solution.
FUNCTION generate_audio(text_file, output_audio_file):
     TRY:
        Read analysis text
        Set max_retries and backoff_time

        WHILE retry_count < max_retries:
           TRY:
              Initialize Polly client
              Convert text to speech
              Save audio file
              RETURN success
           CATCH throttling:
              Wait with exponential backoff
              retry_count += 1
           CATCH other errors:
              retry_count += 1
              Continue or Break based on error type
     CATCH any errors:
         RETURN error
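
A simple Python version of this step using the Amazon Polly SynthesizeSpeech API might look like the following sketch. The voice ID is a placeholder, and because a single SynthesizeSpeech request accepts only a limited amount of text, a long analysis file may need to be split into smaller chunks before synthesis.

import boto3
from contextlib import closing

def generate_audio(text_file, output_audio_file, voice_id="Joanna"):
    polly = boto3.client("polly")
    with open(text_file, "r") as f:
        text = f.read()

    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine="neural",  # use "standard" if the chosen voice doesn't support the neural engine
    )
    with closing(response["AudioStream"]) as stream, open(output_audio_file, "wb") as out:
        out.write(stream.read())
    return output_audio_file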

For a list of different voices that you can use in Amazon Polly, see Available voices in the Amazon Polly Developer Guide.

Your final output with Polly should sound something like this:

Clean up

It’s a best practice to delete the resources you provisioned for this solution. If you used an Amazon EC2 instance or a SageMaker notebook instance, stop or terminate it. Remember to delete unused files from your S3 bucket (for example, video_analysis.txt and video_analysis.mp3).

Conclusion

Recapping the solution at a high level, in this post, you used:

  • Amazon S3 to store the original video, intermediate data, and the final audio description artifacts
  • Amazon Rekognition to partition the video file into time-stamped scenes
  • Computer vision capabilities from Amazon Nova Pro (available through Amazon Bedrock) to analyze the contents of each scene

We showed you how to use Amazon Polly to create an MP3 audio file from the final scene description text file, which is what will be consumed by the audience members. The solution outlined in this post demonstrates how to fully automate the process of creating audio descriptions for video content to improve accessibility. By using Amazon Rekognition for video segmentation, the Amazon Nova Pro model for scene analysis, and Amazon Polly for text-to-speech, you can generate a comprehensive audio description track that narrates the key visual elements of a video. This end-to-end automation can significantly reduce the time and cost required to make video content accessible for visually impaired audiences, helping businesses and organizations meet their accessibility goals. With the power of AWS AI services, this solution provides a scalable and efficient way to improve accessibility and inclusion for video-based media.

This solution isn’t limited to TV shows and movies. Any visual media that requires accessibility can be a candidate! For more information about the new Amazon Nova model family and the amazing things these models can do, see Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance.

In addition to the steps described in this post, additional actions you might need to take include:

  • Removing a video segment analysis’s introductory text from Amazon Nova. When Amazon Nova returns a response, it might begin with something like “In this video…” or similar. You probably want just the video description itself, without this introductory text. If there is introductory text in your scene descriptions, Amazon Polly will speak it aloud and reduce the quality of your audio descriptions. You can account for this in a few ways.
    • For example, prior to sending it to Amazon Polly, you can modify the generated scene descriptions by programmatically removing that type of text from them.
    • Alternatively, you can use prompt engineering to request that Amazon Bedrock return only the scene descriptions in a structured format or without any additional commentary.
    • The third option is to define and use a tool when performing inference on Amazon Bedrock. This can be a more comprehensive technique for defining the format of the output that you want Amazon Bedrock to return. Using tools to shape model output is known as function calling. For more information, see Use a tool to complete an Amazon Bedrock model response.
  • You should also be mindful of the architectural components of the solution. In a production environment, being mindful of any potential scaling, security, and storage elements is important because the architecture might begin to resemble something more complex than the basic solution architecture diagram that this post began with.

About the Authors

Dylan Martin is an AWS Solutions Architect, working primarily in the generative AI space helping AWS Technical Field teams build AI/ML workloads on AWS. He brings his experience as both a security solutions architect and software engineer. Outside of work he enjoys motorcycling, the French Riviera and studying languages.

Ankit Patel is an AWS Solutions Developer, part of the Prototyping And Customer Engineering (PACE) team. Ankit helps customers bring their innovative ideas to life by rapid prototyping; using the AWS platform to build, orchestrate, and manage custom applications.

Read More

Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.

The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

This post is organized as follows:

  • Overview of Llama 3.3 Swallow
  • Architecture for Llama 3.3 Swallow training
  • Software stack and optimizations employed in Llama 3.3 Swallow training
  • Experiment management

We discuss topics relevant to machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. We welcome readers who understand model parallelism and optimization techniques, especially those interested in continuous pre-training and supervised fine-tuning approaches.

Overview of Llama 3.3 Swallow

Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta’s Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).

The model is available in two variants on Hugging Face: a base model and an instruction-tuned model.

Both variants are accessible through the tokyotech-llm organization on Hugging Face, providing researchers and developers with flexible options for different application needs.

Training methodology

The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data primarily consisted of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for the base model training, totaling approximately 314 billion tokens. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (H100 80 GB, 256 GPUs in total), and continual pre-training took 16 days and 6 hours.

Training Data                 Number of Training Tokens
Japanese Swallow Corpus v2    210 billion
Japanese Wikipedia            5.3 billion
English Wikipedia             6.9 billion
English Cosmopedia            19.5 billion
English DCLM baseline         12.8 billion
Laboro ParaCorpus             1.4 billion
Code Swallow-Code             50.2 billion
Math Finemath-4+              7.85 billion

For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created through supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the previous Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.

Training Data                       Number of Training Samples
Gemma-2-LMSYS-Chat-1M-Synth         240,000
Swallow-Magpie-Ultra-v0.1           42,000
Swallow-Gemma-Magpie-v0.1           99,000
Swallow-Code-v0.3-Instruct-style    380,000

Performance and benchmarks

The base model has demonstrated remarkable performance in Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI’s GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model’s enhanced ability to understand and generate Japanese text. The following graph illustrates the base model performance comparison across these different benchmarks (original image).

The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics (original image).

Licensing and usage

The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.

Training infrastructure architecture

The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store lifecycle scripts, which are used to configure the SageMaker HyperPod cluster.

Before deploying this infrastructure, it’s essential to make sure the AWS account has the appropriate service quotas. The deployment of SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quota against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.

The following diagram illustrates the high-level architecture of the training infrastructure.

Compute and network configuration

The compute infrastructure is based on SageMaker HyperPod using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single spine configuration to provide minimal latency between instances. All communication between GPUs is handled through NCCL over an Elastic Fabric Adapter (EFA), providing high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.

Storage architecture

The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, crucial for handling the massive datasets required for the 70-billion-parameter model.

The following diagram illustrates the storage hierarchy implementation.

The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:


aws fsx create-data-repository-association \
    --file-system-id ${FSX_ID} \
    --file-system-path "/hsmtest" \
    --data-repository-path s3://${BUCKET_NAME_DATA} \
    --s3 AutoImportPolicy='{Events=[NEW,CHANGED,DELETED]}',AutoExportPolicy='{Events=[NEW,CHANGED,DELETED]}' \
    --batch-import-meta-data-on-create \
    --region ${AWS_REGION}

Observability stack

The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for any anomalies through Grafana Dashboards. The following screenshot shows an example of a GPU health dashboard.

Software stack and training optimizations

The training environment is built on SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu base Amazon Machine Image (AMI) with essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.

Distributed training implementation

The training implementation uses Megatron-LM’s advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, along with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.

Advanced parallelism and communication

The team used a comprehensive 4D parallelism strategy in Megatron-LM that maximizes GPU utilization through careful optimization of communication patterns across multiple dimensions: data, tensor, pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension—together enabling efficient training of massive models.

The implementation overlaps communication across data parallelism, tensor parallelism, and pipeline parallelism domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training (original image).

Megatron-LM enables fine-grained communication overlapping through multiple configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.

Checkpointing strategy

The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by tensor and pipeline parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, enables the system to:

  • Save checkpoints up to 10 times faster compared to synchronous approaches
  • Minimize training interruption by offloading I/O operations
  • Scale checkpoint performance with the total number of GPUs
  • Maintain consistency through coordinated distributed saves

The checkpointing system automatically saves model states to the FSx Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3 storage.

For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).
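
As a rough illustration of this approach, the following minimal Python sketch (not the team’s Megatron-LM integration) uses asynchronous saves from PyTorch Distributed Checkpoint. It assumes a recent PyTorch release that exposes dcp.async_save, an already initialized torch.distributed process group (for example, via torchrun), and a hypothetical checkpoint path on the FSx for Lustre mount.

import torch.distributed.checkpoint as dcp

def save_checkpoint_async(model, optimizer, step, ckpt_root="/fsx/checkpoints"):
    # DCP parallelizes the save across every rank and returns a future,
    # so the training loop can continue while checkpoint I/O runs in the background.
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    future = dcp.async_save(state_dict, checkpoint_id=f"{ckpt_root}/step_{step:07d}")
    return future  # call future.result() before starting the next save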

Experiment management

In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all possible 4D parallelism configurations. Based on proven algorithmic research, this tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.

The following screenshot shows an example of the memory consumption prediction tool interface (original image).

Training pipeline management

The success of the training process heavily relied on maintaining high-quality data pipelines. The team implemented rigorous data curation processes and robust cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains.

For experiment planning, version control was critical. The team first fixed the versions of pre-training libraries and instruction tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to provide reproducibility. After the versions were locked, the team conducted short-duration training runs to:

  • Measure throughput with different numbers of GPU nodes
  • Search for optimal configurations among distributed training settings identified by the memory prediction library
  • Establish accurate training time estimates for scheduling

The following screenshot shows an example experiment schedule showing GPU node allocation, expected training duration, and key milestones across different training phases (original image).

To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:

# Preload data to the Lustre file system
find <data/path> -type f -print0 | xargs -0 -n 1 -P 8 sudo lfs hsm_restore

Monitoring and performance management

The team implemented a comprehensive monitoring system focused on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a set of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack (original image).

The monitoring infrastructure excels at identifying both job failures and performance bottlenecks like stragglers. The following figure presents an example of straggler detection showing training throughput degradation.

Conclusion

The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs using cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.

The project’s success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy implementation has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model.

Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tools, so other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, along with continued enhancement of Japanese language capabilities. The project’s success also paves the way for expanded model applications across various domains.

Resources and references

This section provides key resources and references for understanding and replicating the work described in this post. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.

Documentation

The following resources provide detailed information about the technologies and frameworks used in this project:

Model resources

For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:


About the Authors

Kazuki Fujii graduated with a bachelor’s degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master’s student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model’s performance and infrastructure efficiency.

Daisuke Miyamato is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He is mainly supporting HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.

Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development initiatives and advanced computing projects.

Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Read More

Accelerating Articul8’s domain-specific model development with Amazon SageMaker HyperPod

This post was co-written with Renato Nascimento, Felipe Viana, Andre Von Zuben from Articul8.

Generative AI is reshaping industries, offering new efficiencies, automation, and innovation. However, generative AI requires powerful, scalable, and resilient infrastructures that optimize large-scale model training, providing rapid iteration and efficient compute utilization with purpose-built infrastructure and automated cluster management.

In this post, we share how Articul8 is accelerating their training and deployment of domain-specific models (DSMs) by using Amazon SageMaker HyperPod and achieving over 95% cluster utilization and a 35% improvement in productivity.

What is SageMaker HyperPod?

SageMaker HyperPod is an advanced distributed training solution designed to accelerate the development of scalable, reliable, and secure generative AI model development. Articul8 uses SageMaker HyperPod to efficiently train large language models (LLMs) on diverse, representative data and uses its observability and resiliency features to keep the training environment stable over the long duration of training jobs. SageMaker HyperPod provides the following features:

  • Fault-tolerant compute clusters with automated faulty node replacement during model training
  • Efficient cluster utilization through observability and performance monitoring
  • Seamless model experimentation with streamlined infrastructure orchestration using Slurm and Amazon Elastic Kubernetes Service (Amazon EKS)

Who is Articul8?

Articul8 was established to address the gaps in enterprise generative AI adoption by developing autonomous, production-ready products. For instance, they found that most general-purpose LLMs often fall short in delivering the accuracy, efficiency, and domain-specific knowledge needed for real-world business challenges. They are pioneering a set of DSMs that offer twofold better accuracy and completeness, compared to general-purpose models, at a fraction of the cost. (See their recent blog post for more details.)

The company’s proprietary ModelMesh™ technology serves as an autonomous layer that decides, selects, executes, and evaluates the right models at runtime. Think of it as a reasoning system that determines what to run, when to run it, and in what sequence, based on the task and context. It evaluates responses at every step to refine its decision-making, enabling more reliable and interpretable AI solutions while dramatically improving performance.

Articul8’s ModelMesh™ supports:

  • LLMs for general tasks
  • Domain-specific models optimized for industry-specific applications
  • Non-LLMs for specialized reasoning tasks or established domain-specific tasks (for example, scientific simulation)

Articul8’s domain-specific models are setting new industry standards across supply chain, energy, and semiconductor sectors. The A8-SupplyChain model, built for complex workflows, achieves 92% accuracy and threefold performance gains over general-purpose LLMs in sequential reasoning. In energy, A8-Energy models were developed with EPRI and NVIDIA as part of the Open Power AI Consortium, enabling advanced grid optimization, predictive maintenance, and equipment reliability. The A8-Semicon model has set a new benchmark, outperforming top open-source (DeepSeek-R1, Meta Llama 3.3/4, Qwen 2.5) and proprietary models (GPT-4o, Anthropic’s Claude) by twofold in Verilog code accuracy, all while running at 50–100 times smaller model sizes for real-time AI deployment.

Articul8 develops some of their domain-specific models using Meta’s Llama family as a flexible, open-weight foundation for expert-level reasoning. Through a rigorous fine-tuning pipeline with reasoning trajectories and curated benchmarks, general Llama models are transformed into domain specialists. To tailor models for areas like hardware description languages, Articul8 applies Reinforcement Learning with Verifiable Rewards (RLVR), using automated reward pipelines to specialize the model’s policy. In one case, a dataset of 50,000 documents was automatically processed into 1.2 million images, 360,000 tables, and 250,000 summaries, clustered into a knowledge graph of over 11 million entities. These structured insights fuel A8-DSMs across research, product design, development, and operations.

How SageMaker HyperPod accelerated the development of Articul8’s DSMs

Cost and time to train DSMs is critical for success for Articul8 in a rapidly evolving ecosystem. Training high-performance DSMs requires extensive experimentation, rapid iteration, and scalable compute infrastructure. With SageMaker HyperPod, Articul8 was able to:

  • Rapidly iterate on DSM training – SageMaker HyperPod resiliency features enabled Articul8 to train and fine-tune its DSMs in a fraction of the time required by traditional infrastructure
  • Optimize model training performance – By using the automated failure recovery feature in SageMaker HyperPod, Articul8 provided stable and resilient training processes
  • Reduce AI deployment time by four times and lower total cost of ownership by five times – The orchestration capabilities of SageMaker HyperPod alleviated the manual overhead of cluster management, allowing Articul8’s research teams to focus on model optimization rather than infrastructure upkeep

These advantages contributed to record-setting benchmark results by Articul8, proving that domain-specific models deliver superior real-world performance compared to general-purpose models.

Distributed training challenges and the role of SageMaker HyperPod

Distributed training across hundreds of nodes faces several critical challenges beyond basic resource constraints. Managing massive training clusters requires robust infrastructure orchestration and careful resource allocation for operational efficiency. SageMaker HyperPod offers both managed Slurm and Amazon EKS orchestration experience that streamlines cluster creation, infrastructure resilience, job submission, and observability. The following details focus on the Slurm implementation for reference:

  • Cluster setup – Although setting up a cluster is a one-time effort, the process is streamlined with a setup script that walks the administrator through each step of cluster creation. This post shows how this can be done in discrete steps.
  • Resiliency – Fault tolerance becomes paramount when operating at scale. SageMaker HyperPod handles node failures and network interruptions by replacing faulty nodes automatically. You can add the flag --auto-resume=1 with the Slurm srun command, and the distributed training job will recover from the last checkpoint.
  • Job submission – SageMaker HyperPod managed Slurm orchestration is a powerful way for data scientists to submit and manage distributed training jobs. Refer to the following example in the AWS-samples distributed training repo for reference. For instance, a distributed training job can be submitted with a Slurm sbatch command: sbatch 1.distributed-training-llama2.sbatch. You can use squeue and scancel to view and cancel jobs, respectively.
  • Observability – SageMaker HyperPod uses Amazon CloudWatch and open source managed Prometheus and Grafana services for monitoring and logging. Cluster administrators can view the health of the infrastructure (network, storage, compute) and utilization.

Solution overview

The SageMaker HyperPod platform enables Articul8 to efficiently manage high-performance compute clusters without requiring a dedicated infrastructure team. The service automatically monitors cluster health and replaces faulty nodes, making the deployment process frictionless for researchers.

To enhance their experimental capabilities, Articul8 integrated SageMaker HyperPod with Amazon Managed Grafana, providing real-time observability of GPU resources through a single-pane-of-glass dashboard. They also used SageMaker HyperPod lifecycle scripts to customize their cluster environment and install required libraries and packages. This comprehensive setup empowers Articul8 to conduct rapid experimentation while maintaining high performance and reliability—they reduced their customers’ AI deployment time by four times and lowered their total cost of ownership by five times.

The following diagram illustrates the observability architecture.

SageMaker HyperPod Architecture (Slurm)

The platform’s efficiency in managing computational resources with minimum downtime has been particularly valuable for Articul8’s research and development efforts, empowering them to quickly iterate on their generative AI solutions while maintaining enterprise-grade performance standards. The following sections describe the setup and results in detail.

For the setup for this post, we begin with the AWS published workshop for SageMaker HyperPod, and adjust it to suit our workload.

Prerequisites

The following two AWS CloudFormation templates address the prerequisites of the solution setup.

For SageMaker HyperPod

This CloudFormation stack addresses the prerequisites for SageMaker HyperPod:

  • VPC and two subnets – A public subnet and a private subnet are created in an Availability Zone (provided as a parameter). The virtual private cloud (VPC) contains two CIDR blocks with 10.0.0.0/16 (for the public subnet) and 10.1.0.0/16 (for the private subnet). An internet gateway and NAT gateway are deployed in the public subnet.
  • Amazon FSx for Lustre file system – An Amazon FSx for Lustre volume is created in the specified Availability Zone, with a default of 1.2 TB storage, which can be overridden by a parameter. For this case study, we increased the storage size to 7.2 TB.
  • Amazon S3 bucket – The stack deploys endpoints for Amazon Simple Storage Service (Amazon S3) to store lifecycle scripts.
  • IAM role – An AWS Identity and Access Management (IAM) role is also created to help execute SageMaker HyperPod cluster operations.
  • Security group – The script creates a security group to enable EFA communication for multi-node parallel batch jobs.

For cluster observability

To get visibility into cluster operations and make sure workloads are running as expected, an optional CloudFormation stack has been used for this case study. This stack includes:

  • Node exporter – Supports visualization of CPU load averages, memory and disk usage, network traffic, file system, and disk I/O metrics
  • NVIDIA DCGM – Supports visualization of GPU utilization, temperatures, power usage, and memory usage
  • EFA metrics – Supports visualization of EFA network and error metrics, EFA RDMA performance, and so on.
  • FSx for Lustre – Supports visualization of file system read/write operations, free capacity, and metadata operations

Observability can be configured through YAML scripts to monitor SageMaker HyperPod clusters on AWS. Amazon Managed Service for Prometheus and Amazon Managed Grafana workspaces with associated IAM roles are deployed in the AWS account. Prometheus and exporter services are also set up on the cluster nodes.

Using Amazon Managed Grafana with SageMaker HyperPod helps you create dashboards to monitor GPU clusters and make sure they operate efficiently with minimum downtime. In addition, dashboards have become a critical tool to give you a holistic view of how specialized workloads consume different resources of the cluster, helping developers optimize their implementation.

Cluster setup

The cluster is set up with the following components (results might vary based on customer use case and deployment setup):

  • Head node and compute nodes – For this case study, we use a head node and SageMaker HyperPod compute nodes. The head node has an ml.m5.12xlarge instance, and the compute queue consists of ml.p4de.24xlarge instances.
  • Shared volume – The cluster has an FSx for Lustre file system mounted at /fsx on both the head and compute nodes.
  • Local storage – Each node has an 8 TB local NVMe volume attached for local storage.
  • Scheduler – Slurm is used as an orchestrator. Slurm is an open source and highly scalable cluster management tool and job scheduling system for high-performance computing (HPC) clusters.
  • Accounting – As part of cluster configuration, a local MariaDB is deployed that keeps track of job runtime information.
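
The following is a minimal sketch of how a comparable SageMaker HyperPod cluster could be provisioned programmatically with the boto3 CreateCluster API. The lifecycle script location, execution role ARN, cluster name, and instance counts are placeholders, not Articul8's actual configuration.

import boto3

sm = boto3.client("sagemaker")
lifecycle = {"SourceS3Uri": "s3://my-bucket/lifecycle-scripts/", "OnCreate": "on_create.sh"}  # placeholder
role_arn = "arn:aws:iam::123456789012:role/HyperPodExecutionRole"  # placeholder

sm.create_cluster(
    ClusterName="hyperpod-slurm-cluster",  # placeholder name
    InstanceGroups=[
        {   # Slurm controller (head) node
            "InstanceGroupName": "controller",
            "InstanceType": "ml.m5.12xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": lifecycle,
            "ExecutionRole": role_arn,
        },
        {   # compute queue
            "InstanceGroupName": "compute",
            "InstanceType": "ml.p4de.24xlarge",
            "InstanceCount": 4,  # placeholder count
            "LifeCycleConfig": lifecycle,
            "ExecutionRole": role_arn,
        },
    ],
)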

Results

During this project, Articul8 was able to confirm the expected performance of A100 GPUs, with the added benefit of creating a cluster using Slurm and providing observability metrics to monitor the health of various components (storage, GPU nodes, fiber). The primary validation was on the ease of use and rapid ramp-up of data science experiments. Furthermore, they were able to demonstrate near-linear scaling with distributed training, achieving a 3.78 times reduction in time to train Meta Llama 2 13B when scaling to 4x the number of nodes (roughly 95% scaling efficiency). Having the flexibility to run multiple experiments without losing development time to infrastructure overhead was an important accomplishment for the Articul8 data science team.

Clean up

If you run the cluster as part of the workshop, you can follow the cleanup steps to delete the CloudFormation resources after deleting the cluster.

Conclusion

This post demonstrated how Articul8 AI used SageMaker HyperPod to overcome the scalability and efficiency challenges of training multiple high-performing DSMs across key industries. By alleviating infrastructure complexity, SageMaker HyperPod empowered Articul8 to focus on building AI systems with measurable business outcomes. From semiconductor and energy to supply chain, Articul8’s DSMs are proving that the future of enterprise AI is not general—it’s purpose-built. Key takeaways include:

  • DSMs significantly outperform general-purpose LLMs in critical domains
  • SageMaker HyperPod accelerated the development of Articul8’s A8-Semicon, A8-SupplyChain, and Energy DSM models
  • Articul8 reduced AI deployment time by four times and lowered total cost of ownership by five times using the scalable, automated training infrastructure of SageMaker HyperPod

Learn more about SageMaker HyperPod by following this workshop. Reach out to your account team to learn how you can use this service to accelerate your own training workloads.


About the Authors

Yashesh A. Shroff, PhD, is a Sr. GTM Specialist in the GenAI Frameworks organization, responsible for scaling customer foundational model training and inference on AWS using self-managed or specialized services to meet cost and performance requirements. He holds a PhD in Computer Science from UC Berkeley and an MBA from Columbia Graduate School of Business.

Amit Bhatnagar is a Sr Technical Account Manager with AWS, in the Enterprise Support organization, with a focus on generative AI startups. He is responsible for helping key AWS customers with their strategic initiatives and operational excellence in the cloud. When he is not chasing technology, Amit loves to cook vegan delicacies and hit the road with his family to chase the horizon.

Renato Nascimento is the Head of Technology at Articul8, where he leads the development and execution of the company’s technology strategy. With a focus on innovation and scalability, he ensures the seamless integration of cutting-edge solutions into Articul8’s products, enabling industry-leading performance and enterprise adoption.

Felipe Viana is the Head of Applied Research at Articul8, where he leads the design, development, and deployment of innovative generative AI technologies, including domain-specific models, new model architectures, and multi-agent autonomous systems.

Andre Von Zuben is the Head of Architecture at Articul8, where he is responsible for designing and implementing scalable generative AI platform elements, novel generative AI model architectures, and distributed model training and deployment pipelines.

Read More

How VideoAmp uses Amazon Bedrock to power their media analytics interface


This post was co-written with Suzanne Willard and Makoto Uchida from VideoAmp.

In this post, we illustrate how VideoAmp, a media measurement company, worked with the AWS Generative AI Innovation Center (GenAIIC) team to develop a prototype of the VideoAmp Natural Language (NL) Analytics Chatbot to uncover meaningful insights at scale within media analytics data using Amazon Bedrock. The AI-powered analytics solution involved the following components:

  • A natural language to SQL pipeline, with a conversational interface, that works with complex queries and media analytics data from VideoAmp
  • An automated testing and evaluation tool for the pipeline

VideoAmp background

VideoAmp is a tech-first measurement company that empowers media agencies, brands, and publishers to precisely measure and optimize TV, streaming, and digital media. With a comprehensive suite of measurement, planning, and optimization solutions, VideoAmp offers clients a clear, actionable view of audiences and attribution across environments, enabling them to make smarter media decisions that help them drive better business outcomes. VideoAmp has seen incredible adoption for its measurement and currency solutions with 880% YoY growth, 98% coverage of the TV publisher landscape, 11 agency groups, and more than 1,000 advertisers. VideoAmp is headquartered in Los Angeles and New York with offices across the United States. To learn more, visit www.videoamp.com.

VideoAmp’s AI journey

VideoAmp has embraced AI to enhance its measurement and optimization capabilities. The company has integrated machine learning (ML) algorithms into its infrastructure to analyze vast amounts of viewership data across traditional TV, streaming, and digital services. This AI-driven approach allows VideoAmp to provide more accurate audience insights, improve cross-environment measurement, and optimize advertising campaigns in real time. By using AI, VideoAmp has been able to offer advertisers and media owners more precise targeting, better attribution models, and increased return on investment for their advertising spend. The company’s AI journey has positioned it as a leader in the evolving landscape of data-driven advertising and media measurement.

To take their innovations a step further, VideoAmp is building a brand-new analytics solution powered by generative AI, which will provide their customers with accessible business insights. Their goal for a beta product is to create a conversational AI assistant powered by large language models (LLMs) that allows VideoAmp’s data analysts and non-technical users such as content researchers and publishers to perform data analytics using natural language queries.

Use case overview

VideoAmp is undergoing a transformative journey by integrating generative AI into its analytics. The company aims to revolutionize how customers, including publishers, media agencies, and brands, interact with and derive insights from VideoAmp’s vast repository of data through a conversational AI assistant interface.

Presently, analysis by data scientists and analysts is done manually, requires technical SQL knowledge, and can be time-consuming for complex and high-dimensional datasets. Acknowledging the necessity for streamlined and accessible processes, VideoAmp worked with the GenAIIC to develop an AI assistant capable of comprehending natural language queries, generating and executing SQL queries on VideoAmp’s data warehouse, and delivering natural language summaries of retrieved information. The assistant allows non-technical users to surface data-driven insights, and it reduces research and analysis time for both technical and non-technical users.

Key success criteria for the project included:

  • The ability to convert natural language questions into SQL statements, connect to VideoAmp’s provided database, execute statements on VideoAmp performance metrics data, and create a natural language summary of results
  • A UI to ask natural language questions and view assistant output, which includes generated SQL queries, reasoning for the SQL statements, retrieved data, and natural language data summaries
  • Conversational support for the user to iteratively refine and filter asked questions
  • Low latency and cost-effectiveness
  • An automated evaluation pipeline to assess the quality and accuracy of the assistant

The team overcame a few challenges during the development process:

  • Adapting LLMs to understand the domain aspects of VideoAmp’s dataset – The dataset included highly industry-specific fields and metrics, and required complex queries to effectively filter and analyze. The queries often involved multiple specialized metric calculations, filters selecting from over 30 values, and extensive grouping and ordering.
  • Developing an automated evaluation pipeline – The pipeline is able to correctly identify if generated outputs are equivalent to ground truth data, even if they have different column aliasing, ordering, and metric calculations.

Solution overview

The GenAIIC team worked with VideoAmp to create an AI assistant that used Anthropic’s Claude 3 LLMs through Amazon Bedrock. Amazon Bedrock was chosen for this project because it provides access to high-quality foundation models (FMs), including Anthropic’s Claude 3 series, through a unified API. This allowed the team to quickly integrate the most suitable models for different components of the solution, such as SQL generation and data summarization.

Additional features in Amazon Bedrock, including Amazon Bedrock Prompt Management, native support for Retrieval Augmented Generation (RAG) and structured data retrieval through Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and fine-tuning, enable VideoAmp to quickly expand the analytics solution and take it to production. Amazon Bedrock also offers robust security and adheres to compliance certifications, allowing VideoAmp to confidently expand their AI analytics solution while maintaining data privacy and adhering to industry standards.

The solution is connected to a data warehouse. It supports a variety of database connections, such as Snowflake, SingleStore, PostgreSQL, Excel and CSV files, and more. The following diagram illustrates the high-level workflow of the solution.

A diagram illustrating the high-level workflow of VideoAmp's Natural Language Analytics solution

The workflow consists of the following steps:

  1. The user navigates to the frontend application and asks a question in natural language.
  2. A Question Rewriter LLM component uses previous conversational context to augment the question with additional details if applicable. This allows follow-up questions and refinements to previous questions.
  3. A Text-to-SQL LLM component creates a SQL query that corresponds to the user question.
  4. The SQL query is executed in the data warehouse.
  5. A Data-to-Text LLM component summarizes the retrieved data for the user.

The rewritten question, generated SQL, reasoning, and retrieved data are returned at each step.
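
The following simplified sketch illustrates how these stages could be chained with the Amazon Bedrock Converse API. The prompts are heavily abbreviated, the model IDs are indicative, and run_sql() stands in for the data warehouse connector; this is not VideoAmp's production code.

import boto3

bedrock = boto3.client("bedrock-runtime")
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def ask(model_id, system_prompt, user_text):
    # Single-turn call to an Amazon Bedrock model through the Converse API
    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0},
    )
    return response["output"]["message"]["content"][0]["text"]

def answer(question, history, run_sql):
    # 1. Rewriter: fold conversational context into a standalone question
    standalone = ask(SONNET, "Rewrite the question so it stands alone.", f"History: {history}\nQuestion: {question}")
    # 2. Text-to-SQL: generate a query for the rewritten question (prompt abbreviated)
    sql = ask(SONNET, "Write a SQL query for this question. Return SQL only.", standalone)
    # 3. Execute the query against the data warehouse (connector not shown)
    rows = run_sql(sql)
    # 4. Data-to-Text: summarize the retrieved rows for the user
    return ask(HAIKU, "Summarize this data as an answer to the question.", f"Question: {standalone}\nData: {rows}")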

AI assistant workflow details

In this section, we discuss the components of the AI assistant workflow in more detail.

Rewriter

After the user asks the question, the current question and the previous questions the user asked in the current session are sent to the Question Rewriter component, which uses Anthropic’s Claude 3 Sonnet model. If deemed necessary, the LLM uses context from the previous questions to augment the current user question to make it a standalone question with context included. This enables multi-turn conversational support for the user, allowing for natural interactions with the assistant.

For example, if a user first asked, “For the week of 09/04/2023 – 09/10/2023, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”, followed by, “Can I have the same data for one year later”, the rewriter would rewrite the latter question as “For the week of 09/03/2024 – 09/09/2024, what were the top 10 ranked original national broadcast shows based on viewership for households with 18+?”

Text-to-SQL

The rewritten user question is sent to the Text-to-SQL component, which also uses Anthropic’s Claude 3 Sonnet model. The Text-to-SQL component uses information about the database in its prompt to generate a SQL query corresponding to the user question. It also generates an explanation of the query.

The text-to-SQL prompt addressed several challenges, such as industry-specific language in user questions, complex metrics, and several rules and defaults for filtering. The prompt was developed through several iterations, based on feedback and guidance from the VideoAmp team, and manual and automated evaluation.

The prompt consisted of four overarching sections: context, SQL instructions, task, and examples. During the development phase, database schema and domain- or task-specific knowledge were found to be critical, so one major part of the prompt was designed to incorporate them in the context. To make this solution reusable and scalable, a modularized design of the prompt/input system is employed, making it generic so it can be applied to other use cases and domains. The solution can support Q&A with multiple databases by dynamically switching/changing the corresponding context with an orchestrator if needed.

The context section contains the following details:

  • Database schema
  • Sample categories for relevant data fields such as television networks to aid the LLM in understanding what fields to use for identifiers in the question
  • Industry term definitions
  • How to calculate different types of metrics or aggregations
  • Default values or fields to select if not specified
  • Other domain- or task-specific knowledge

The SQL instructions contain the following details:

  • Dynamic insertion of today’s date as a reference for terms, such as “last 3 quarters”
  • Instructions on usage of sub-queries
  • Instructions on when to retrieve additional informational columns not specified in the user question
  • Known SQL syntax and database errors to avoid and potential fixes

In the task section, the LLM is given a detailed step-by-step process to formulate SQL queries based on the context. A step-by-step process is required for the LLM to correctly think through and assimilate the required context and rules. Without the step-by-step process, the team found that the LLM wouldn’t adhere to all instructions provided in the previous sections.

In the examples section, the LLM is given several examples of user questions, corresponding SQL statements, and explanations.

In addition to iterating on the prompt content, different content organization patterns were tested to handle the long context. The final prompt was organized with markdown and XML.
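
As an illustration, the following minimal sketch assembles a prompt from the four sections described above, using markdown headers and XML tags for structure. The section contents and tag names are placeholders, not the actual prompt.

from datetime import date

def build_prompt(schema, domain_notes, sql_rules, examples, question):
    # Context: database schema plus domain- and task-specific knowledge
    context = f"## Context\n<schema>\n{schema}\n</schema>\n<domain_knowledge>\n{domain_notes}\n</domain_knowledge>"
    # SQL instructions: dynamically insert today's date for relative time expressions
    instructions = f"## SQL instructions\nToday's date is {date.today()}.\n{sql_rules}"
    # Task: step-by-step process the LLM must follow
    task = "## Task\nThink step by step: identify metrics, apply defaults and filters, then write the SQL."
    # Examples: question/SQL pairs
    shots = "## Examples\n" + "\n".join(f"<example>\nQ: {q}\nSQL: {s}\n</example>" for q, s in examples)
    return "\n\n".join([context, instructions, task, shots, f"## Question\n{question}"])

Because the context arguments are passed in rather than hardcoded, an orchestrator could swap them per database to reuse the same template across domains, as described earlier.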

SQL execution

After the Text-to-SQL component outputs a query, the query is executed against VideoAmp’s data warehouse using database connector code. For this use case, only read queries for analytics are executed to protect the database from unexpected operations like updates or deletes. The credentials for the database are securely stored and accessed using AWS Secrets Manager and AWS Key Management Service (AWS KMS).
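
The following hedged sketch shows one way such a read-only execution path could look: credentials are fetched from Secrets Manager, and anything other than a SELECT statement is rejected before it reaches the warehouse. The secret name and the psycopg2 connector are assumptions for illustration.

import json
import boto3
import psycopg2

def get_connection(secret_id="analytics/warehouse"):  # placeholder secret name
    # Retrieve database credentials from AWS Secrets Manager
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=secret_id)
    creds = json.loads(secret["SecretString"])
    return psycopg2.connect(host=creds["host"], user=creds["username"],
                            password=creds["password"], dbname=creds["dbname"])

def run_read_only(sql):
    # Guard against unexpected operations like updates or deletes
    if not sql.strip().lower().startswith("select"):
        raise ValueError("Only read (SELECT) queries are allowed")
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()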

Data-to-Text

The data retrieved by the SQL query is sent to the Data-to-Text component, along with the rewritten user question. The Data-to-Text component, which uses Anthropic’s Claude 3 Haiku model, produces a concise summary of the retrieved data and answers the user question.

The final outputs are displayed on the frontend application as shown in the following screenshots (protected data is hidden).

A screenshot showing the outputs of the VideoAmp Natural Language Analytics solution

A screenshot showing the outputs of the VideoAmp Natural Language Analytics solution

Evaluation framework workflow details

The GenAIIC team developed a sophisticated automated evaluation pipeline for VideoAmp’s NL Analytics Chatbot, which directly informed prompt optimization and solution improvements and was a critical component in providing high-quality results.

The evaluation framework comprises two categories:

  • SQL query evaluation – Generated SQL queries are evaluated for overall closeness to the ground truth SQL query. A key feature of the SQL evaluation component was the ability to account for column aliasing and ordering differences when comparing statements and determine equivalency.
  • Retrieved data evaluation – The retrieved data is compared to ground truth data to determine an exact match, after a few processing steps to account for column, formatting, and system differences.

The evaluation pipeline also produces detailed reports of the results and discrepancies between generated data and ground truth data.
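
As a simplified illustration of the retrieved data comparison, the following sketch normalizes column names, column order, and row order before checking for an exact match, so that aliasing and ordering differences alone don't count as errors. It is not the GenAIIC evaluation code.

import pandas as pd

def normalize(df):
    out = df.copy()
    out.columns = [c.strip().lower() for c in out.columns]               # ignore alias casing/whitespace
    out = out.reindex(sorted(out.columns), axis=1)                        # ignore column order
    return out.sort_values(by=list(out.columns)).reset_index(drop=True)   # ignore row order

def exact_match(generated: pd.DataFrame, ground_truth: pd.DataFrame) -> bool:
    # True only if the normalized results are identical
    return normalize(generated).equals(normalize(ground_truth))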

Dataset

The dataset used for the prototype solution was hosted in a data warehouse and consisted of performance metrics data such as viewership, ratings, and rankings for television networks and programs. The field names were industry-specific, so a data dictionary was included in the text-to-SQL prompt as part of the schema. The credentials for the database are securely stored and accessed using Secrets Manager and AWS KMS.

Results

A set of test questions were evaluated by the GenAIIC and VideoAmp teams, focusing on three metrics:

  • Accuracy – Different accuracy metrics were analyzed, but exact matches between retrieved data and ground truth data were prioritized
  • Latency – LLM generation latency, excluding the time taken to query the database
  • Cost – Average cost per user question

Both the evaluation pipeline and human review reported high accuracies on the dataset, while costs and latencies remained low. Overall, the results were well-aligned with VideoAmp expectations. VideoAmp anticipates this solution will make it simple for users to handle complex data queries with confidence through intuitive natural language interactions, reducing the time to business insights.

Conclusion

In this post, we shared how the GenAIIC team worked with VideoAmp to build a prototype of the VideoAmp NL Analytics Chatbot, an end-to-end generative AI data analytics interface using Amazon Bedrock and Anthropic’s Claude 3 LLMs. The solution is equipped with a variety of state-of-the-art LLM-based techniques, such as question rewriting, text-to-SQL query generation, and summarization of data in natural language. It also includes an automated evaluation module for evaluating the correctness of generated SQL statements and retrieved data. The solution achieved high accuracy on VideoAmp’s evaluation samples. Users can interact with the solution through an intuitive AI assistant interface with conversational capabilities.

VideoAmp will soon be launching their new generative AI-powered analytics interface, which enables customers to analyze data and gain business insights through natural language conversation. Their successful work with the GenAIIC team will allow VideoAmp to use generative AI technology to swiftly deliver valuable insights for both technical and non-technical customers.

This is just one of the ways AWS enables builders to deliver generative AI-based solutions. You can get started with Amazon Bedrock and see how it can be integrated in example code bases. The GenAIIC is a group of science and strategy experts with comprehensive expertise spanning the generative AI journey, helping you prioritize use cases, build a roadmap, and move solutions into production. If you’re interested in working with the GenAIIC, reach out to them today.


About the authors

Suzanne Willard is the VP of Engineering at VideoAmp, where she founded and leads the GenAI program, establishing the strategic vision and execution roadmap. With over 20 years of experience, she is driving innovation in AI technologies, creating transformative solutions that align with business objectives and set the company apart in the market.

Makoto Uchida is a senior architect at VideoAmp in the AI domain, acting as area technical lead of AI portfolio, responsible for defining and driving AI product and technical strategy in the content and ads measurement platform PaaS product. Previously, he was a software engineering lead in generative and predictive AI Platform at a major hyperscaler public Cloud service. He has also engaged with multiple startups, laying the foundation of Data/ML/AI infrastructures.

Shreya Mohanty is a Deep Learning Architect at the AWS Generative AI Innovation Center, where she partners with customers across industries to design and implement high-impact GenAI-powered solutions. She specializes in translating customer goals into tangible outcomes that drive measurable impact.

Long Chen is a Sr. Applied Scientist at AWS Generative AI Innovation Center. He holds a Ph.D. in Applied Physics from University of Michigan – Ann Arbor. With more than a decade of experience in research and development, he works on innovative solutions in various domains using generative AI and other machine learning techniques, ensuring the success of AWS customers. His interests include generative models, multi-modal systems, and graph learning.

Amaran Asokkumar is a Deep Learning Architect at AWS, specializing in infrastructure, automation, and AI. He leads the design of GenAI-enabled solutions across industry segments. Amaran is passionate about all things AI and helping customers accelerate their GenAI exploration and transformation efforts.

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

Read More

Scaling up image segmentation across data and tasks


Novel architecture that fuses learnable queries and conditional queries improves a segmentation model's ability to transfer across tasks.

Computer vision

June 12, 12:25 PM

The first draft of this blog post was generated by Amazon Nova Pro, based on detailed instructions from Amazon Science editors and multiple examples of prior posts.

In a paper we’re presenting at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR), we introduce a new approach to image segmentation that scales across diverse datasets and tasks. Traditional segmentation models, while effective on isolated tasks, often struggle as the number of new tasks or unfamiliar scenarios grows. Our proposed method, which uses a model we call a mixed-query transformer (MQ-former), aims to enable joint training and evaluation across multiple tasks and datasets.

Scalable segmentation

Image segmentation is a computer vision task that involves partitioning an image into distinct regions or segments. Each segment corresponds to a different object or part of the scene. There are several types of segmentation tasks, including foreground/background segmentation (distinguishing objects at different distances), semantic segmentation (labeling each pixel as belonging to a particular object class), and instance segmentation (identifying each pixel as belonging to a particular instance of an object class).

An example of instance segmentation, given the text prompt “cardinal”.

Scalability means that a segmentation model can effectively improve with an increase in the size of its training dataset, in the diversity of the tasks it performs, or both. Most prior research has focused on one or the other, data diversity or task diversity. We address both at once.

A tale of two queries

In our paper, we show that one issue preventing effective scalability in segmentation models is the design of object queries. An object query is a way of representing a hypothesis about objects in a scene, a hypothesis that can be tested against images.

There are two main types of object queries. The first, which we refer to as learnable queries, consists of learned vectors that interact with image features and encode information about location and object class. Learnable queries tend to perform well on semantic segmentation because they do not contain object-specific priors.

The second type of object query, which we refer to as a conditional query, is akin to two-stage object detection: region proposals are generated by a Transformer encoder, and then high-confidence proposals are fed into the Transformer decoder as queries to generate the final prediction. Conditional queries are closely aligned with the object classes and excel at object detection and instance segmentation on semantically well-defined objects.

Our approach is to combine both types of queries, which improves the model's ability to transfer across tasks. Our MQ-Former model represents inputs using both learnable queries and conditional queries, and every layer of the decoder has a cross-attention mechanism, so that the processing of the learnable queries can factor in information from the conditional-query processing, and vice versa.
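
The following toy sketch (not the MQ-Former implementation) illustrates the general idea of a mixed-query decoder layer: the two query sets are concatenated so self-attention lets each stream factor in the other, and the combined set then cross-attends to image features. Dimensions and layer structure are illustrative assumptions.

import torch
import torch.nn as nn

class MixedQueryDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, learnable_q, conditional_q, image_features):
        # Concatenate both query types: (batch, num_learnable + num_conditional, dim)
        queries = torch.cat([learnable_q, conditional_q], dim=1)
        # Self-attention lets the two query streams exchange information
        queries = queries + self.self_attn(queries, queries, queries)[0]
        # Cross-attention against image features tests the hypotheses against the image
        queries = queries + self.cross_attn(queries, image_features, image_features)[0]
        n_learnable = learnable_q.shape[1]
        return queries[:, :n_learnable], queries[:, n_learnable:]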

Architectural schematics for learnable queries, conditional queries, and mixed queries. Solid triangles represent instance segmentation ground truth, and striped triangles represent semantic-segmentation ground truth.

Leveraging synthetic data

Mixed queries aid scalability across segmentation tasks, but the other aspect of scalability in segmentation models is dataset size. One of the key challenges in scaling up segmentation models is the scarcity of high-quality, annotated data. To overcome this limitation, we propose leveraging synthetic data.

Examples of synthetic data. At left are two examples of synthetic masks, at right two examples of synthetic captions.

While segmentation data is scarce, object recognition data is plentiful. Object recognition datasets typically include bounding boxes, or rectangles that identify the image regions in which labeled objects can be found.

Asking a trained segmentation model to segment only the object within a bounding box significantly improves performance; we are thus able to use weaker segmentation models to convert object recognition datasets into segmentation datasets that can be used to train stronger segmentation models.

Bounding boxes can also focus automatic captioning models on regions of interest in an image, to provide the type of object classifications necessary to train semantic-segmentation and instance segmentation models.

Experimental results

We evaluated our approach using 15 datasets covering a range of segmentation tasks and found that, with MQ-Former, scaling up both the volume of training data and the diversity of tasks consistently enhances the model's segmentation capabilities.

For example, on the SeginW benchmark, which includes 25 datasets used for open-vocabulary in-the-wild segmentation evaluation, scaling the data and tasks from 100,000 samples to 600,000 boosted performance by 16%, as measured by average precision of object masking. Incorporating synthetic data improved performance by another 14%, establishing a new state of the art.

Research areas: Computer vision

Tags: Image segmentation, Data representation

Read More

Adobe enhances developer productivity using Amazon Bedrock Knowledge Bases


Adobe Inc. excels in providing a comprehensive suite of creative tools that empower artists, designers, and developers across various digital disciplines. Their product landscape is the backbone of countless creative projects worldwide, ranging from web design and photo editing to vector graphics and video production.

Adobe’s internal developers use a vast array of wiki pages, software guidelines, and troubleshooting guides. Recognizing the challenge developers faced in efficiently finding the right information for troubleshooting, software upgrades, and more, Adobe’s Developer Platform team sought to build a centralized system. This led to the initiative Unified Support, designed to help thousands of the company’s internal developers get immediate answers to questions from a centralized place and reduce time and cost spent on developer support. For instance, a developer setting up a continuous integration and delivery (CI/CD) pipeline in a new AWS Region or running a pipeline on a dev branch can quickly access Adobe-specific guidelines and best practices through this centralized system.

The initial prototype for Adobe’s Unified Support provided valuable insights and confirmed the potential of the approach. This early phase highlighted key areas requiring further development to operate effectively at Adobe’s scale, including addressing scalability needs, simplifying resource onboarding, improving content synchronization mechanisms, and optimizing infrastructure efficiency. Building on these learnings, improving retrieval precision emerged as the next critical step.

To address these challenges, Adobe partnered with the AWS Generative AI Innovation Center, using Amazon Bedrock Knowledge Bases and the Vector Engine for Amazon OpenSearch Serverless. This solution dramatically improved their developer support system, resulting in a 20% increase in retrieval accuracy. Metadata filtering empowers developers to fine-tune their search, helping them surface more relevant answers across complex, multi-domain knowledge sources. This improvement not only enhanced the developer experience but also contributed to reduced support costs.

In this post, we discuss the details of this solution and how Adobe enhances their developer productivity.

Solution overview

Our project aimed to address two key objectives:

  • Document retrieval engine enhancement – We developed a robust system to improve search result accuracy for Adobe developers. This involved creating a pipeline for data ingestion, preprocessing, metadata extraction, and indexing in a vector database. We evaluated retrieval performance against Adobe’s ground truth data to produce high-quality, domain-specific results.
  • Scalable, automated deployment – To support Unified Support across Adobe, we designed a reusable blueprint for deployment. This solution accommodates large-scale data ingestion of various types and offers flexible configurations, including embedding model selection and chunk size adjustment.

Using Amazon Bedrock Knowledge Bases, we created a customized, fully managed solution that improved the retrieval effectiveness. Key achievements include a 20% increase in accuracy metrics for document retrieval, seamless document ingestion and change synchronization, and enhanced scalability to support thousands of Adobe developers. This solution provides a foundation for improved developer support and scalable deployment across Adobe’s teams. The following diagram illustrates the solution architecture.

Solution architecture diagram

Let’s take a closer look at our solution:

  • Amazon Bedrock Knowledge Bases index – The backbone of our system is Amazon Bedrock Knowledge Bases. Data is indexed through the following stages:
    • Data ingestion – We start by pulling data from Amazon Simple Storage Service (Amazon S3) buckets. This could be anything from resolutions to past issues or wiki pages.
    • Chunking – Amazon Bedrock Knowledge Bases breaks data down into smaller pieces, or chunks, defining the specific units of information that can be retrieved. This chunking process is configurable, allowing for optimization based on the specific needs of the business.
    • Vectorization – Each chunk is passed through an embedding model (in this case, Amazon Titan V2 on Amazon Bedrock), creating a 1,024-dimension numerical vector. This vector represents the semantic meaning of the chunk, allowing for similarity searches.
    • Storage – These vectors are stored in the Amazon OpenSearch Serverless vector database, creating a searchable repository of information.
  • Runtime – When a user poses a question, our system completes the following steps:
    • Query vectorization – With the Amazon Bedrock Knowledge Bases Retrieve API, the user’s question is automatically embedded using the same embedding model used for the chunks during data ingestion.
    • Similarity search and retrieval – The system retrieves the most relevant chunks in the vector database based on similarity scores to the query.
    • Ranking and presentation – The corresponding documents are ranked based on the semantic similarity of their most relevant chunks to the query, and the top-ranked information is presented to the user.

Multi-tenancy through metadata filtering

As developers, we often find ourselves seeking help across various domains. Whether it’s tackling CI/CD issues, setting up project environments, or adopting new libraries, the landscape of developer challenges is vast and varied. Sometimes, our questions even span multiple domains, making it crucial to have a system for retrieving relevant information. Metadata filtering empowers developers to retrieve not just semantically relevant information, but a well-defined subset of that information based on specific criteria. This powerful tool enables you to apply filters to your retrievals, helping developers narrow the search results to a limited set of documents based on the filter, thereby improving the relevancy of the search.

To use this feature, metadata files are provided alongside the source data files in an S3 bucket. To enable metadata-based filtering, each source data file needs to be accompanied by a corresponding metadata file. These metadata files used the same base name as the source file, with a .metadata.json suffix. Each metadata file included relevant attributes—such as domain, year, or type—to support multi-tenancy and fine-grained filtering in OpenSearch Service. The following code shows what an example metadata file looks like:

{
    "metadataAttributes": {
        "domain": "project A",
        "year": 2016,
        "type": "wiki"
    }
}

Retrieve API

The Retrieve API allows querying a knowledge base to retrieve relevant information. You can use it as follows:

  1. Send a POST request to /knowledgebases/knowledgeBaseId/retrieve.
  2. Include a JSON body with the following:
    1. retrievalQuery – Contains the text query.
    2. retrievalConfiguration – Specifies search parameters, such as number of results and filters.
    3. nextToken – For pagination (optional).

The following is an example request syntax:

POST /knowledgebases/knowledgeBaseId/retrieve HTTP/1.1
Content-type: application/json
{
   "nextToken": "string",
   "retrievalConfiguration": { 
      "vectorSearchConfiguration": { 
         "filter": { ... },
         "numberOfResults": number,
         "overrideSearchType": "string"
      }
   },
   "retrievalQuery": { 
      "text": "string"
   }
}

Additionally, you can set up the retriever with ease using the langchain-aws package:

from langchain_aws import AmazonKnowledgeBasesRetriever
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="YOUR-ID",
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)
retriever.get_relevant_documents(query="What is the meaning of life?")

This approach enables semantic querying of the knowledge base to retrieve relevant documents based on the provided query, simplifying the implementation of search.
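
Building on the metadata files shown earlier, the following hedged sketch combines semantic retrieval with a metadata filter through the same Retrieve API; the knowledge base ID and query text are placeholders.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")
response = agent_runtime.retrieve(
    knowledgeBaseId="YOUR-ID",
    retrievalQuery={"text": "How do I set up a CI/CD pipeline in a new Region?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 4,
            # Only consider chunks whose metadata file sets domain to "project A"
            "filter": {"equals": {"key": "domain", "value": "project A"}},
        }
    },
)
for result in response["retrievalResults"]:
    print(result["content"]["text"])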

Experimentation

To deliver the most accurate and efficient knowledge retrieval system, the Adobe and AWS teams put the solution to the test. The team conducted a series of rigorous experiments to fine-tune the system and find the optimal settings.

Before we dive into our findings, let’s discuss the metrics and evaluation process we used to measure success. We used the open source model evaluation framework Ragas to evaluate the retrieval system across two metrics: document relevance and mean reciprocal rank (MRR). Although Ragas comes with many metrics for evaluating model performance out of the box, we needed to implement these metrics by extending the Ragas framework with custom code.

  • Document relevance – Document relevance offers a qualitative approach to assessing retrieval accuracy. This metric uses a large language model (LLM) as an impartial judge to compare retrieved chunks against user queries. It evaluates how effectively the retrieved information addresses the developer’s question, providing a score between 1–10.
  • Mean reciprocal rank – On the quantitative side, we have the MRR metric. MRR evaluates how well a system ranks the first relevant item for a query. For each query, find the rank k of the highest-ranked relevant document. The score for that query is 1/k. MRR is the average of these 1/k scores over the entire set of queries. A higher score (closer to 1) signifies that the first relevant result is typically ranked high. A minimal calculation sketch follows this list.
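
The following minimal sketch shows the MRR calculation described above; ranks are 1-indexed, and a query with no relevant document in the retrieved list contributes 0.

def mean_reciprocal_rank(ranked_results, relevant_ids):
    # ranked_results: list of retrieved document-ID lists, one per query
    # relevant_ids: list of sets of relevant document IDs, one per query
    scores = []
    for results, relevant in zip(ranked_results, relevant_ids):
        ranks = [i + 1 for i, doc_id in enumerate(results) if doc_id in relevant]
        scores.append(1.0 / ranks[0] if ranks else 0.0)  # 1/k for the first relevant hit
    return sum(scores) / len(scores)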

These metrics provide complementary insights: document relevance offers a content-based assessment, and MRR provides a ranking-based evaluation. Together, they offer a comprehensive view of the retrieval system’s effectiveness in finding and prioritizing relevant information.

In our recent experiments, we explored various data chunking strategies to optimize the performance of retrieval. We tested several approaches, including fixed-size chunking as well as more advanced semantic chunking and hierarchical chunking. Semantic chunking focuses on preserving the contextual relationships within the data by segmenting it based on semantic meaning. This approach aims to improve the relevance and coherence of retrieved results. Hierarchical chunking organizes data into a hierarchical parent-child structure, allowing for more granular and efficient retrieval based on the inherent relationships within your data.

For more information on how to set up different chunking strategies, refer to Amazon Bedrock Knowledge Bases now supports advanced parsing, chunking, and query reformulation giving greater control of accuracy in RAG based applications.

We tested the following chunking methods with Amazon Bedrock Knowledge Bases (a configuration sketch follows the list):

  • Fixed-size short chunking – 400-token chunks with a 20% overlap (shown as the blue variant in the following figure)
  • Fixed-size long chunking – 1,000-token chunks with a 20% overlap
  • Hierarchical chunking – Parent chunks of 1,500 tokens and child chunks of 300 tokens, with a 60-token overlap
  • Semantic chunking – 400-token chunks with a 95% similarity percentile threshold
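
As an illustration, the following hedged sketch configures the fixed-size short chunking variant (400-token chunks with a 20% overlap) on an Amazon Bedrock Knowledge Bases data source; the knowledge base ID, data source name, and bucket ARN are placeholders.

import boto3

bedrock_agent = boto3.client("bedrock-agent")
bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR-KB-ID",
    name="unified-support-docs",  # placeholder name
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::your-source-bucket"},  # placeholder
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 400, "overlapPercentage": 20},
        }
    },
)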

For reference, a paragraph of approximately 1,000 characters typically translates to around 200 tokens. To assess performance, we measured document relevance and MRR across different context sizes, ranging from 1 to 5 retrieved chunks. This comparison aims to provide insights into the most effective chunking strategy for organizing and retrieving information for this use case. The following figures illustrate the MRR and document relevance metrics, respectively.

Experiment results

Experiment results

As a result of these experiments, we found that MRR is a more sensitive metric for evaluating the impact of chunking strategies, particularly when varying the number of retrieved chunks (top-k from 1 to 5). Among the approaches tested, the fixed-size 400-token strategy—shown in blue—proved to be the simplest and most effective, consistently yielding the highest accuracy across different retrieval sizes.

Conclusion

In the journey to design Adobe’s developer Unified Support search and retrieval system, we’ve successfully harnessed the power of Amazon Bedrock Knowledge Bases to create a robust, scalable, and efficient solution. By configuring fixed-size chunking and using the Amazon Titan V2 embedding model, we achieved a remarkable 20% increase in accuracy metrics for document retrieval compared to Adobe’s existing solution, by running evaluations on the customer’s testing system and provided dataset.

The integration of metadata filtering emerged as a game-changing feature, allowing for seamless navigation across diverse domains and enabling customized retrieval. This capability proved invaluable for Adobe, given the complexity and breadth of their information landscape.

Our comprehensive comparison of retrieval accuracy for different configurations of the Amazon Bedrock Knowledge Bases index has yielded valuable insights. The metrics we developed provide an objective framework for assessing the quality of retrieved context, which is crucial for applications demanding high-precision information retrieval. As we look to the future, this customized, fully managed solution lays a solid foundation for continuous improvement in developer support at Adobe, offering enhanced scalability and seamless support infrastructure in tandem with evolving developer needs.

For those interested in working with AWS on similar projects, visit Generative AI Innovation Center. To learn more about Amazon Bedrock Knowledge Bases, see Retrieve data and generate AI responses with knowledge bases.


About the Authors

Kamran Razi is a Data Scientist at the Amazon Generative AI Innovation Center. With a passion for delivering cutting-edge generative AI solutions, Kamran helps customers unlock the full potential of AWS AI/ML services to solve real-world business challenges. With over a decade of experience in software development, he specializes in building AI-driven solutions, including AI agents. Kamran holds a PhD in Electrical Engineering from Queen’s University.

Nay Doummar is an Engineering Manager on the Unified Support team at Adobe, where she’s been since 2012. Over the years, she has contributed to projects in infrastructure, CI/CD, identity management, containers, and AI. She started on the CloudOps team, which was responsible for migrating Adobe’s infrastructure to the AWS Cloud, marking the beginning of her long-term collaboration with AWS. In 2020, she helped build a support chatbot to simplify infrastructure-related assistance, sparking her passion for user support. In 2024, she joined a project to Unify Support for the Developer Platform, aiming to streamline support and boost productivity.

Varsha Chandan Bellara is a Software Development Engineer at Adobe, specializing in AI-driven solutions to boost developer productivity. She leads the development of an AI assistant for the Unified Support initiative, using Amazon Bedrock, implementing RAG to provide accurate, context-aware responses for technical support and issue resolution. With expertise in cloud-based technologies, Varsha combines her passion for containers and serverless architectures with advanced AI to create scalable, efficient solutions that streamline developer workflows.

Jan Michael Ong is a Senior Software Engineer at Adobe, where he supports the developer community and engineering teams through tooling and automation. Currently, he is part of the Developer Experience team at Adobe, working on AI projects and automation contributing to Adobe’s internal Developer Platform.

Justin Johns is a Deep Learning Architect at Amazon Web Services who is passionate about innovating with generative AI and delivering cutting-edge solutions for customers. With over 5 years of software development experience, he specializes in building cloud-based solutions powered by generative AI.

Gaurav Dhamija is a Principal Solutions Architect at Amazon Web Services, where he helps customers design and build scalable, reliable, and secure applications on AWS. He is passionate about developer experience, containers, and serverless technologies, and works closely with engineering teams to modernize application architectures. Gaurav also specializes in generative AI, using AWS generative AI services to drive innovation and enhance productivity across a wide range of use cases.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Anila Joshi has more than a decade of experience building AI solutions. As a Senior Manager, Applied Science at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Read More

Amazon Nova Lite enables Bito to offer a free tier option for its AI-powered code reviews


This post is co-written by Amar Goel, co-founder and CEO of Bito.

Meticulous code review is a critical step in the software development process, one that helps deliver high-quality code that’s ready for enterprise use. However, it can be a time-consuming process at scale, when experts must review thousands of lines of code, looking for bugs or other issues. Traditional reviews can be subjective and inconsistent, because humans are human. As generative AI becomes more and more integrated into the software development process, a significant number of new AI-powered code review tools have entered the marketplace, helping software development groups raise quality and ship clean code faster—while boosting productivity and job satisfaction among developers.

Bito is an innovative startup that creates AI agents for a broad range of software developers. It emerged as a pioneer in its field in 2023, when it launched its first AI-powered developer agents, which brought large language models (LLMs) to code-writing environments. Bito’s executive team sees developers and AI as key elements of the future and works to bring them together in powerful new ways with its expanding portfolio of AI-powered agents. Its flagship product, AI Code Review Agent, speeds pull request (PR) time-to-merge by up to 89%—accelerating development times and allowing developers to focus on writing code and innovating. Intended to assist, not replace, review by senior software engineers, Bito’s AI Code Review Agent focuses on summarizing, organizing, and then suggesting line-level fixes with code base context, freeing engineers to focus on the more strategic aspects of code review, such as evaluating business logic.

With more than 100,000 active developers using its products globally, Bito has proven successful at using generative AI to streamline code review—delivering impressive results and business value to its customers, which include Gainsight, Privado, PubMatic, On-Board Data Systems (OBDS), and leading software enterprises. Bito’s customers have found that AI Code Review Agent provides the following benefits:

  • Accelerates PR merges by 89%
  • Reduces regressions by 34%
  • Delivers 87% of a PR’s feedback necessary for review
  • Has a 2.33 signal-to-noise ratio
  • Works in over 50 programming languages

Although these results are compelling, Bito needed to overcome inherent wariness among developers evaluating a flood of new AI-powered tools. To accelerate and expand adoption of AI Code Review Agent, Bito leaders wanted to launch a free tier option, which would let these prospective users experience the capabilities and value that AI Code Review Agent offered—and encourage them to upgrade to the full, for-pay Teams Plan. Its Free Plan would offer AI-generated PR summaries to provide an overview of changes, and its Teams Plan would provide more advanced features, such as downstream impact analysis and one-click acceptance of line-level code suggestions.

In this post, we share how Bito is able to offer a free tier option for its AI-powered code reviews using Amazon Nova.

Choosing a cost-effective model for a free tier offering

To offer a free tier option for AI Code Review Agent, Bito needed a foundation model (FM) that would provide the right level of performance and results at a reasonable cost. Offering code review for free to its potential customers would not be free for Bito, of course, because it would be paying inference costs. To identify a model for its Free Plan, Bito carried out a 2-week evaluation process across a broad range of models, including the high-performing FMs on Amazon Bedrock, as well as OpenAI GPT-4o mini. The Amazon Nova models—fast, cost-effective models that were recently introduced on Amazon Bedrock—were particularly interesting to the team.

At the end of its evaluation, Bito determined that Amazon Nova Lite delivered the right mix of performance and cost-effectiveness for its use cases. Its speed provided fast creation of code review summaries. However, cost—a key consideration for Bito’s Free Plan—proved to be the deciding factor. Ultimately, Amazon Nova Lite met Bito’s criteria for speed, cost, and quality. The combination of Amazon Nova Lite and Amazon Bedrock also makes it possible for Bito to offer the reliability and security that their customers needed when entrusting Bito with their code. After all, careful control of code is one of Bito’s core promises to its customers. It doesn’t store code or use it for model training. And its products are SOC 2 Type 2-certified to provide data security, processing integrity, privacy, and confidentiality.

Implementing the right models for different tiers of offerings

Bito has now adopted Amazon Bedrock as its standardized platform to explore, add, and run models. Bito uses Amazon Nova Lite as the primary model for its Free Plan, and Anthropic’s Claude 3.7 Sonnet powers its for-pay Teams Plan, all accessed and integrated through the unified Amazon Bedrock API and controls. Amazon Bedrock provides seamless shifting from Amazon Nova Lite to Anthropic’s Sonnet when customers upgrade, with minimal code changes. Bito leaders are quick to point out that Amazon Nova Lite doesn’t just power its Free Plan—it inspired it. Without the very low cost of Amazon Nova Lite, they wouldn’t have been able to offer a free tier of AI Code Review Agent, which they viewed as a strategic move to expand Bito’s enterprise customer base. This strategy quickly generated results, attracting three times more prospective customers to its Free Plan than anticipated. At the end of the 14-day trial period, a significant number of users convert to the full AI Code Review Agent to access its full array of capabilities.
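
The following hedged illustration shows how such a tier-based switch can come down to a single model ID with the unified Amazon Bedrock Converse API. The model IDs are indicative, and the prompt and function are placeholders rather than Bito's production code.

import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_BY_PLAN = {
    "free": "amazon.nova-lite-v1:0",                       # Free Plan
    "teams": "anthropic.claude-3-7-sonnet-20250219-v1:0",  # Teams Plan (may require a cross-Region inference profile)
}

def review_summary(plan, diff_text):
    # The only difference between tiers is the model ID passed to the same API call
    response = bedrock.converse(
        modelId=MODEL_BY_PLAN[plan],
        messages=[{"role": "user", "content": [{"text": f"Summarize this pull request diff:\n{diff_text}"}]}],
        inferenceConfig={"maxTokens": 1024},
    )
    return response["output"]["message"]["content"][0]["text"]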

Encouraged by the success with AI Code Review Agent, Bito is now using Amazon Nova Lite to power the chat capabilities of Bito Wingman, its latest AI agentic technology—a full-featured developer assistant in the integrated development environment (IDE) that combines code generation, error handling, architectural advice, and more. Again, the combination of quality and low cost of Amazon Nova Lite made it the right choice for Bito.

Conclusion

In this post, we shared how Bito—an innovative startup that offers a growing portfolio of AI-powered developer agents—chose Amazon Nova Lite to power its free tier offering of AI Code Review Agent, its flagship product. Its AI-powered agents are designed specifically to make developers’ lives easier and their work more impactful:

  • Amazon Nova Lite enabled Bito to meet one of its core business challenges—attracting enterprise customers. By introducing a free tier, Bito attracted three times more prospective new customers to its generative AI-driven flagship product—AI Code Review Agent.
  • Amazon Nova Lite outperformed other models during rigorous internal testing, providing the right level of performance at the very low cost Bito needed to launch a free tier of AI Code Review Agent.
  • Amazon Bedrock empowers Bito to seamlessly switch between models as needed for each tier of AI Code Review Agent—Amazon Nova Lite for its Free Plan and Anthropic’s Claude 3.7 Sonnet for its for-pay Teams Plan. Amazon Bedrock also provided security and privacy, critical considerations for Bito customers.
  • Bito shows how innovative organizations can use the combination of quality, cost-effectiveness, and speed in Amazon Nova Lite to deliver value to their customers—and to their business.

“Our challenge is to push the capabilities of AI to deliver new value to developers, but at a reasonable cost,” shares Amar Goel, co-founder and CEO of Bito. “Amazon Nova Lite gives us the very fast, low-cost model we needed to power the free offering of our AI Code Review Agent—and attract new customers.”

Get started with Amazon Nova on the Amazon Bedrock console. Learn more about Amazon Nova Lite at the Amazon Nova product page.


About the authors

Eshan Bhatnagar is the Director of Product Management for Amazon AGI at Amazon Web Services.

Amar Goel is Co-Founder and CEO of Bito. A serial entrepreneur, Amar previously founded PubMatic (went public in 2020), and formerly worked at Microsoft, McKinsey, and was a software engineer at Netscape, the original browser company. Amar attended Harvard University. He is excited about using GenAI to power the next generation of how software gets built!

Read More

How Gardenia Technologies helps customers create ESG disclosure reports 75% faster using agentic generative AI on Amazon Bedrock


This post was co-written with Federico Thibaud, Neil Holloway, Fraser Price, Christian Dunn, and Frederica Schrager from Gardenia Technologies.

“What gets measured gets managed” has become a guiding principle for organizations worldwide as they begin their sustainability and environmental, social, and governance (ESG) journeys. Companies are establishing baselines to track their progress, supported by an expanding framework of reporting standards, some mandatory and some voluntary. However, ESG reporting has evolved into a significant operational burden. A recent survey shows that 55% of sustainability leaders cite excessive administrative work in report preparation, while 70% indicate that reporting demands inhibit their ability to execute strategic initiatives. This environment presents a clear opportunity for generative AI to automate routine reporting tasks, allowing organizations to redirect resources toward more impactful ESG programs.

Gardenia Technologies, a data analytics company, partnered with the AWS Prototyping and Cloud Engineering (PACE) team to develop Report GenAI, a fully automated ESG reporting solution powered by the latest generative AI models on Amazon Bedrock. This post dives deep into the technology behind an agentic search solution using tooling with Retrieval Augmented Generation (RAG) and text-to-SQL capabilities to help customers reduce ESG reporting time by up to 75%.

In this post, we demonstrate how AWS serverless technology, combined with agents in Amazon Bedrock, are used to build scalable and highly flexible agent-based document assistant applications.

Scoping the challenge: Growing ESG reporting requirements and complexity

Sustainability disclosures are now a standard part of corporate reporting, with 96% of the 250 largest companies reporting on their sustainability progress based on government and regulatory frameworks. To meet reporting mandates, organizations must overcome many data collection and process-based barriers. Data for a single report includes thousands of data points from a multitude of sources including official documentation, databases, unstructured document stores, utility bills, and emails. The EU Corporate Sustainability Reporting Directive (CSRD) framework, for example, comprises 1,200 individual data points that need to be collected across an enterprise. Even voluntary disclosures like the CDP, which includes approximately 150 questions, cover a wide range of topics related to climate risk and impact, water stewardship, land use, and energy consumption. Collecting this information across an organization is time consuming.

A secondary challenge is that many organizations with established ESG programs need to report to multiple disclosure frameworks, such as SASB, GRI, and TCFD, each using different reporting and disclosure standards. To complicate matters, reporting requirements are continually evolving, leaving organizations struggling just to keep up with the latest changes. Today, much of this work is highly manual and leaves sustainability teams spending more time on managing data collection and answering questionnaires rather than developing impactful sustainability strategies.

Solution overview: Automating undifferentiated heavy lifting with AI agents

Gardenia’s approach to strengthening ESG data collection for enterprises is Report GenAI, an agentic framework using generative AI models on Amazon Bedrock to automate large chunks of the ESG reporting process. Report GenAI pre-fills reports by drawing on existing databases, document stores, and web searches. The agent then works collaboratively with ESG professionals to review and fine-tune responses. This workflow has five steps to help automate ESG data collection and assist in curating responses: setup, batch-fill, review, edit, and repeat. Let’s explore each step in more detail.

  • Setup: The Report GenAI agent is configured and authorized to access an ESG and emissions database, client document stores (emails, previous reports, data sheets), and document searches over the public internet. Client data is stored within specified AWS Regions using encrypted Amazon Simple Storage Service (Amazon S3) buckets with VPC endpoints for secure access, while relational data is hosted in Amazon Relational Database Service (Amazon RDS) instances deployed within Gardenia’s virtual private cloud (VPC). This architecture helps make sure data residency requirements can be fulfilled, while maintaining strict access controls through private network connectivity. The agent also has access to the relevant ESG disclosure questionnaire including questions and expected response format (we refer to this as a report specification). The following figure is an example of the Report GenAI user interface at the agent configuration step. As shown in the figure, the user can choose which databases, documents, or other tools the agent will use to answer a given question.

The user can choose which databases, documents or other tools the agent will use to answer a given question

  • Batch-fill: The agent iterates through each question and data point to be disclosed, retrieving relevant data from the client document stores and public document searches. This information is processed to produce a response in the expected format depending on the disclosure report requirements.
  • Review: Each response includes cited sources and—if the response is quantitative—calculation methodology. This enables users to maintain a clear audit trail and verify the accuracy of batch-filled responses quickly.
  • Edit: While the agentic workflow is automated, our approach allows for a human-in-the-loop to review, validate, and iterate on batch-filled facts and figures. In the following figure, we show how users can chat with the AI assistant to request updates or manually refine responses. When the user is satisfied, the final answer is recorded. The agent will show references from which responses were sourced and allow the user to modify answers either directly or by providing an additional prompt.

The agent will show references from which responses were sourced and allow the user to modify answers either directly or by providing an additional prompt

  • Repeat: Users can batch-fill multiple reporting frameworks to simplify and expand their ESG disclosure scope while avoiding extra effort to manually complete multiple questionnaires. After a report has been completed, it can then be added to the client document store so future reports can draw on it for knowledge. Report GenAI also supports bring your own report, which allows users to develop their own reporting specification (question and response model), which can then be imported into the application, as shown in the following figure.

The user can submit their own list of questions and configure the agent to pre-fill all responses in a single batch

Now that you have a description of the Report GenAI workflow, let’s explore how the architecture is built.

Architecture deep-dive: A serverless generative AI agent

The Report GenAI architecture consists of six components, as illustrated in the following figure: a user interface (UI), the generative AI executor, the web search endpoint, a text-to-SQL tool, the RAG tool, and an embedding generation pipeline. The UI, generative AI executor, and embedding generation pipeline components help orchestrate the workflow. The remaining three components work together as tools the agent uses to generate responses:

  • Web search tool: Uses an internet search engine to retrieve content from public web pages.
  • Text-to-SQL tool: Generates and executes SQL queries to the company’s emissions database hosted by Gardenia Technologies. The tool uses natural language requests, such as “What were our Scope 2 emissions in 2024,” as input and returns the results from the emissions database.
  • Retrieval Augmented Generation (RAG) tool: Accesses information from the corporate document store (such as procedures, emails, and internal reports) and uses it as a knowledge base. This component acts as a retriever: it accepts a plain text query and returns the relevant text from the document store.

Architecture Diagram

Let’s take a look at each of the components.

1: Lightweight UI hosted on auto-scaled Amazon ECS Fargate

Users access Report GenAI by using the containerized Streamlit frontend. Streamlit offers an appealing UI for data and chat apps, allowing data scientists and ML engineers to build convincing user experiences with relatively limited effort. While not typically used for large-scale deployments, Streamlit proved to be a suitable choice for the initial iteration of Report GenAI.

The frontend is hosted on a load-balanced and auto-scaled Amazon Elastic Container Service (Amazon ECS) service with the Fargate launch type. This implementation of the frontend not only reduces the management overhead but also suits the expected intermittent usage pattern of Report GenAI, which is anticipated to be spiky, with high-usage periods around the times when new reports must be generated (typically quarterly or yearly) and lower usage outside these windows. User authentication and authorization are handled by Amazon Cognito.

2: Central agent executor

The executor is an agent that uses the reasoning capabilities of leading text-based foundation models (FMs) (for example, Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku) to break down user requests, gather information from document stores, and efficiently orchestrate tasks. The agent uses Reason and Act (ReAct), a prompt-based technique that enables large language models (LLMs) to generate both reasoning traces and task-specific actions in an interleaved manner. Reasoning traces help the model develop, track, and update action plans, while actions allow it to interface with a set of tools and information sources (also known as knowledge bases) that it can use to fulfill the task. The agent is prompted to think about an optimal sequence of actions to complete a given task with the tools at its disposal, observe the outcome, and iterate and improve until satisfied with the answer.

In combination, these tools provide the agent with capabilities to iteratively complete complex ESG reporting templates. The expected questions and response format for each questionnaire are captured by a report specification (ReportSpec), which uses Pydantic to enforce the desired output format for each reporting standard (for example, CDP or TCFD). This ReportSpec definition is inserted into the task prompt. The first iteration of Report GenAI used Claude 3.5 Sonnet on Amazon Bedrock. As more capable and more cost-effective LLMs become available on Amazon Bedrock (such as the recently released Amazon Nova models), the foundation models in Report GenAI can be swapped out to remain up to date with the latest models.
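To make the idea concrete, the following is a minimal sketch of what a Pydantic-based report specification could look like. The class and field names (ReportSpec, ReportQuestion, and so on) are hypothetical and are not taken from Gardenia’s implementation; they only illustrate how a questionnaire and its expected response formats can be enforced as a schema that is inserted into the task prompt.

from typing import List, Literal, Optional
from pydantic import BaseModel, Field

class ReportQuestion(BaseModel):
    """A single disclosure question and its expected response format."""
    question_id: str = Field(description="Identifier within the reporting framework")
    question_text: str = Field(description="The disclosure question to answer")
    response_type: Literal["narrative", "quantitative"] = "narrative"
    unit: Optional[str] = Field(default=None, description="Unit for quantitative answers, for example tCO2e")

class ReportSpec(BaseModel):
    """Captures the questions and response formats for one reporting standard."""
    framework: Literal["CDP", "TCFD", "CSRD"]
    questions: List[ReportQuestion]

# The JSON representation of the spec can be inserted into the agent's task prompt
spec = ReportSpec(
    framework="CDP",
    questions=[
        ReportQuestion(
            question_id="C6.1",
            question_text="What were your organization's gross global Scope 1 emissions?",
            response_type="quantitative",
            unit="tCO2e",
        )
    ],
)
print(spec.model_dump_json(indent=2))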

The agent executor is hosted on AWS Lambda and uses the open source LangChain framework to implement the ReAct orchestration logic and the needed integrations with memory, LLMs, tools, and knowledge bases. LangChain offers deep integration with AWS through the first-party langchain-aws module, which provides useful one-line wrappers to call tools using AWS Lambda, draw from a chat memory backed by Amazon DynamoDB, and call LLMs on Amazon Bedrock. LangChain also provides fine-grained visibility into each step of the ReAct decision-making process for decision transparency.

LangChain provides fine-grained visibility into each step of the ReAct decision making process to provide decision transparency
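As a rough illustration of this wiring (and not Gardenia’s actual code), the following sketch uses langchain-aws to call an Anthropic Claude model on Amazon Bedrock and hands it two placeholder tools through LangChain’s ReAct agent constructor. The tool bodies, prompt, and model ID are assumptions for demonstration only.

from langchain_aws import ChatBedrockConverse
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool
from langchain.agents import AgentExecutor, create_react_agent

@tool
def emissions_sql(question: str) -> str:
    """Answer analytical questions against the emissions database."""
    return "Scope 2 emissions in 2024: 1,234 tCO2e"  # placeholder result

@tool
def document_search(query: str) -> str:
    """Retrieve relevant passages from the corporate document store."""
    return "Excerpt from the 2023 sustainability report ..."  # placeholder result

llm = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")

# Standard ReAct prompt: the agent interleaves Thought / Action / Observation steps
prompt = PromptTemplate.from_template(
    """Answer the following question as well as you can. You have access to these tools:

{tools}

Use the following format:

Question: the input question
Thought: think about what to do next
Action: the action to take, one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (Thought/Action/Action Input/Observation can repeat)
Thought: I now know the final answer
Final Answer: the final answer to the original question

Question: {input}
Thought:{agent_scratchpad}"""
)

tools = [emissions_sql, document_search]
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True)
executor.invoke({"input": "What were our Scope 2 emissions in 2024?"})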

3: Web-search tool

The web search tool is hosted on Lambda and calls an internet search engine through an API. The agent executor retrieves the information returned from the search engine to formulate a response. Web searches can be used in combination with the RAG tool to retrieve public context needed to formulate responses for certain generic questions, such as providing a short description of the reporting company or entity.
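For context, a Lambda-hosted search tool of this kind can be quite small. The following sketch shows one possible shape; the search provider URL, response format, and API key handling are placeholders rather than details of Gardenia’s implementation.

import json
import os
import urllib.parse
import urllib.request

SEARCH_API_URL = "https://api.example-search.com/v1/search"  # placeholder search provider

def lambda_handler(event, context):
    """Call an internet search API and return snippets for the agent executor."""
    query = event["query"]
    request = urllib.request.Request(
        f"{SEARCH_API_URL}?q={urllib.parse.quote(query)}",
        headers={"Authorization": f"Bearer {os.environ['SEARCH_API_KEY']}"},
    )
    with urllib.request.urlopen(request) as response:
        results = json.loads(response.read())
    # Pass back only the text snippets the agent needs to formulate an answer
    snippets = [item["snippet"] for item in results.get("results", [])[:5]]
    return {"snippets": snippets}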

4: Text-to-SQL tool

A large portion of ESG reporting requirements is analytical in nature and requires processing of large amounts of numerical or tabular data. For example, a reporting standard might ask for total emissions in a certain year or quarter. LLMs are ill-equipped for questions of this nature. The Lambda-hosted text-to-SQL tool provides the agent with the required analytical capabilities. The tool uses a separate LLM to generate a valid SQL query given a natural language question along with the schema of an emissions database hosted on Gardenia. The generated query is then executed against this database and the results are passed back to the agent executor. SQL linters and error-correction loops are used for added robustness.
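A minimal version of such a tool might look like the following sketch, which asks a model on Amazon Bedrock to produce a query for a known schema and then executes it. The schema, the SQLite stand-in for the emissions database, and the model choice are illustrative assumptions; the production tool also applies SQL linting and error-correction loops.

import json
import sqlite3  # local stand-in for the emissions database hosted by Gardenia

import boto3

bedrock = boto3.client("bedrock-runtime")

SCHEMA = "CREATE TABLE emissions (year INTEGER, scope INTEGER, tco2e REAL);"

def text_to_sql(question: str) -> str:
    """Generate a SQL query for a natural language question using an LLM."""
    prompt = (
        f"Schema:\n{SCHEMA}\n\n"
        f"Write a single SQLite query that answers: {question}\n"
        "Return only the SQL."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-haiku-20241022-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"].strip()

def run_tool(question: str) -> dict:
    """Generate a query, execute it, and return the results to the agent executor."""
    sql = text_to_sql(question)
    with sqlite3.connect("emissions.db") as conn:
        rows = conn.execute(sql).fetchall()
    return {"sql": sql, "rows": rows}

print(run_tool("What were our Scope 2 emissions in 2024?"))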

5: Retrieval Augmented Generation (RAG) tool

Much of the information required to complete ESG reporting resides in internal, unstructured document stores and can consist of PDF or Word documents, Excel spreadsheets, and even emails. Given the size of these document stores, a common approach is to use knowledge bases with vector embeddings for semantic search. The RAG tool enables the agent executor to retrieve only the parts of the document store relevant to answering a question. The RAG tool is hosted on Lambda and uses an in-memory Faiss index as a vector store. The index is persisted on Amazon S3 and loaded on demand whenever required. This workflow is advantageous for the given workload because of the intermittent usage of Report GenAI. The RAG tool accepts a plain text query from the agent executor as input, uses an embedding model on Amazon Bedrock to embed the query, and performs a vector search against the vector database. The retrieved text is returned to the agent executor.
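The following sketch shows the general shape of such a retriever: download the persisted FAISS index from Amazon S3, embed the incoming query with an embedding model on Amazon Bedrock, and return the nearest chunks. The bucket and object names, the chunk store format, and the choice of Amazon Titan Text Embeddings V2 are assumptions for illustration.

import json

import boto3
import faiss
import numpy as np

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")

BUCKET, INDEX_KEY, CHUNKS_KEY = "my-rag-bucket", "index.faiss", "chunks.json"  # placeholders

def embed(text: str) -> np.ndarray:
    """Embed a query with Amazon Titan Text Embeddings V2 on Amazon Bedrock."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    vector = json.loads(response["body"].read())["embedding"]
    return np.array([vector], dtype="float32")

def retrieve(query: str, k: int = 4) -> list:
    """Load the persisted index on demand and return the k most relevant chunks."""
    s3.download_file(BUCKET, INDEX_KEY, "/tmp/index.faiss")
    s3.download_file(BUCKET, CHUNKS_KEY, "/tmp/chunks.json")
    index = faiss.read_index("/tmp/index.faiss")
    with open("/tmp/chunks.json") as f:
        chunks = json.load(f)
    _, ids = index.search(embed(query), k)
    return [chunks[i] for i in ids[0]]

print(retrieve("What were the company's Scope 1 emissions last year?"))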

6: Asynchronous embedding generation pipeline

To make text searchable, it must be indexed in a vector database. AWS Step Functions provides a straightforward orchestration framework to manage this process: extracting plain text from the various document types, chunking it into manageable pieces, embedding the text, and then loading the embeddings into a vector database. Amazon Textract can be used as the first step for extracting text from visual-heavy documents like presentations or PDFs. An embedding model such as Amazon Titan Text Embeddings can then be used to embed the text and store it in a vector database such as LanceDB. Note that Amazon Bedrock Knowledge Bases provides an end-to-end retrieval service automating most of the steps just described. However, for this application, Gardenia Technologies opted for a fully flexible implementation to retain full control over each design choice of the RAG pipeline (text extraction approach, embedding model choice, and vector database choice) at the expense of higher management and development overhead.
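As a simplified illustration of the chunk-and-embed step of such a pipeline (independent of the actual Step Functions state machine), the following sketch splits extracted text into overlapping chunks and embeds each one with Amazon Titan Text Embeddings. The chunk size, overlap, and in-memory record format are arbitrary choices here, and loading the vectors into the vector database is omitted.

import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list:
    """Split extracted plain text into overlapping chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def embed_chunks(text: str) -> list:
    """Embed each chunk with Amazon Titan Text Embeddings on Amazon Bedrock."""
    records = []
    for chunk in chunk_text(text):
        response = bedrock.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",
            body=json.dumps({"inputText": chunk}),
        )
        embedding = json.loads(response["body"].read())["embedding"]
        records.append({"text": chunk, "vector": embedding})
    return records  # these records would then be loaded into the vector database

records = embed_chunks("Plain text extracted from a sustainability report ...")
print(len(records), len(records[0]["vector"]))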

Evaluating agent performance

Making sure of accuracy and reliability in ESG reporting is paramount, given the regulatory and business implications of these disclosures. Report GenAI implements a sophisticated dual-layer evaluation framework that combines human expertise with advanced AI validation capabilities.

Validation is done both at a high level (such as evaluating full question responses) and at the sub-component level (such as the RAG, SQL search, and agent trajectory modules). Each of these has a separate evaluation set in addition to specific metrics of interest.

Human expert validation

The solution’s human-in-the-loop approach allows ESG experts to review and validate the AI-generated responses. This expert oversight serves as the primary quality control mechanism, making sure that generated reports align with both regulatory requirements and organization-specific context. The interactive chat interface enables experts to:

  • Verify factual accuracy of automated responses
  • Validate calculation methodologies
  • Verify proper context interpretation
  • Confirm regulatory compliance
  • Flag potential discrepancies or areas requiring additional review

A key feature in this process is the AI reasoning module, which displays the agent’s decision-making process, providing transparency into not only what answers were generated but how the agent arrived at those conclusions.

The user can review the steps in the AI agent’s reasoning to validate answers

These expert reviews provide valuable training data that can be used to enhance system performance through refinements to RAG implementations, agent prompts, or underlying language models.

AI-powered quality assessment

Complementing human oversight, Report GenAI uses state-of-the-art LLMs on Amazon Bedrock as LLM judges. These models are prompted to evaluate:

  • Response accuracy relative to source documentation
  • Completeness of answers against question requirements
  • Consistency with provided context
  • Alignment with reporting framework guidelines
  • Mathematical accuracy of calculations

The LLM judge operates through the following steps (a minimal sketch of such a judge call appears after the list):

  • Analyzing the original question and context
  • Reviewing the generated response and its supporting evidence
  • Comparing the response against retrieved data from structured and unstructured sources
  • Providing a confidence score and detailed assessment of the response quality
  • Flagging potential issues or areas requiring human review
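A minimal sketch of such an LLM-as-judge call on Amazon Bedrock might look like the following. The rubric, scoring format, and model choice are illustrative assumptions rather than Report GenAI’s actual evaluation prompt.

import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_response(question: str, answer: str, evidence: str) -> dict:
    """Ask an LLM judge to score a generated answer against its supporting evidence."""
    rubric = (
        "You are evaluating an ESG report answer. Score accuracy, completeness, and "
        "consistency with the evidence from 1 to 5, then list any issues. Respond with "
        'JSON only: {"accuracy": int, "completeness": int, "consistency": int, "issues": [str]}'
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        system=[{"text": rubric}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Question: {question}\nAnswer: {answer}\nEvidence: {evidence}"}],
        }],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

score = judge_response(
    "What were gross Scope 1 emissions in 2024?",
    "Gross Scope 1 emissions were 12,450 tCO2e in 2024.",
    "Emissions database extract: scope 1, year 2024, 12,450 tCO2e.",
)
print(score)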

This dual-validation approach creates a robust quality assurance framework that combines the pattern recognition capabilities of AI with human domain expertise. The system continuously improves through feedback loops, where human corrections and validations help refine the AI’s understanding and response generation capabilities.

How Omni Helicopters International cut its reporting time by 75%

Omni Helicopters International (OHI) cut their CDP reporting time by 75% using Gardenia’s Report GenAI solution. In previous years, OHI’s CDP reporting required one month of dedicated effort from their sustainability team. Using Report GenAI, OHI tracked their GHG inventory and relevant KPIs in real time and then prepared their 2024 CDP submission in just one week. Read the full story in Preparing Annual CDP Reports 75% Faster.

“In previous years we needed one month to complete the report, this year it took just one week,” said Renato Souza, Executive Manager QSEU at OTA. “The ‘Ask the Agent’ feature made it easy to draft our own answers. The tool was a great support and made things much easier compared to previous years.”

Conclusion

In this post, we stepped through how AWS and Gardenia collaborated to build Report GenAI, an automated ESG reporting solution that relieves ESG experts of the undifferentiated heavy lifting of data gathering and analysis associated with a growing ESG reporting burden. This frees up time for more impactful, strategic sustainability initiatives. Report GenAI is available on the AWS Marketplace today. To dive deeper and start developing your own generative AI app to fit your use case, explore this workshop on building an Agentic LLM assistant on AWS.


About the Authors

Federico Thibaud is the CTO and Co-Founder of Gardenia Technologies, where he leads the data and engineering teams, working on everything from data acquisition and transformation to algorithm design and product development. Before co-founding Gardenia, Federico worked at the intersection of finance and tech — building a trade finance platform as lead developer and developing quantitative strategies at a hedge fund.

Neil Holloway is Head of Data Science at Gardenia Technologies, where he is focused on leveraging AI and machine learning to build and enhance software products. Neil holds a master’s degree in Theoretical Physics, where he designed and built programs to simulate high energy collisions in particle physics.

Fraser Price is a GenAI-focused Software Engineer at Gardenia Technologies in London, where he focuses on researching, prototyping and developing novel approaches to automation in the carbon accounting space using GenAI and machine learning. He received his MEng in Computing: AI from Imperial College London.

Christian Dunn is a Software Engineer based in London building ETL pipelines, web-apps, and other business solutions at Gardenia Technologies.

Frederica Schrager is a Marketing Analyst at Gardenia Technologies.

Karsten Schroer is a Senior ML Prototyping Architect at AWS. He supports customers in leveraging data and technology to drive sustainability of their IT infrastructure and build cloud-native data-driven solutions that enable sustainable operations in their respective verticals. Karsten joined AWS following his PhD studies in applied machine learning & operations management. He is truly passionate about technology-enabled solutions to societal challenges and loves to dive deep into the methods and application architectures that underlie these solutions.

Mohamed Ali Jamaoui is a Senior ML Prototyping Architect with over 10 years of experience in production machine learning. He enjoys solving business problems with machine learning and software engineering, and helping customers extract business value with ML. As part of AWS EMEA Prototyping and Cloud Engineering, he helps customers build business solutions that leverage innovations in MLOPs, NLP, CV and LLMs.

Marco Masciola is a Senior Sustainability Scientist at AWS. In his role, Marco leads the development of IT tools and technical products to support AWS’s sustainability mission. He’s held various roles in the renewable energy industry, and leans on this experience to build tooling to support sustainable data center operations.

Hin Yee Liu is a Senior Prototyping Engagement Manager at Amazon Web Services. She helps AWS customers to bring their big ideas to life and accelerate the adoption of emerging technologies. Hin Yee works closely with customer stakeholders to identify, shape and deliver impactful use cases leveraging Generative AI, AI/ML, Big Data, and Serverless technologies using agile methodologies.

Read More

NVIDIA Nemotron Super 49B and Nano 8B reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

NVIDIA Nemotron Super 49B and Nano 8B reasoning models now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart

This post is co-written with Eliuth Triana Isaza, Abhishek Sawarkar, and Abdullahi Olaoye from NVIDIA.

Today, we are excited to announce that the Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 are available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s newest reasoning models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on Amazon Bedrock Marketplace and SageMaker JumpStart.

About NVIDIA NIMs on AWS

NVIDIA NIM inference microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker AI, to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise, available in the AWS Marketplace, NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models from open source community models to NVIDIA AI Foundation and custom models. NIM microservices are deployed with a single command for easy integration into generative AI applications using industry-standard APIs and just a few lines of code, or with a few actions in the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM ensures generative AI applications can be deployed anywhere.

Overview of NVIDIA Nemotron models

In this section, we provide an overview of the NVIDIA Nemotron Super and Nano NIM microservices discussed in this post.

Llama 3.3 Nemotron Super 49B V1

Llama-3.3-Nemotron-Super-49B-v1 is an LLM derived from Meta Llama-3.3-70B-Instruct (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and task execution, such as Retrieval Augmented Generation (RAG) and tool calling. The model supports a context length of 128K tokens. Using a novel Neural Architecture Search (NAS) approach, NVIDIA greatly reduced the model’s memory footprint and increased its efficiency, supporting larger workloads and allowing the model to fit onto a single NVIDIA Hopper architecture GPU (H200) at high workloads.

Llama 3.1 Nemotron Nano 8B V1

Llama-3.1-Nemotron-Nano-8B-v1 is an LLM which is a derivative of Meta Llama-3.1-8B-Instruct (the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and task execution, such as RAG and tool calling. The model supports a context length of 128K tokens. It is created from Llama 3.1 8B Instruct and offers improvements in model accuracy. The model fits on a single H100 or A100 GPU (P5 or P4 instances) and can be used locally.

About Amazon Bedrock Marketplace

Amazon Bedrock Marketplace plays a pivotal role in democratizing access to advanced AI capabilities through several key advantages:

  • Comprehensive model selection – Amazon Bedrock Marketplace offers an exceptional range of models, from proprietary to publicly available options, allowing organizations to find the perfect fit for their specific use cases.
  • Unified and secure experience – By providing a single access point for all models through the Amazon Bedrock APIs, Bedrock Marketplace significantly simplifies the integration process. Organizations can use these models securely, and for models that are compatible with the Amazon Bedrock Converse API, you can use the robust toolkit of Amazon Bedrock, including Amazon Bedrock Agents, Amazon Bedrock Knowledge Bases, Amazon Bedrock Guardrails, and Amazon Bedrock Flows.
  • Scalable infrastructure – Amazon Bedrock Marketplace offers configurable scalability through managed endpoints, allowing organizations to select their desired number of instances, choose appropriate instance types, define custom auto scaling policies that dynamically adjust to workload demands, and optimize costs while maintaining performance.

Deploy NVIDIA Llama Nemotron models in Amazon Bedrock Marketplace

Amazon Bedrock Marketplace gives you access to over 100 popular, emerging, and specialized foundation models (FMs) through Amazon Bedrock. To access the Nemotron reasoning models in Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, in the navigation pane under Foundation models, choose Model catalog.
    You can also use the InvokeModel API to invoke the model. The InvokeModel API doesn’t support Converse APIs and other Amazon Bedrock tooling.
  2. On the Model catalog page, filter for NVIDIA as a provider and choose the Llama 3.3 Nemotron Super 49B V1 model.

The Model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

  1. To begin using the Llama 3.3 Nemotron Super 49B V1 model, choose Subscribe to subscribe to the marketplace offer.
  2. On the model detail page, choose Deploy.

You will be prompted to configure the deployment details for the model. The model ID will be pre-populated.

  1. For Endpoint name, enter an endpoint name (between 1–50 alphanumeric characters).
  2. For Number of instances, enter a number of instances (between 1–100).
  3. For Instance type, choose your instance type. For optimal performance with Nemotron Super, a GPU-based instance type like ml.g6e.12xlarge is recommended.
    Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you should review these settings to align with your organization’s security and compliance requirements.
  4. Choose Deploy to begin using the model.

When the deployment is complete, you can test its capabilities directly in the Amazon Bedrock playground. This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results. A similar process can be followed for deploying the Llama 3.1 Nemotron Nano 8B V1 model.

Run inference with the deployed Nemotron endpoint

The following code example demonstrates how to perform inference against a deployed model through Amazon Bedrock using the InvokeModel API. The script initializes the bedrock_runtime client, configures inference parameters, and sends a request to generate text based on a user prompt. With the Nemotron Super and Nano models, we can use a soft switch to toggle reasoning on and off: in the system message content field, set detailed thinking on or detailed thinking off.

Request

import boto3
import json

# Initialize Bedrock client
bedrock_runtime = boto3.client("bedrock-runtime")

# Configuration
MODEL_ID = ""  # Replace with Bedrock Marketplace endpoint arn

def invoke_model(prompt, max_tokens=1000, temperature=0.6, top_p=0.9):
    """
    Simple Bedrock model invocation
    """
    # Prepare model input
    body = {
        "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
        "messages": [
            {   
                "role": "system",
                "content": "detailed thinking on" # to turn thinking off, simply toggle this
            },
            {
                "role": "user",
                "content": prompt  # Use the prompt parameter
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p
    }
    
    # Invoke model
    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(body)
    )

    # Parse and return the response payload
    model_output = json.loads(response['body'].read())
    return model_output

# Example usage
if __name__ == "__main__":
    prompt = "Explain how a transformer neural network works."
    result = invoke_model(prompt)
    print(json.dumps(result, indent=2))

Response body

{
  "id": "chat-43c35c87d5a646c9b3dcaba461eefeaa",
  "object": "chat.completion",
  "created": 1749566119,
  "model": "nvidia/llama-3.3-nemotron-super-49b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>nOkay, I need to explain how a transformer neural network works. Let me start by recalling what I know about transformers. They were introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017, right? Before transformers, sequence models like RNNs and LSTMs were common, but they had issues with parallelization because of their sequential processing. Transformers changed that by using self-attention mechanisms, which allow parallel processing of the entire input sequence at once.nnSo, the main components of a transformer are the encoder and the decoder. Each encoder layer has two main sub-layers: multi-head self-attention and a fully connected feed-forward network. The decoder has these two plus an additional multi-head attention layer to interact with the encoder's output.nnLet me break down the encoder first. The self-attention mechanism is key here. Self-attention allows the model to attend to all positions in the input sequence simultaneously and weigh their importance. The multi-head part means that instead of just one attention mechanism, you have multiple (like 8 or 12) running in parallel, each with a different linear projection. This helps capture different types of relationships between tokens.nnEach head computes attention by using Query (Q), Key (K), and Value (V) matrices. These are derived from the input by linear transformations. The attention weights are computed as the dot product of Q and K transposed, divided by the square root of the key's dimensionality (for stability). Then, these weights are used to compute a weighted sum of the Value vectors. This gives the output for one attention head. Since there are multiple heads, their outputs are concatenated and then linearly transformed.nnAfter the multi-head attention, there's a feed-forward network (FFN) applied individually to each position. It's a simple neural network with a ReLU activation in the middle. Both the attention and FFN layers use residual connections followed by layer normalization.nnThe decoder is similar but has an extra attention layer that attends to the encoder's output. Also, in the decoder's self-attention, they use masking to prevent looking ahead in the sequence, which is important for tasks like translation where you generate one token at a time.nnPositional encodings are added to the input embeddings because transformers don't have recurrence or convolution, so they need a way to preserve sequence order. The encodings are fixed and based on sine and cosine functions, which allows the model to handle sequences of varying lengths.nnI should also mention the overall architecture: the input goes through an embedding layer plus positional encoding, then through a stack of encoder layers. The output from the encoder goes into the decoder's stack of layers, and the final output goes through a linear layer and softmax for tasks like machine translation.nnWait, maybe I should explain each part step by step. Start with the overall structure, then dive into encoder and decoder layers, then the self-attention mechanism, positional encodings, and how it's different from RNNs.nnPotential pitfalls: Confusing encoder and decoder layers, not explaining why multi-head attention is useful, or omitting the role of layer normalization and residual connections. Also, ensuring that the explanation of attention weights calculation is clear.nnLet me check if I missed anything. Oh, the decoder's structure: after self-attention, there's encoder-decoder attention, then FFN. 
And the output linear layer. Also, the training process involves teacher forcing, but maybe that's beyond the basic explanation.nnI need to make sure the explanation is clear for someone who knows basics of neural networks but not transformers. Use analogies if possible, like how attention is like selectively focusing on different parts of the input when processing each element.nnAnother point: the efficiency. Since attention is O(n^2) with sequence length n, but it's parallelizable, which is better than O(n) for RNNs which can't be parallelized. But for very long sequences, transformers can be computationally expensive.nnAlso, the difference between BERT-style transformers (encoder-only) and sequence-to-sequence models (both encoder and decoder). But the question is about transformer neural networks in general, so covering the basic encoder-decoder structure is sufficient.nnOkay, putting it all together in a structured way: introduction, architecture overview, encoder details, decoder details, key components (self-attention, positional encodings), advantages, and maybe an example application.n</think>nnA transformer neural network is a revolutionary architecture introduced in 2017 for sequence-to-sequence tasks, replacing recurrent neural networks (RNNs) with self-attention mechanisms for parallel processing. Here's a structured explanation:nn### **1. Architecture Overview**n- **Encoder**: Takes in a sequence (e.g., text) and outputs a continuous representation.n- **Decoder**: Generates an output sequence (e.g., translated text) one token at a time, conditioned on the"
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "total_tokens": 1029,
    "completion_tokens": 1000
  },
  "prompt_logprobs": null
}

Amazon SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art foundation models for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks. You can now discover and deploy Llama 3.3 Nemotron Super 49B V1 and Llama-3.1-Nemotron-Nano-8B-v1 in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, so you can derive model performance and MLOps controls with Amazon SageMaker AI features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your VPC, helping to support data security for enterprise security needs.

Prerequisites

Before getting started with deployment, make sure your AWS Identity and Access Management (IAM) service role for Amazon SageMaker has the AmazonSageMakerFullAccess permission policy attached. To deploy the NVIDIA Llama Nemotron models successfully, confirm one of the following:

  • Make sure your IAM role has the following permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • If your account is already subscribed to the model, you can skip to the Deploy section below. Otherwise, please start by subscribing to the model package and then move to the Deploy section.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Open the model package listing page and choose Llama 3.3 Nemotron Super 49B V1 or Llama 3.1 Nemotron Nano 8B V1.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue with the configuration and then choose an AWS Region where you have the service quota for the desired instance type.

A product ARN will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

(Option-1) Deploy NVIDIA Llama Nemotron Super and Nano models on SageMaker JumpStart

For those new to SageMaker JumpStart, you can use SageMaker Studio to access models in SageMaker JumpStart. The Llama 3.3 Nemotron Super 49B V1 and Llama 3.1 Nemotron Nano 8B V1 models are available in SageMaker JumpStart. Deployment starts when you choose the Deploy option. You may be prompted to subscribe to the model on AWS Marketplace; if you are already subscribed, you can move forward by choosing the second Deploy button. After deployment finishes, you will see that an endpoint is created. You can test the endpoint by passing a sample inference request payload or by selecting the testing option using the SDK.

(Option-2) Deploy NVIDIA Llama Nemotron using the SageMaker SDK

In this section we will walk through deploying the Llama 3.3 Nemotron Super 49B V1 model through the SageMaker SDK. A similar process can be followed for deploying the Llama 3.1 Nemotron Nano 8B V1 model as well.

Define the SageMaker model using the Model Package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it in the model_package_arn in the following code:

sm_model_name  "nim-llama-3-3-nemotron-super-49b-v1"
create_model_response  smcreate_model(
    ModelNamesm_model_name,
    PrimaryContainer{
        'ModelPackageName': model_package_arn
    },
    ExecutionRoleArnrole,
    EnableNetworkIsolation
)
print("Model Arn: "  create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create the endpoint configuration by specifying an instance type, in this case ml.g6e.12xlarge. Make sure you have the account-level service quota to use one or more ml.g6e.12xlarge instances for endpoint usage. NVIDIA also provides a list of instance types that support deployment; refer to the AWS Marketplace listing for both of these models to see the supported instance types. To request a service quota increase, see AWS service quotas.

endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'VariantName': 'AllTraffic',
            'ModelName': sm_model_name,
            'InitialInstanceCount': 1,
            'InstanceType': 'ml.g6e.12xlarge',
            'InferenceAmiVersion': 'al2-ami-sagemaker-inference-gpu-2',
            'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'},
            'ModelDataDownloadTimeoutInSeconds': 3600,
            'ContainerStartupHealthCheckTimeoutInSeconds': 3600,
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the previous endpoint configuration, we create a new SageMaker endpoint and wait in a loop, as shown below, until the deployment finishes. This typically takes around 5–10 minutes. The status changes to InService once the deployment is successful.

endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the endpoint

Let’s now track the status of the endpoint until the deployment completes.

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)
    
print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Run Inference with Llama 3.3 Nemotron Super 49B V1

Once the endpoint is in service, we can use a sample text to run an inference request. NIM on SageMaker supports the OpenAI API inference request format. For an explanation of the supported parameters, see Creates a model in the NVIDIA documentation.

Real-Time inference example

The following code examples illustrate how to perform real-time inference using the Llama 3.3 Nemotron Super 49B V1 model in non-reasoning and reasoning mode.

Non-reasoning mode

Perform real-time inference in non-reasoning mode:

payload_model  "nvidia/llama-3.3-nemotron-super-49b-v1"
messages  [
    {
    "role": "system",
    "content": "detailed thinking off"
    },
    {
    "role":"user",
    "content":"Explain how a transformer neural network works."
    }
    ]
    
payload  {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 3000
}

response  clientinvoke_endpoint(
    EndpointNameendpoint_name, ContentType"application/json", Bodyjsondumps(payload)
)

output  jsonloads(response["Body"]read()decode("utf8"))
print(jsondumps(output, indent2))

Reasoning mode

Perform real-time inference in reasoning mode:

payload_model  "nvidia/llama-3.3-nemotron-super-49b-v1"
messages  [
    {
    "role": "system",
    "content": "detailed thinking on"
    },
    {
    "role":"user",
    "content":"Explain how a transformer neural network works."
    }
    ]
payload  {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 3000
}

response  clientinvoke_endpoint(
    EndpointNameendpoint_name, ContentType"application/json", Bodyjsondumps(payload)
)

output  jsonloads(response["Body"]read()decode("utf8"))
print(jsondumps(output, indent2))

Streaming inference

NIM on SageMaker also supports streaming inference. You can enable it by setting stream to True in the payload and using the invoke_endpoint_with_response_stream method.

Streaming inference:

payload_model = "nvidia/llama-3.3-nemotron-super-49b-v1"
messages = [
    {   
      "role": "system",
      "content": "detailed thinking on"# this can be toggled off to disable reasoning
    },
    {
      "role":"user",
      "content":"Explain how a transformer neural network works."
    }
  ]

payload = {
  "model": payload_model,
  "messages": messages,
  "max_tokens": 3000,
  "stream": True
}

response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

We can use some post-processing code for the streaming output that reads the byte-chunks coming from the endpoint, pieces them into full JSON messages, extracts any new text the model produced, and immediately prints that text to output.

event_stream = response['Body']
accumulated_data = ""
start_marker = 'data:'
end_marker = '"finish_reason":null}]}'

for event in event_stream:
    try:
        payload = event.get('PayloadPart', {}).get('Bytes', b'')
        if payload:
            data_str = payload.decode('utf-8')

            accumulated_data += data_str

            # Process accumulated data when a complete response is detected
            while start_marker in accumulated_data and end_marker in accumulated_data:
                start_idx = accumulated_data.find(start_marker)
                end_idx = accumulated_data.find(end_marker) + len(end_marker)
                full_response = accumulated_data[start_idx + len(start_marker):end_idx]
                accumulated_data = accumulated_data[end_idx:]

                try:
                    data = json.loads(full_response)
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', "")
                    if content:
                        print(content, end='', flush=True)
                except json.JSONDecodeError:
                    continue
    except Exception as e:
        print(f"nError processing event: {e}", flush=True)
        continue

Clean up

To avoid unwanted charges, complete the steps in this section to clean up your resources.

Delete the Amazon Bedrock Marketplace deployment

If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

  1. On the Amazon Bedrock console, in the navigation pane in the Foundation models section, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Select the endpoint, and on the Actions menu, choose Delete.
  4. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  5. Choose Delete to delete the endpoint.
  6. In the Delete endpoint confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Delete the SageMaker JumpStart Endpoint

The SageMaker JumpStart model you deployed will incur costs if you leave it running. Use the following code to delete the endpoint if you want to stop incurring charges. For more details, see Delete Endpoints and Resources.

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

NVIDIA’s Llama Nemotron models deliver optimized AI reasoning capabilities and are now available on AWS through Amazon Bedrock Marketplace and Amazon SageMaker JumpStart. The Llama 3.3 Nemotron Super 49B V1, derived from Meta’s 70B model, uses Neural Architecture Search (NAS) to achieve a reduced 49B parameter count while maintaining high accuracy, enabling deployment on a single H200 GPU despite its sophisticated capabilities. Meanwhile, the compact Llama 3.1 Nemotron Nano 8B V1 fits on a single H100 or A100 GPU (P5 or P4 instances) while improving on Meta’s reference model accuracy, making it ideal for efficiency-conscious applications. Both models support extensive 128K token context windows and are post-trained for enhanced reasoning, RAG capabilities, and tool calling, offering organizations flexible options to balance performance and computational requirements for enterprise AI applications.

With this launch, organizations can now leverage the advanced reasoning capabilities of these models while benefiting from the scalable infrastructure of AWS. Through either the intuitive UI or just a few lines of code, you can quickly deploy these powerful language models to transform your AI applications with minimal effort. These complementary platforms provide straightforward access to NVIDIA’s robust technologies, allowing teams to immediately begin exploring and implementing sophisticated reasoning capabilities in their enterprise solutions.


About the authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Varun Morishetty is a Software Engineer with Amazon SageMaker JumpStart and Bedrock Marketplace. Varun received his Bachelor’s degree in Computer Science from Northeastern University. In his free time, he enjoys cooking, baking and exploring New York City.

Brian Kreitzer is a Partner Solutions Architect at Amazon Web Services (AWS). He works with partners to define business requirements, provide architectural guidance, and design solutions for the Amazon Marketplace.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Read More