Learning from other domains to advance AI evaluation and testing

As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?  

Recent research and reports from Microsoft, the UK AI Security Institute, The New York Times, and MIT Technology Review have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report (2025) and the Singapore Consensus (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust. 

Today, we’re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.

We’re also sharing written case studies from experts, along with top-level lessons we’re applying to AI. At the close of the podcast series, we’ll offer Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation. 

Lessons from eight case studies 

Our research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI, in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past. 

While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.  

Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks. These frameworks must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices, often reflecting the “DNA” of the historical moment in which they’re made, as cybersecurity expert Stewart Baker described it, carry lasting weight because they are difficult to scale back or undo later. 

Strict, pre-deployment testing regimes—such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals—offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.  

In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment—such as cybersecurity and bank stress testing—rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.  

Moreover, in pharmaceuticals, where interdependencies are at play and there is emphasis on pre-deployment testing, experts highlighted a potential trade-off with post-market monitoring of downstream risks and efficacy evaluation. 

These variations in approaches across domains—stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors—also inform takeaways for AI.

Applying risk evaluation and governance lessons to AI 

While no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer interesting insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.  

Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.  

These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including: 

  1. Rigor in defining what is being examined and why it matters. This requires detailed specification of what is being measured and understanding how the deployment context may affect outcomes.
  2. Standardization of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency. 
  3. Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results—while remaining aware of their limitations. 

Toward stronger foundations for AI testing 

Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability—and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.  

Taking lessons from other general-purpose technologies, this foundational work must also be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance for system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop on evaluations of AI models and systems could not only accelerate progress on methodological challenges but also bring focus to which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at what points along the AI development and deployment lifecycle.

Acknowledgements 

We would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.  

Case studies 

Civil aviation: Testing in Aircraft Design and Manufacturing, by Paul Alp 

Cybersecurity: Cybersecurity Standards and Testing—Lessons for AI Safety and Security, by Stewart Baker 

Financial services (bank stress testing): The Evolving Use of Bank Stress Tests, by Kathryn Judge 

Genome editing: Governance of Genome Editing in Human Therapeutics and Agricultural Applications, by Alta Charo and Andy Greenfield 

Medical devices: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance, by Mateo Aboy and Timo Minssen 

Nanoscience: The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation, by Jennifer Dionne 

Nuclear energy: Testing in the Nuclear Industry, by Pablo Cantero and Gerónimo Poletto Antonacci 

Pharmaceuticals: The History and Evolution of Testing in Pharmaceutical Regulation, by Daniel Benamouzig and Daniel Carpenter


Read More

Fault Tolerant Llama: training with 2000 synthetic failures every ~15 seconds and no checkpoints on Crusoe L40S

Collaborators: Less Wright, Howard Huang, Chien-Chin Huang; Crusoe: Martin Cala, Ethan Petersen

tl;dr: we used torchft and torchtitan to train a model in a real-world environment with extreme synthetic failure rates to prove reliability and correctness of fault tolerant training

Training loss across 1200 failures with no checkpoints. 

NOTE: Each small spike is a non-participating worker recovering which affects the metrics but not the model

Introduction

We want to demonstrate torchft in worst case scenarios by running a training job with the most extreme failure rates possible.

Most LLM pre-training shards the model with FSDP. torchft supports sharded models through HSDP2, which combines a sharded model with torchft’s fault tolerant DDP all-reduce. We’ve integrated torchft into torchtitan so you can use fault tolerance out of the box. torchft+titan also supports other sharding/parallelism schemes within each replica group, such as tensor parallelism (TP), pipeline parallelism (PP), and more.

Here’s the structure of a training job with torchft:

The structure of the training job. torchft’s fault tolerant DDP implementation is used across the replica groups to synchronize the gradients. Standard FSDP2 and other parallelisms are used within each replica group.

torchft uses a global Lighthouse server and per replica group Managers to do the real time coordination of workers. The Lighthouse knows the state of all workers and which ones are healthy via heartbeats.

torchft implements a few different algorithms for fault tolerance. The two primary ones are:

  • Fault Tolerant HSDP: An extension of FSDPv2 that uses a fault tolerant all-reduce. This exactly emulates standard HSDP training with per step all_reduce of the gradients and per step fault tolerance. This works best for large scale training with fast backend networks such as infiniband.
  • LocalSGD/DiLoCo: A fault tolerant implementation of semi-sync training. These algorithms minimize communication overhead by synchronizing at specified intervals instead of every step like HSDP. They are often used in communication-limited training scenarios, such as over ethernet/TCP or across geographically separate locations (federated learning or multi-datacenter training); a minimal sketch of the idea follows this list.
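
The following is an illustrative sketch of the periodic-synchronization idea behind LocalSGD-style training, written with plain torch.distributed rather than torchft's LocalSGD/DiLoCo classes; the function name and default interval are placeholders, not part of the torchft API.

import torch
import torch.distributed as dist

def maybe_average_params(model: torch.nn.Module, step: int, sync_every: int = 40) -> None:
    """Average parameters across replicas every `sync_every` steps (LocalSGD-style).

    Conceptual sketch only: torchft's LocalSGD/DiLoCo add fault tolerance,
    outer optimizers, and quorum handling on top of this basic idea.
    """
    if step % sync_every != 0:
        return  # train locally; no cross-replica communication this step
    world_size = dist.get_world_size()
    for param in model.parameters():
        # Sum each parameter across all replicas, then divide to average.
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        param.data /= world_size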

We’re always keeping an eye out for new algorithms, such as our upcoming support for streaming DiLoCo. If you have a new use case you’d like to collaborate on, please reach out!

Cluster Setup

Crusoe graciously lent us a cluster of 300 L40S GPUs. The GPUs were split up across 30 hosts, each with 10 NVIDIA L40S GPUs.

For the model, we used torchtitan with a Llama 3 model with 1B parameters to match the hardware available. 

NVIDIA L40S GPUs are typically used for inference, so they gave us an opportunity to test torchft in a non-traditional environment where algorithms such as DiLoCo really shine, because the TCP-only network (no InfiniBand/NVLink) is the bottleneck. The L40S has 48GB of VRAM (closer to consumer GPUs), so we used a smaller model and batch size. The average step time for training was ~9s.

To maximize performance with the limited network, we trained the model in a 30x1x10 configuration. We had 30 replica groups (fault tolerant domains), each with 1 host and 10 gpus/workers. torchft can have many, many hosts in each replica group, but for this cluster, a single host/10 gpus per replica group had the best performance due to limited network bandwidth. We ran with 30 replica groups, as more groups stressed the coordination and reconfiguration algorithms more.

For network communication, we used NCCL for all communication (i.e., FSDP) within each replica group and Gloo for communication across replica groups. Gloo, while often not as performant, initializes much faster and can also fail much faster, which is important for quick detection of failures. torchft does support fault tolerance using NCCL on InfiniBand clusters, with some caveats, but we didn’t use it in this demo. Since we wanted to maximize the total number of failures and recoveries, we used Gloo, which can reinitialize in under 1 second for our use case, and we set the timeout on all operations to 5 seconds.
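
For illustration, a short timeout on a Gloo process group can be set in plain PyTorch as shown below; torchft manages its own process groups, so the exact mechanism inside torchft may differ.

from datetime import timedelta

import torch.distributed as dist

# Initialize a Gloo process group with an aggressive 5-second timeout so that
# hung collectives surface as errors quickly instead of stalling the job.
dist.init_process_group(
    backend="gloo",
    timeout=timedelta(seconds=5),
)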

For the fault tolerance algorithms, we did the bulk of the testing with Fault Tolerant HSDP, as it stresses the communication and quorum layers the most. For the final test, we used DiLoCo, which is a better fit for the ethernet based cluster.

Recovering with No Checkpoints

Traditional machine learning achieves “fault tolerance” by reloading from checkpoints when an error occurs. This involves a complete stop-the-world operation where all workers restart and load from the most recently persisted checkpoint.

With torchft, we instead focus on isolating failures to an individual group of GPUs. When an error occurs within a group, we can restart it asynchronously while all other groups reconfigure and continue training without it.

When that group recovers through a restart or the scheduler replaces the machines, those workers no longer have a valid copy of the weights and optimizer states. If we tried to recover from a checkpoint, the other groups would have already moved on. Instead, we rely on an asynchronous weight transfer at runtime. This does a peer-to-peer transfer of the weights from a healthy replica.

Since we’re always recovering from another worker, it turns out that we don’t need any checkpoints at all, as long as we can guarantee that at least one group is healthy. For this demonstration, we turned off checkpointing entirely, since a persistent checkpoint save and load takes much longer than our P2P recovery time.

Here’s a diagram showing how a recovering replica (replica 1) can join the quorum and recover from a healthy peer (replica 0) without having any downtime or impacting the healthy worker training:

torchft adapts a number of concepts from distributed databases:

  • The quorum operation determines which workers are healthy using frequent heartbeats and guarantees that we can quickly determine which workers are alive, exchange metadata in a fault tolerant way, and enforce no split-brain conditions.
  • To ensure consistency and identify when we need to recover a worker, we effectively treat training with traditional database semantics. Traditional databases use “transactions” where each operation is either committed (entirely applied) or rolled back (discarded). torchft treats each training step the same way. Each training step within a replica group is handled as a distributed transaction, where we ensure all workers commit the step by stepping the optimizer or, if an error occurs, they all roll back by discarding the gradients (see the sketch after this list).
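
As a rough sketch of how a step maps onto this commit/rollback model, the function below gates the optimizer on the should_commit check mentioned later in this post; treat the names and call signatures as illustrative rather than a verbatim torchft API reference.

def train_step(manager, model, optimizer, batch):
    """One fault tolerant "transaction": commit the step only if the whole
    replica group completed it; otherwise the gradients are simply discarded."""
    optimizer.zero_grad()        # in torchft this also starts the async quorum
    loss = model(batch).sum()
    loss.backward()              # gradients are all-reduced fault tolerantly

    if manager.should_commit():  # did every worker in the group finish the step?
        optimizer.step()         # commit: apply the update
    # else: rollback -- the stale gradients are dropped and training continues
    return loss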

For more details, please see the torchft README, which has links to the documentation, design docs, and presentations. 

Training Loop Integration

TorchFT is already integrated with TorchTitan, so enabling it is just a matter of setting a configuration flag. For a typical model, torchft provides wrappers that automatically call hooks into torchft’s Manager to provide fault tolerance.

import torch.nn as nn
import torch.optim as optim

from torchft import Manager, DistributedDataParallel, Optimizer, ProcessGroupGloo

# Instantiate your model and optimizer as normal
m = nn.Linear(2, 3)
optimizer = optim.AdamW(m.parameters())

# Setup torchft Manager and wrap the model and optimizer.
manager = Manager(
    pg=ProcessGroupGloo(),
    load_state_dict=lambda state_dict: m.load_state_dict(state_dict),
    state_dict=lambda: m.state_dict(),
)
m = DistributedDataParallel(manager, m)
optimizer = Optimizer(manager, optimizer)

for batch in dataloader:
    # When you call zero_grad, we start the asynchronous quorum operation 
    # and perform the async weights recovery if necessary.
    optimizer.zero_grad()

    out = m(batch)
    loss = out.sum()

    # The gradient allreduces will be done via torchft's fault tolerant 
    # ProcessGroupGloo wrapper.
    loss.backward()

    # The optimizer will conditionally step depending on whether any errors occurred.
    # The batch will be discarded if the gradient sync was interrupted.
    optimizer.step()

Fault Tolerant Scheduling

We can use standard ML job schedulers such as Slurm since the semantics for the workers within a replica group are the same as a normal job. If an error occurs on any of the workers within a group we expect the entire group to restart simultaneously. Within each replica group, the application is a completely standard training job using standard non-fault tolerant operations. 

To achieve fault tolerance on a traditional scheduler, we run multiple of these jobs. Each replica group ran on Slurm as a separate training job with the Lighthouse and a monitoring script running on the head node. All the cross-group communication is done via torchft’s managed ProcessGroup and quorum APIs. To restart groups on failure and inject failures we used a small script using the torchx Python API.

The monitoring script looks something like this:

import time

from torchx.runner import get_runner

NUM_REPLICA_GROUPS = 30
scheduler = "slurm"  # scheduler name passed to torchx

with get_runner() as runner:
    while True:
        # List jobs on the scheduler and determine which replica groups are
        # still alive. parse_replica_id() and make_app_def() are small
        # user-defined helpers, not part of the torchx API.
        jobs = runner.list(scheduler)
        active_replicas = {
            parse_replica_id(job.name)
            for job in jobs
            if not job.is_terminal()
        }

        missing_replicas = set(range(NUM_REPLICA_GROUPS)) - active_replicas

        for replica_id in missing_replicas:
            app_def = make_app_def(replica_id=replica_id)
            app_handle = runner.run(
                app_def, 
                scheduler="slurm", 
                cfg={"partition": "batch"},
            )
            print("launched:", replica_id, app_handle)

        time.sleep(5.0)

The failures were injected by cancelling the specific replica group’s Slurm job using scancel. In a real world scenario we would expect the failure to be triggered by an error in the training process which would crash that replica group in isolation rather than an external failure.
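
A failure injector in that spirit can be as small as the following sketch; the Slurm job-name pattern and the interval are assumptions for illustration, not the exact script used in these runs.

import random
import subprocess
import time

NUM_REPLICA_GROUPS = 30

while True:
    # Pick a random replica group and cancel its Slurm job by name, simulating
    # a crash of that entire group. The job-name pattern here is hypothetical.
    victim = random.randrange(NUM_REPLICA_GROUPS)
    subprocess.run(["scancel", "--name", f"torchft_replica_{victim}"], check=False)

    # Wait a random 0-30 seconds (~15s on average) before the next failure.
    time.sleep(random.uniform(0, 30))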

Metrics and Logs

To keep a consistent view of the job, we exempted one replica group from failure injection, which made it simpler to track metrics and quorum events. That group was able to consistently log the number of participants, step successes/failures, and the loss.

Since we’re doing per step fault tolerance, the number of participants and thus batch size changes per step depending on which workers are healthy.

The loss is averaged across all workers/replica groups in the job using an allreduce across replica groups.

Note: the small spikes in the loss graphs below are due to how we average the loss across all hosts, including recovering workers whose weights are out of date, which leads to an incorrectly higher loss on those steps.
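
In plain torch.distributed terms, that per-step averaging looks roughly like the sketch below; in the real job it goes through torchft's managed cross-replica process group rather than the default group.

import torch
import torch.distributed as dist

def average_loss_across_replicas(loss: torch.Tensor, group=None) -> torch.Tensor:
    """Average a scalar loss tensor across all participating replicas."""
    loss = loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM, group=group)
    return loss / dist.get_world_size(group=group)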

Runs

We ran three different runs showcasing various failure scenarios and features of torchft.

Run 1: Injected Failure Every 60s for 1100 Failures

This run lasted a little over 19 hours and 6249 steps. On average, each step took 10.9 seconds.

For the initial run, we injected a failure every 60 seconds with a very repeatable pattern. We initially had a bad machine in the cluster, so we briefly shrunk the world size to 25 hosts until the machine was replaced, and we scaled the job back up with zero downtime.

With a failure every 60s, we expected to be able to complete ~5 steps between failures without any issue. Looking at the results, we see that there were 6249 steps and 5145 successful commits. torchft is designed to be as safe as possible: if any error occurs, it discards the step via “should_commit” prior to running the optimizer.

For the overall step efficiency, we have:

5145 successful steps / 6249 total steps = 82.3%

With a step time of ~11 seconds and a failure every 60 seconds we should be able to complete 5 out of every 6 steps (83.3%) and that matches almost exactly with the measured performance.

We averaged 29.6 participating replica groups per step, so the total training efficiency was 82.3% × (29.6/30) ≈ 81.2%. Not bad for over 1000 failures.

Run 2: Injected Failure Every 15s for 1015 Failures

We wanted to see how much further we could push this and make it even more challenging. For the second run, we injected a failure at a random interval between 0 and 30 seconds, averaging one failure every 15 seconds. 

This failure rate is extreme compared to typical training jobs, whose mean time between failures ranges from tens of minutes to hours, but it lets us validate that we can recover no matter when an error happens and run a huge number of test cycles to gain confidence in our implementation.

By randomizing the failure interval, we cause failures to happen while workers are still initializing rather than in steady state and are much more likely to hit edge cases. We’re happy to report that torchft behaved as expected with no unrecoverable errors.

As you can see, this job behaves much more erratically. Whereas the 60-second failure interval left us with very close to 30 machines participating, with a failure every 15 seconds anywhere from 1 to 30 machines are available on each step. 

On average, we had 18.9 (18.9/30 = 63%) workers healthy and participating on any given step and an average step time of 15.46 seconds.

Out of the first 888 steps, 268 of those steps were committed successfully, which gives us a 30.2% step efficiency.

This gives us a training efficiency of just 13.4%, which would be terrible for any normal training job, yet it’s remarkable that the model keeps converging despite a crash every 15 seconds. Just loading a model from a checkpoint often takes longer than 1 minute.

The loss converges slower as compared to our 60s MTBF run, but that’s expected as many more batches are being discarded due to errors.

We do see some bigger spikes in the loss, which are correlated with times when only 1 participant is healthy and thus has 1/30th the batch size. This is easily avoided by adjusting the minimum number of replicas. We had it set to 1 for this test.

Run 3: Semi-synchronous Training

TorchFT also supports semi-synchronous training algorithms, including LocalSGD and DiLoCo, with plans to add more in the future. Unlike HSDP2, these algorithms do not synchronize at every step. Instead, they perform local training for several steps before synchronizing weights through averaging parameters or gradients. This approach enhances performance by reducing communication costs to once every N steps (a configurable hyperparameter), rather than at every step. Our tests on the cluster demonstrate a noticeable improvement in throughput. When synchronizing every 40 steps, we minimize the communication overhead, resulting in higher overall throughput. Below is a comparison of DiLoCo’s throughput (yellow), averaging around 4000 tps, compared with that of regular HSDP2 (purple), which averages around 1200 tps.

Naturally, the longer the interval between synchronizations, the more the models within replica groups will diverge. This divergence can potentially impact the convergence of the model. However, in our testing, we observed that the model was still able to train effectively and reach convergence despite these longer synchronization intervals. This resilience is beneficial in dynamic environments where replicas might leave the group unexpectedly. Even in such scenarios, the model demonstrated the ability to continue training without significant disruption.

Next Steps

torchft is under active development, and we have many planned improvements: newer algorithms such as streaming DiLoCo, making PyTorch Distributed more robust to failures (even on InfiniBand/NVLink!), and further efficiency gains. 

If you’re interested in using torchft please take a look at torchft README and torchft Documentation. We’d also love to chat with you, so feel free to reach out directly via GitHub, LinkedIn, or Slack.

Read More

Scaling Laws for Unsupervised Finetuning of LLMs

A widespread strategy for obtaining a language model that performs well in a target domain is to fine-tune it by training it to do unsupervised next-token prediction on data from that domain.
Fine-tuning presents two challenges: i) if the amount of target data is limited, as is the case in most practical applications, the model will quickly overfit, and ii) the model will drift away from the original model and forget the pre-training distribution.
This paper quantifies these two phenomena for several target domains, available target data, and model scales.
We also measure the efficiency of…

Apple Machine Learning Research

Build a scalable AI video generator using Amazon SageMaker AI and CogVideoX

In recent years, the rapid advancement of artificial intelligence and machine learning (AI/ML) technologies has revolutionized various aspects of digital content creation. One particularly exciting development is the emergence of video generation capabilities, which offer unprecedented opportunities for companies across diverse industries. This technology allows for the creation of short video clips that can be seamlessly combined to produce longer, more complex videos. The potential applications of this innovation are vast and far-reaching, promising to transform how businesses communicate, market, and engage with their audiences.

Video generation technology presents a myriad of use cases for companies looking to enhance their visual content strategies. For instance, ecommerce businesses can use this technology to create dynamic product demonstrations, showcasing items from multiple angles and in various contexts without the need for extensive physical photoshoots. In the realm of education and training, organizations can generate instructional videos tailored to specific learning objectives, quickly updating content as needed without re-filming entire sequences. Marketing teams can craft personalized video advertisements at scale, targeting different demographics with customized messaging and visuals. Furthermore, the entertainment industry stands to benefit greatly, with the ability to rapidly prototype scenes, visualize concepts, and even assist in the creation of animated content.

The flexibility offered by combining these generated clips into longer videos opens up even more possibilities. Companies can create modular content that can be quickly rearranged and repurposed for different displays, audiences, or campaigns. This adaptability not only saves time and resources, but also allows for more agile and responsive content strategies. As we delve deeper into the potential of video generation technology, it becomes clear that its value extends far beyond mere convenience, offering a transformative tool that can drive innovation, efficiency, and engagement across the corporate landscape.

In this post, we explore how to implement a robust AWS-based solution for video generation that uses the CogVideoX model and Amazon SageMaker AI.

Solution overview

Our architecture delivers a highly scalable and secure video generation solution using AWS managed services. The data management layer implements three purpose-specific Amazon Simple Storage Service (Amazon S3) buckets—for input videos, processed outputs, and access logging—each configured with appropriate encryption and lifecycle policies to support data security throughout its lifecycle.
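
As a rough illustration of that data layer, a CDK (Python) definition for one of these buckets might look like the following; the construct ID and lifecycle period are placeholders, not the stack's actual values.

from aws_cdk import Duration, RemovalPolicy, aws_s3 as s3

# Inside the stack's __init__: one purpose-specific bucket with encryption,
# enforced TLS, and a lifecycle rule to expire stale objects.
input_bucket = s3.Bucket(
    self,
    "InputVideosBucket",  # placeholder construct ID
    encryption=s3.BucketEncryption.S3_MANAGED,
    enforce_ssl=True,
    removal_policy=RemovalPolicy.RETAIN,
    lifecycle_rules=[
        s3.LifecycleRule(expiration=Duration.days(90)),  # illustrative retention
    ],
)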

For compute resources, we use AWS Fargate for Amazon Elastic Container Service (Amazon ECS) to host the Streamlit web application, providing serverless container management with automatic scaling capabilities. Traffic is efficiently distributed through an Application Load Balancer. The AI processing pipeline uses SageMaker AI processing jobs to handle video generation tasks, decoupling intensive computation from the web interface for cost optimization and enhanced maintainability. User prompts are refined through Amazon Bedrock, which feeds into the CogVideoX-5b model for high-quality video generation, creating an end-to-end solution that balances performance, security, and cost-efficiency.

The following diagram illustrates the solution architecture.

Solution Architecture

CogVideoX model

CogVideoX is an open source, state-of-the-art text-to-video generation model capable of producing 10-second continuous videos at 16 frames per second with a resolution of 768×1360 pixels. The model effectively translates text prompts into coherent video narratives, addressing common limitations in previous video generation systems.

The model uses three key innovations:

  • A 3D Variational Autoencoder (VAE) that compresses videos along both spatial and temporal dimensions, improving compression efficiency and video quality
  • An expert transformer with adaptive LayerNorm that enhances text-to-video alignment through deeper fusion between modalities
  • Progressive training and multi-resolution frame pack techniques that enable the creation of longer, coherent videos with significant motion elements

CogVideoX also benefits from an effective text-to-video data processing pipeline with various preprocessing strategies and a specialized video captioning method, contributing to higher generation quality and better semantic alignment. The model’s weights are publicly available, making it accessible for implementation in various business applications, such as product demonstrations and marketing content. The following diagram shows the architecture of the model.

Model Architecture
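
Because the weights are public, you can also try the model locally with the Hugging Face diffusers pipeline; the sketch below assumes the THUDM/CogVideoX-5b checkpoint and illustrative generation parameters, and is separate from the SageMaker-based pipeline deployed in this post.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the public CogVideoX-5b weights and move the pipeline to GPU.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A bee on a flower."

# Generation parameters are illustrative; tune them for your hardware.
frames = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "bee_on_flower.mp4", fps=8)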

Prompt enhancement

To improve the quality of video generation, the solution provides an option to enhance user-provided prompts. This is done by instructing a large language model (LLM), in this case Anthropic’s Claude, to take a user’s initial prompt and expand upon it with additional details, creating a more comprehensive description for video creation. The prompt consists of three parts:

  • Role section – Defines the AI’s purpose in enhancing prompts for video generation
  • Task section – Specifies the instructions needed to be performed with the original prompt
  • Prompt section – Where the user’s original input is inserted

By adding more descriptive elements to the original prompt, this system aims to provide richer, more detailed instructions to video generation models, potentially resulting in more accurate and visually appealing video outputs. We use the following prompt template for this solution:

"""
<Role>
Your role is to enhance the user prompt that is given to you by 
providing additional details to the prompt. The end goal is to
convert the user prompt into a short video clip, so it is necessary 
to provide as much information as you can.
</Role>
<Task>
You must add details to the user prompt in order to enhance it for
 video generation. You must provide a 1 paragraph response. No 
more and no less. Only include the enhanced prompt in your response. 
Do not include anything else.
</Task>
<Prompt>
{prompt}
</Prompt>
"""

Prerequisites

Before you deploy the solution, make sure you have the following prerequisites:

  • The AWS CDK Toolkit – Install the AWS CDK Toolkit globally using npm:
    npm install -g aws-cdk
    This provides the core functionality for deploying infrastructure as code to AWS.
  • Docker Desktop – This is required for local development and testing. It makes sure container images can be built and tested locally before deployment.
  • The AWS CLI – The AWS Command Line Interface (AWS CLI) must be installed and configured with appropriate credentials. This requires an AWS account with necessary permissions. Configure the AWS CLI using aws configure with your access key and secret.
  • Python Environment – You must have Python 3.11+ installed on your system. We recommend using a virtual environment for isolation. This is required for both the AWS CDK infrastructure and Streamlit application.
  • Active AWS account – You will need to raise a service quota request for SageMaker to ml.g5.4xlarge for processing jobs.

Deploy the solution

This solution has been tested in the us-east-1 AWS Region. Complete the following steps to deploy:

  1. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
  2. Install infrastructure dependencies:
cd infrastructure
pip install -r requirements.txt
  3. Bootstrap the AWS CDK (if not already done in your AWS account):
cdk bootstrap
  4. Deploy the infrastructure:
cdk deploy -c allowed_ips='["'$(curl -s ifconfig.me)'/32"]'

To access the Streamlit UI, choose the link for StreamlitURL in the AWS CDK output logs after deployment is successful. The following screenshot shows the Streamlit UI accessible through the URL.

User interface screenshot

Basic video generation

Complete the following steps to generate a video:

  1. Input your natural language prompt into the text box at the top of the page.
  2. Copy this prompt to the text box at the bottom.
  3. Choose Generate Video to create a video using this basic prompt.

The following is the output from the simple prompt “A bee on a flower.”

Enhanced video generation

For higher-quality results, complete the following steps:

  1. Enter your initial prompt in the top text box.
  2. Choose Enhance Prompt to send your prompt to Amazon Bedrock.
  3. Wait for Amazon Bedrock to expand your prompt into a more descriptive version.
  4. Review the enhanced prompt that appears in the lower text box.
  5. Edit the prompt further if desired.
  6. Choose Generate Video to initiate the processing job with CogVideoX.

When processing is complete, your video will appear on the page with a download option. The following is an example of an enhanced prompt and output:

"""
A vibrant yellow and black honeybee gracefully lands on a large, 
blooming sunflower in a lush garden on a warm summer day. The 
bee's fuzzy body and delicate wings are clearly visible as it 
moves methodically across the flower's golden petals, collecting 
pollen. Sunlight filters through the petals, creating a soft, 
warm glow around the scene. The bee's legs are coated in pollen 
as it works diligently, its antennae twitching occasionally. In 
the background, other colorful flowers sway gently in a light 
breeze, while the soft buzzing of nearby bees can be heard
"""

Add an image to your prompt

If you want to include an image with your text prompt, complete the following steps:

  1. Complete the text prompt and optional enhancement steps.
  2. Choose Include an Image.
  3. Upload the photo you want to use.
  4. With both text and image now prepared, choose Generate Video to start the processing job.

The following is an example of the previous enhanced prompt with an included image.

To view more samples, check out the CogVideoX gallery.

Clean up

To avoid incurring ongoing charges, clean up the resources you created as part of this post:

cdk destroy

Considerations

Although our current architecture serves as an effective proof of concept, several enhancements are recommended for a production environment. Considerations include implementing an API Gateway with AWS Lambda backed REST endpoints for improved interface and authentication, introducing a queue-based architecture using Amazon Simple Queue Service (Amazon SQS) for better job management and reliability, and enhancing error handling and monitoring capabilities.

Conclusion

Video generation technology has emerged as a transformative force in digital content creation, as demonstrated by our comprehensive AWS-based solution using the CogVideoX model. By combining powerful AWS services like Fargate, SageMaker, and Amazon Bedrock with an innovative prompt enhancement system, we’ve created a scalable and secure pipeline capable of producing high-quality video clips. The architecture’s ability to handle both text-to-video and image-to-video generation, coupled with its user-friendly Streamlit interface, makes it an invaluable tool for businesses across sectors—from ecommerce product demonstrations to personalized marketing campaigns. As showcased in our sample videos, the technology delivers impressive results that open new avenues for creative expression and efficient content production at scale. This solution represents not just a technological advancement, but a glimpse into the future of visual storytelling and digital communication.

To learn more about CogVideoX, refer to CogVideoX on Hugging Face. Try out the solution for yourself, and share your feedback in the comments.


About the Authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Natasha Tchir is a Cloud Consultant at the Generative AI Innovation Center, specializing in machine learning. With a strong background in ML, she now focuses on the development of generative AI proof-of-concept solutions, driving innovation and applied research within the GenAIIC.

Katherine Feng is a Cloud Consultant at AWS Professional Services within the Data and ML team. She has extensive experience building full-stack applications for AI/ML use cases and LLM-driven solutions.

Jinzhao Feng is a Machine Learning Engineer at AWS Professional Services. He focuses on architecting and implementing large-scale generative AI and classic ML pipeline solutions. He is specialized in FMOps, LLMOps, and distributed training.

Read More

Building trust in AI: The AWS approach to the EU AI Act

As AI adoption accelerates and reshapes our future, organizations are adapting to evolving regulatory frameworks. In our report commissioned from Strand Partners, Unlocking Europe’s AI Potential in the Digital Decade 2025, 68% of European businesses surveyed said they struggle to understand their responsibilities under the EU AI Act. European businesses also reported that an estimated 40% of their IT spend goes toward compliance-related costs, and those uncertain about regulations plan to invest 28% less in AI over the next year. More clarity around regulation and compliance is critical to meeting the competitiveness targets set out by the European Commission.

The EU AI Act

The European Union’s Artificial Intelligence Act (EU AI Act) establishes comprehensive regulations for the development, deployment, use, and provision of AI within the EU. It brings a risk-based regulatory framework with the overarching goal of protecting fundamental rights and safety. The EU AI Act entered into force on August 1, 2024, and will apply in phases, with most requirements becoming applicable over the next 14 months. The first group of obligations on prohibited AI practices and AI literacy became enforceable on February 1, 2025, with the remaining obligations to follow gradually.

AWS customers across industries use our AI services for a myriad of purposes, such as to provide better customer service, optimize their businesses, or create new experiences for their customers. We are actively evaluating how our services can best support customers to meet their compliance obligations, while maintaining AWS’s own compliance with the applicable provisions of the EU AI Act. As the European Commission continues to publish compliance guidance, such as the Guidelines on Prohibited AI Practices and the Guidelines on AI System Definition, we will continue to provide updates to our customers through our AWS Blog posts and other AWS channels.

The AWS approach to the EU AI Act

AWS has long been committed to AI solutions that are safe and respect fundamental rights. We take a people-centric approach that prioritizes education, science, and our customers’ needs to integrate responsible AI across the end-to-end AI lifecycle. As a leader in AI technology, AWS prioritizes trust in our AI offerings and supports the EU AI Act’s goal of promoting trustworthy AI products and services. We do this in several ways:

The EU AI Act requires all AI systems to meet certain requirements for fairness, transparency, accountability, and fundamental rights protection. Taking a risk-based approach, the EU AI Act establishes different categories of AI systems with corresponding requirements, and it brings obligations for all actors across the AI supply chain, including providers, deployers, distributors, users, and importers. AI systems deemed to pose unacceptable risks are prohibited. High-risk AI systems are allowed, but they are subject to stricter requirements for documentation, data governance, human oversight, and risk management procedures. In addition, certain AI systems (for example, those intended to interact directly with natural persons) are considered low risk and subject to transparency requirements.

Apart from the requirements for AI systems, the EU AI Act also brings a separate set of obligations for providers of general-purpose AI (GPAI) models, depending on whether they pose systemic risks or not.

The EU AI Act may apply to activities both inside and outside the EU. Therefore, even if your organization is not established in the EU, you may still be required to comply with the EU AI Act. We encourage all AWS customers to conduct a thorough assessment of their AI activities to determine whether they are subject to the EU AI Act and their specific obligations, regardless of their location.

Prohibited use cases

Since February 1, 2025, the EU AI Act has prohibited certain AI practices deemed to present unacceptable risks to fundamental rights. These prohibitions, a full list of which is available under Article 5 of the EU AI Act, generally focus on manipulative or exploitative practices that can be harmful or abusive, and on the evaluation or classification of individuals based on social behavior, personal traits, or biometric data.

AWS is committed to making sure our AI services meet applicable regulatory requirements, including those of the EU AI Act. Although AWS services support a wide range of customer use case categories, none are designed or intended for practices prohibited under the EU AI Act, and we maintain this commitment through our policies, including the AWS Acceptable Use Policy, Responsible AI Policy, and Responsible Use of AI Guide.

Compliance with the EU AI Act is a shared journey: the regulation sets out responsibilities for developers (providers) and deployers of AI systems, and although AWS provides the building blocks for compliant solutions, AWS customers remain responsible for assessing how their use of AWS services falls under the EU AI Act, implementing appropriate controls for their AI applications, and making sure their specific use cases comply with the EU AI Act’s restrictions. We encourage AWS customers to carefully review the list of prohibited practices under the EU AI Act when building AI solutions using AWS services, and to review the European Commission’s recently published guidelines on prohibited practices.

Moving forward with the EU AI Act

As the regulatory landscape continues to evolve, customers should stay informed about the EU AI Act and assess how it applies to their organization’s use of AI. AWS remains engaged with EU institutions and relevant authorities across EU member states on the enforcement of the EU AI Act. We participate in industry dialogues and contribute our knowledge and experience to support balanced outcomes that safeguard against risks of this technology, particularly where AI use cases have the potential to affect individuals’ health and safety or fundamental rights, while enabling continued AI innovation in ways that will benefit all. We will continue to update our customers through our AWS ML Blog posts and other AWS channels as new guidance emerges and additional portions of the EU AI Act take effect.

If you have questions about compliance with the EU AI Act, or if you require additional information on AWS AI governance tools and resources, please contact your account representative or request to be contacted.

If you’d like to join our community of innovators, learn about upcoming events, and gain expert insights, practical guidance, and connections that help you navigate the regulatory landscape, please express interest by registering.

Read More

Update on the AWS DeepRacer Student Portal

The AWS DeepRacer Student Portal will no longer be available starting September 15, 2025. This change comes as part of the broader transition of AWS DeepRacer from a service to an AWS Solution, representing an evolution in how we deliver AI & ML education. Since its launch, the AWS DeepRacer Student Portal has helped thousands of learners begin their AI & ML journey through hands-on reinforcement learning experiences. The portal has served as a foundational stepping stone for many who have gone on to pursue career development in AI through the AWS AI & ML Scholars program, which has been re-launched with a generative AI focused curriculum.

Starting July 14, 2025, the AWS DeepRacer Student Portal will enter a maintenance phase in which new registrations will be disabled. Until September 15, 2025, existing users will retain full access to their content and training materials, with updates limited to critical security fixes, after which the portal will no longer be available. Going forward, AWS DeepRacer will be available as a solution in the AWS Solutions Library, providing educational institutions and organizations with greater capabilities to build and customize their own DeepRacer learning experiences.

As part of our commitment to advancing AI & ML education, we recently launched the enhanced AWS AI & ML Scholars program on May 28, 2025. This new program embraces the latest developments in generative AI, featuring hands-on experience with AWS PartyRock and Amazon Q. The curriculum focuses on practical applications of AI technologies and emerging skills, reflecting the evolving needs of the technology industry and preparing students for careers in AI. To learn more about the new AI & ML Scholars program and continue your learning journey, visit awsaimlscholars.com. In addition, users can also explore AI learning content and build in-demand cloud skills using AWS Skill Builder.

We’re grateful to the entire AWS DeepRacer Student community for their enthusiasm and engagement, and we look forward to supporting the next chapter of your AI & ML learning journey.


About the author

Jayadev Kalla is a Product Manager with the AWS Social Responsibility and Impact team, focusing on AI & ML education. His goal is to expand access to AI education through hands-on learning experiences. Outside of work, Jayadev is a sports enthusiast and loves to cook.

Read More

Accelerate foundation model training and inference with Amazon SageMaker HyperPod and Amazon SageMaker Studio

Modern generative AI model providers require unprecedented computational scale, with pre-training often involving thousands of accelerators running continuously for days, and sometimes months. Foundation Models (FMs) demand distributed training clusters — coordinated groups of accelerated compute instances, using frameworks like PyTorch — to parallelize workloads across hundreds of accelerators (like AWS Trainium and AWS Inferentia chips or NVIDIA GPUs).

Orchestrators like SLURM and Kubernetes manage these complex workloads, scheduling jobs across nodes, managing cluster resources, and processing requests. Paired with AWS infrastructure like Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing instances, Elastic Fabric Adapter (EFA), and distributed file systems like Amazon Elastic File System (Amazon EFS) and Amazon FSx, these ultra clusters can run large-scale machine learning (ML) training and inference, handling parallelism, gradient synchronization and collective communications, and even routing and load balancing. However, at scale, even robust orchestrators face challenges around cluster resilience. Distributed training workloads specifically run synchronously, because each training step requires participating instances to complete their calculations before proceeding to the next step. This means that if a single instance fails, the entire job fails. The likelihood of these failures increases with the size of the cluster.

Although resilience and infrastructure reliability can be a challenge, developer experience remains equally pivotal. Traditional ML workflows create silos, where data and research scientists prototype on local Jupyter notebooks or Visual Studio Code instances, lacking access to cluster-scale storage, and engineers manage production jobs through separate SLURM or Kubernetes (kubectl or helm, for example) interfaces. This fragmentation has consequences, including mismatches between notebook and production environments, lack of local access to cluster storage, and most importantly, sub-optimal use of ultra clusters.

In this post, we explore these challenges. In particular, we propose a solution to enhance the data scientist experience on Amazon SageMaker HyperPod—a resilient ultra cluster solution.

Amazon SageMaker HyperPod

SageMaker HyperPod is a compute environment purpose built for large-scale frontier model training. You can build resilient clusters for ML workloads and develop state-of-the-art frontier models. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual intervention, which means you can train in distributed settings for weeks or months with minimal disruption.

To learn more about the resilience and Total Cost of Ownership (TCO) benefits of SageMaker HyperPod, check out Reduce ML training costs with Amazon SageMaker HyperPod. As of writing this post, SageMaker HyperPod supports both SLURM and Amazon Elastic Kubernetes Service (Amazon EKS) as orchestrators.

To deploy a SageMaker HyperPod cluster, refer to the SageMaker HyperPod workshops (SLURM, Amazon EKS). To learn more about what’s being deployed, check out the architecture diagrams later in this post. You can choose to use either of the two orchestrators based on your preference.

Amazon SageMaker Studio

Amazon SageMaker Studio is a fully integrated development environment (IDE) designed to streamline the end-to-end ML lifecycle. It provides a unified, web-based interface where data scientists and developers can perform ML tasks, including data preparation, model building, training, tuning, evaluation, deployment, and monitoring.

By centralizing these capabilities, SageMaker Studio alleviates the need to switch between multiple tools, significantly enhancing productivity and collaboration. SageMaker Studio supports a variety of IDEs, such as JupyterLab Notebooks, Code Editor based on Code-OSS, Visual Studio Code Open Source, and RStudio, offering flexibility for diverse development preferences. SageMaker Studio supports private and shared spaces, so teams can collaborate effectively while optimizing resource allocation. Shared spaces allow multiple users to access the same compute resources across profiles, and private spaces provide dedicated environments for individual users. This flexibility empowers data scientists and developers to seamlessly scale their compute resources and enhance collaboration within SageMaker Studio. Additionally, it integrates with advanced tooling like managed MLflow and Partner AI Apps to streamline experiment tracking and accelerate AI-driven innovation.

Distributed file systems: Amazon FSx

Amazon FSx for Lustre is a fully managed file storage service designed to provide high-performance, scalable, and cost-effective storage for compute-intensive workloads. Powered by the Lustre architecture, it’s optimized for applications requiring access to fast storage, such as ML, high-performance computing, video processing, financial modeling, and big data analytics.

FSx for Lustre delivers sub-millisecond latencies, scaling up to 1 GBps per TiB of throughput, and millions of IOPS. This makes it ideal for workloads demanding rapid data access and processing. The service integrates with Amazon Simple Storage Service (Amazon S3), enabling seamless access to S3 objects as files and facilitating fast data transfers between Amazon FSx and Amazon S3. Updates in S3 buckets are automatically reflected in FSx file systems and vice versa. For more information on this integration, check out Exporting files using HSM commands and Linking your file system to an Amazon S3 bucket.

Theory behind mounting an FSx for Lustre file system to SageMaker Studio spaces

You can use FSx for Lustre as a shared high-performance file system to connect SageMaker Studio domains with SageMaker HyperPod clusters, streamlining ML workflows for data scientists and researchers. By using FSx for Lustre as a shared volume, you can build and refine your training or fine-tuning code using IDEs like JupyterLab and Code Editor in SageMaker Studio, prepare datasets, and save your work directly in the FSx for Lustre volume. This same volume is mounted by SageMaker HyperPod during the execution of training workloads, enabling direct access to prepared data and code without the need for repetitive data transfers or custom image creation. Data scientists can iteratively make changes, prepare data, and submit training workloads directly from SageMaker Studio, providing consistency across development and execution environments while enhancing productivity. This integration alleviates the overhead of moving data between environments and provides a seamless workflow for large-scale ML projects requiring high throughput and low-latency storage.

You can configure FSx for Lustre volumes to provide file system access to SageMaker Studio user profiles in two distinct ways, each tailored to different collaboration and data management needs.

Option 1: Shared file system partition across every user profile

Infrastructure administrators can set up a single FSx for Lustre file system partition shared across user profiles within a SageMaker Studio domain, as illustrated in the following diagram.

Figure 1: A FSx for Lustre file system partition shared across multiple user profiles within a single SageMaker Studio Domain

  • Shared project directories – Teams working on large-scale projects can collaborate seamlessly by accessing a shared partition. This makes it possible for multiple users to work on the same files, datasets, and FMs without duplicating resources.
  • Simplified file management – You don’t need to manage private storage; instead, you can rely on the shared directory for your file-related needs, reducing complexity.
  • Improved data governance and security – The shared FSx for Lustre partition is centrally managed by the infrastructure admin, enabling robust access controls and data policies to maintain security and integrity of shared resources.

Option 2: Dedicated file system partition for each user profile 

Alternatively, administrators can configure dedicated FSx for Lustre file system partitions for each individual user profile in SageMaker Studio, as illustrated in the following diagram.

Figure 2: A FSx for Lustre file system with a dedicated partition per user

This setup provides personalized storage and facilitates data isolation. Key benefits include:

  • Individual data storage and analysis – Each user gets a private partition to store personal datasets, models, and files. This facilitates independent work on projects with clear segregation by user profile.
  • Centralized data management – Administrators retain centralized control over the FSx for Lustre file system, facilitating secure backups and direct access while maintaining data security for users.
  • Cross-instance file sharing – You can access your private files across multiple SageMaker Studio spaces and IDEs, because the FSx for Lustre partition provides persistent storage at the user profile level.

Solution overview

The following diagram illustrates the architecture of SageMaker HyperPod with SLURM integration.

Figure 3: Architecture Diagram for SageMaker HyperPod with Slurm as the orchestrator

The following diagram illustrates the architecture of SageMaker HyperPod with Amazon EKS integration.

Figure 4: Architecture Diagram for SageMaker HyperPod with EKS as the orchestrator

These diagrams illustrate what you provision as part of this solution. In addition to the SageMaker HyperPod cluster you already have, you provision a SageMaker Studio domain and attach the cluster’s FSx for Lustre file system to it. Depending on the value you choose for the SharedFSx parameter, the file system is mounted either as a single partition shared across the user profiles you configure within your SageMaker domain, or as multiple partitions for multiple isolated users. To learn more about this distinction, refer to the earlier section on the theory behind mounting an FSx for Lustre file system to SageMaker Studio spaces.

In the following sections, we walk through this integration on a SageMaker HyperPod with Amazon EKS cluster, showing how to:

  1. Attach a SageMaker Studio domain.
  2. Use that domain to fine-tune the DeepSeek-R1-Distill-Qwen-14B model with the FreedomIntelligence/medical-o1-reasoning-SFT dataset.

Prerequisites

This post assumes that you have a SageMaker HyperPod cluster.

Deploy resources using AWS CloudFormation

As part of this integration, we provide an AWS CloudFormation stack template (SLURM, Amazon EKS). Before deploying the stack, make sure you have a SageMaker HyperPod cluster set up.

In the stack for SageMaker HyperPod with SLURM, you create the following resources:

  • A SageMaker Studio domain.
  • Lifecycle configurations that install the necessary packages for the SageMaker Studio IDEs, including SLURM. Lifecycle configurations are created for both JupyterLab and Code Editor, and they configure your Code Editor or JupyterLab instance to act as a login node for your SageMaker HyperPod cluster.
  • An AWS Lambda function that:
    • Associates the created security-group-for-inbound-nfs security group with the SageMaker Studio domain.
    • Associates the security-group-for-inbound-nfs security group with the FSx for Lustre elastic network interfaces (ENIs).
    • Depending on the SharedFSx parameter:
      • If SharedFSx is set to True, creates a single shared partition in the FSx for Lustre volume and associates it with the SageMaker Studio domain.
      • If SharedFSx is set to False, creates a /{user_profile_name} partition and associates it with each SageMaker Studio user profile.

In the stack for SageMaker HyperPod with Amazon EKS, you create the following resources:

  • A SageMaker Studio domain.
  • Lifecycle configurations that install the necessary packages for the SageMaker Studio IDEs, such as kubectl and jq. Lifecycle configurations are created for both JupyterLab and Code Editor.
  • A Lambda function that:
    • Associates the created security-group-for-inbound-nfs security group with the SageMaker Studio domain.
    • Associates the security-group-for-inbound-nfs security group with the FSx for Lustre ENIs.
    • Depending on the SharedFSx parameter:
      • If SharedFSx is set to True, creates a single shared partition in the FSx for Lustre volume and associates it with the SageMaker Studio domain.
      • If SharedFSx is set to False, creates a /{user_profile_name} partition and associates it with each SageMaker Studio user profile.

The main difference between the two implementations is in the lifecycle configurations for the JupyterLab or Code Editor servers, because you interact with the cluster differently depending on the orchestrator (kubectl or helm for Amazon EKS, and ssm or ssh for SLURM). In addition to mounting your cluster’s FSx for Lustre file system, for SageMaker HyperPod with Amazon EKS, the lifecycle scripts configure your JupyterLab or Code Editor server to run the familiar Kubernetes command line interfaces, including kubectl, eksctl, and helm. They also preconfigure your kubectl context, so the cluster is ready to use as soon as your JupyterLab or Code Editor instance is up.

You can find the lifecycle configuration for SageMaker HyperPod with Amazon EKS in the deployed CloudFormation stack template. SLURM works a bit differently: we designed the lifecycle configuration so that your JupyterLab or Code Editor instance serves as a login node for your SageMaker HyperPod with SLURM cluster. Login nodes let you log in to the cluster, submit jobs, and view and manipulate data without running on the critical slurmctld scheduler node. They also make it possible to run monitoring servers such as aim, TensorBoard, Grafana, or Prometheus. The lifecycle configuration therefore automatically installs SLURM and configures it so that you can interface with your cluster from your JupyterLab or Code Editor instance. You can find the script used to configure SLURM on these instances on GitHub.

Both configurations use the same logic to mount the file systems. The steps described in Adding a custom file system to a domain are implemented by a custom resource (Lambda function) defined in the CloudFormation stack template.
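The following abbreviated Python sketch illustrates the kind of logic such a custom resource can run. The resource IDs are placeholders, and the actual Lambda function in the stack also handles CloudFormation response signaling and the partition setup described earlier.

import boto3

ec2 = boto3.client("ec2")
fsx = boto3.client("fsx")
sagemaker = boto3.client("sagemaker")

def associate_security_group(domain_id: str, file_system_id: str, sg_id: str) -> None:
    """Attach the inbound-NFS security group to the Studio domain and the FSx ENIs."""
    # Add the security group to the domain's default user settings so that
    # Studio spaces can reach the FSx for Lustre mount targets.
    domain = sagemaker.describe_domain(DomainId=domain_id)
    current_sgs = domain["DefaultUserSettings"].get("SecurityGroups", [])
    sagemaker.update_domain(
        DomainId=domain_id,
        DefaultUserSettings={"SecurityGroups": sorted(set(current_sgs + [sg_id]))},
    )

    # Add the same security group to every network interface of the file system.
    file_system = fsx.describe_file_systems(FileSystemIds=[file_system_id])["FileSystems"][0]
    for eni_id in file_system["NetworkInterfaceIds"]:
        eni = ec2.describe_network_interfaces(NetworkInterfaceIds=[eni_id])["NetworkInterfaces"][0]
        groups = {group["GroupId"] for group in eni["Groups"]} | {sg_id}
        ec2.modify_network_interface_attribute(NetworkInterfaceId=eni_id, Groups=list(groups))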

For more details on deploying these provided stacks, check out the respective workshop pages for SageMaker HyperPod with SLURM and SageMaker HyperPod with Amazon EKS.

Data science journey on SageMaker HyperPod with SageMaker Studio

As a data scientist, after you set up the SageMaker HyperPod and SageMaker Studio integration, you can log in to the SageMaker Studio environment through your user profile.

Figure 5: You can log in to your SageMaker Studio environment through your created user profile.

In SageMaker Studio, you can select your preferred IDE to start prototyping your fine-tuning workload and create the MLflow tracking server to track training and system metrics during the execution of the workload.

Figure 6: Select your preferred IDE to connect to your HyperPod cluster

The SageMaker HyperPod clusters page provides information about the available clusters and details on the nodes.

Figures 7,8: You can also see information about your SageMaker HyperPod cluster on SageMaker Studio

For this post, we selected Code Editor as our preferred IDE. The automation provided by this solution preconfigures the FSx for Lustre file system and the lifecycle configuration that installs the modules needed to submit workloads on the cluster using the hyperpod-cli or kubectl. For the instance type, you can choose from a wide range of available instances; we opted for the default ml.t3.medium.

Figure 9: CodeEditor configuration

The development environment already presents the partition mounted as a file system, where you can start prototyping your code for data preparation or model fine-tuning. For this example, we fine-tune DeepSeek-R1-Distill-Qwen-14B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.

Figure 10: Your cluster’s files are accessible directly in your Code Editor space because the shared file system is mounted there, so you can develop locally and deploy onto your cluster.

The repository is organized as follows:

  • download_model.py – The script to download the open source model directly into the FSx for Lustre volume. This provides faster and more consistent execution of the training workload on SageMaker HyperPod.
  • scripts/dataprep.py – The script to download and prepare the dataset for the fine-tuning workload. It formats the dataset using the prompt style defined for the DeepSeek R1 models and saves the result in the FSx for Lustre volume, which speeds up the training workload by avoiding asset copies from other data repositories.
  • scripts/train.py – The script containing the fine-tuning logic, using open source modules like Hugging Face transformers and optimization and distribution techniques using FSDP and QLoRA.
  • scripts/evaluation.py – The script to run ROUGE evaluation on the fine-tuned model.
  • pod-finetuning.yaml – The manifest file containing the definition of the container used to execute the fine-tuning workload on the SageMaker HyperPod cluster.
  • pod-evaluation.yaml – The manifest file containing the definition of the container used to execute the evaluation workload on the SageMaker HyperPod cluster.

After downloading the model and preparing the dataset for the fine-tuning, you can start prototyping the fine-tuning script directly in the IDE.
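As an illustration, a data preparation step along the lines of scripts/dataprep.py could look like the following sketch. The dataset column names, the prompt template, and the FSx mount path are assumptions for illustration; the repository script defines the exact format.

from datasets import load_dataset

# Placeholder for where the FSx for Lustre partition is mounted in your Studio space.
FSX_DIR = "/path/to/fsx/deepseek-r1-distill-qwen-14b"

# Assumed DeepSeek R1 style prompt; the repository script defines the exact template.
PROMPT_TEMPLATE = (
    "### Question:\n{question}\n\n"
    "### Response:\n<think>\n{cot}\n</think>\n{answer}"
)

def format_example(example):
    # Column names ("Question", "Complex_CoT", "Response") are taken from the
    # public dataset card and are assumptions here.
    return {
        "text": PROMPT_TEMPLATE.format(
            question=example["Question"],
            cot=example["Complex_CoT"],
            answer=example["Response"],
        )
    }

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.map(format_example, remove_columns=dataset.column_names)
splits = dataset.train_test_split(test_size=0.1, seed=42)

# Save directly to the mounted partition so the HyperPod workload can read the
# prepared data without any copy step.
splits["train"].save_to_disk(f"{FSX_DIR}/data/train/")
splits["test"].save_to_disk(f"{FSX_DIR}/data/test/")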

Figure 11: You can start developing locally!

Updates made to the script are automatically reflected in the container that executes the workload, because the code lives on the shared volume. When you’re ready, you can define the manifest file for the execution of the workload on SageMaker HyperPod. In the following code, we highlight the key components of the manifest. For a complete example of a Kubernetes manifest file, refer to the awsome-distributed-training GitHub repository.

...

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: deepseek-r1-qwen-14b-fine-tuning
spec:
  ...
  pytorchReplicaSpecs:
    Worker:
      replicas: 8
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: deepseek-r1-distill-qwen-14b-fine-tuning
        spec:
          volumes:
            - name: shmem
              hostPath: 
                path: /dev/shm
            - name: local
              hostPath:
                path: /mnt/k8s-disks/0
            - name: fsx-volume
              persistentVolumeClaim:
                claimName: fsx-claim
          serviceAccountName: eks-hyperpod-sa
          containers:
            - name: pytorch
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-ec2
              imagePullPolicy: Always
              resources:
                requests:
                  nvidia.com/gpu: 1
                  vpc.amazonaws.com/efa: 1
                limits:
                  nvidia.com/gpu: 1
                  vpc.amazonaws.com/efa: 1
              ...
              command:
                - /bin/bash
                - -c
                - |
                  pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt && \
                  torchrun \
                  --nnodes=8 \
                  --nproc_per_node=1 \
                  /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py \
                  --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml
              volumeMounts:
                - name: shmem
                  mountPath: /dev/shm
                - name: local
                  mountPath: /local
                - name: fsx-volume
                  mountPath: /data

The key components are as follows:

  • replicas: 8 – This specifies that eight worker pods will be created for this PyTorchJob. This is particularly important for distributed training because it determines the scale of your training job. Having eight replicas means your PyTorch training will be distributed across eight separate pods, allowing for parallel processing and faster training times.
  • Persistent volume configuration – This includes the following:
    • name: fsx-volume – Defines a named volume that will be used for storage.
    • persistentVolumeClaim – Indicates this is using Kubernetes’s persistent storage mechanism.
    • claimName: fsx-claim – References a pre-created PersistentVolumeClaim, pointing to an FSx for Lustre file system used in the SageMaker Studio environment.
  • Container image – The image field points to a PyTorch training container hosted in Amazon ECR (shown with a placeholder account ID); replace it with a training image that your cluster nodes can pull.
  • Training command – The command section shows the execution instructions for the training workload:
    • pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt – Installs dependencies at runtime to customize the container with the packages and modules required for the fine-tuning workload.
    • torchrun … /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py – Launches the training script from the shared FSx for Lustre file system, in the partition created for the SageMaker Studio user profile Data-Scientist.
    • --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml – Passes the configuration file that defines the training parameters and additional variables used during the execution of the workload.

The args-fine-tuning.yaml file contains the training parameters to provide to the script. In addition, the training script saves training and system metrics to the managed MLflow server in SageMaker Studio if the tracking server’s Amazon Resource Name (ARN) and an experiment name are provided:

# Location in the FSx for Lustre file system where the base model was saved
model_id: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/DeepSeek-R1-Distill-Qwen-14B"
mlflow_uri: "${MLFLOW_ARN}"
mlflow_experiment_name: "deepseek-r1-distill-llama-8b-agent"
# sagemaker specific parameters
# File system path where the workload will store the model 
output_dir: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/model/"
# File system path where the workload can access the dataset train dataset
train_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/train/"
# File system path where the workload can access the dataset test dataset
test_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/test/"
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1                 
learning_rate: 2e-4                    # learning rate
num_train_epochs: 1                    # number of training epochs
per_device_train_batch_size: 2         # batch size per device during training
per_device_eval_batch_size: 2          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
gradient_checkpointing: true           # use gradient checkpointing
bf16: true                             # use bfloat16 precision
tf32: false                            # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config: 
    backward_prefetch: "backward_pre"
    cpu_ram_efficient_loading: true
    offload_params: true
    forward_prefetch: false
    use_orig_params: true
merge_weights: true

The parameters model_id, output_dir, train_dataset_path, and test_dataset_path follow the same logic described for the manifest file: they point to the location where the FSx for Lustre volume is mounted in the container, under the Data-Scientist partition created for the SageMaker Studio user profile.

When you have finished the development of the fine-tuning script and defined the training parameters for the workload, you can deploy the workload with the following commands:

$ kubectl apply -f pod-finetuning.yaml
service/etcd unchanged
deployment.apps/etcd unchanged
pytorchjob.kubeflow.org/deepseek-r1-qwen-14b-fine-tuning created
$ kubectl get pods
NAME                                        READY   STATUS    RESTARTS   AGE
deepseek-r1-qwen-14b-fine-tuning-worker-0   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-1   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-2   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-3   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-4   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-5   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-6   1/1     Running   0          2m7s
deepseek-r1-qwen-14b-fine-tuning-worker-7   1/1     Running   0          2m7s
...

You can explore the logs of the workload execution directly from the SageMaker Studio IDE.

Figure 12: View the logs of the submitted training run directly in your CodeEditor terminal

You can track training and system metrics from the managed MLflow server in SageMaker Studio.

Figure 13: SageMaker Studio directly integrates with a managed MLFlow server. You can use it to track training and system metrics directly from your Studio Domain

In the SageMaker HyperPod cluster sections, you can explore cluster metrics thanks to the integration of SageMaker Studio with SageMaker HyperPod observability.

Figure 14: You can view additional cluster level/infrastructure metrics in the “Compute” -> “SageMaker HyperPod clusters” section, including GPU utilization.

At the conclusion of the fine-tuning workload, you can use the same cluster to run a batch evaluation workload by deploying the pod-evaluation.yaml manifest, which evaluates the fine-tuned model using ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum). These metrics measure the similarity between machine-generated text and human-written reference text.

The evaluation script uses the same SageMaker HyperPod cluster and compares results with the previously downloaded base model.
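The snippet below sketches how such a comparison can be computed with the Hugging Face evaluate library; the sample texts are placeholders, and scripts/evaluation.py may structure this differently.

import evaluate  # also requires the rouge_score package

rouge = evaluate.load("rouge")

# Placeholder generations: in the real workload these come from running the
# base and fine-tuned models over the held-out test split on the cluster.
references = ["Start a low-dose beta blocker and schedule a follow-up in two weeks."]
candidates = {
    "base": ["Start beta blockers and review the patient soon."],
    "fine-tuned": ["Begin a low-dose beta blocker and schedule a follow-up in two weeks."],
}

for name, predictions in candidates.items():
    # compute() returns ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum F-measures.
    scores = rouge.compute(predictions=predictions, references=references)
    print(name, {metric: round(value, 3) for metric, value in scores.items()})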

Clean up

To clean up your resources to avoid incurring more charges, follow these steps:

  1. Delete unused SageMaker Studio resources.
  2. Optionally, delete the SageMaker Studio domain.
  3. If you created a SageMaker HyperPod cluster, delete the cluster to stop incurring costs.
  4. If you created the networking stack from the SageMaker HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion

In this post, we discussed how SageMaker HyperPod and SageMaker Studio can improve and speed up the development experience of data scientists by combining the IDEs and tooling of SageMaker Studio with the scalability and resiliency of SageMaker HyperPod with Amazon EKS. The solution also simplifies setup for administrators of the centralized cluster by relying on the governance and security capabilities offered by AWS services.

We recommend starting your journey by exploring the workshops Amazon EKS Support in Amazon SageMaker HyperPod and Amazon SageMaker HyperPod, and prototyping your customized large language model by using the resources available in the awsome-distributed-training GitHub repository.

A special thanks to our colleagues Nisha Nadkarni (Sr. WW Specialist SA GenAI), Anoop Saha (Sr. Specialist WW Foundation Models), and Mair Hasco (Sr. WW GenAI/ML Specialist) in the AWS ML Frameworks team, for their support in the publication of this post.


About the authors

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends, exploring new places, and traveling to new destinations.

Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Read More

Step Inside the Vault: The ‘Borderlands’ Series Arrives on GeForce NOW

GeForce NOW is throwing open the vault doors to welcome the legendary Borderlands series to the cloud.

Whether a seasoned Vault Hunter or new to the mayhem of Pandora, prepare to experience the high-octane action and humor that define the series that includes Borderlands Game of the Year Enhanced, Borderlands 2, Borderlands 3 and Borderlands: The Pre-Sequel.

Members can explore it all before the highly anticipated Borderlands 4 arrives in the cloud at launch.

In addition, leap into the flames and save the day in the pulse-pounding FBC: Firebreak from Remedy Entertainment on GeForce NOW.

It’s all part of the 13 new games in the cloud this week, including the latest Genshin Impact update and advanced access for REMATCH.

Plus, GeForce NOW’s Summer Sale is still in full swing. For a limited time, get 40% off a six-month GeForce NOW Performance membership — perfect for diving into role-playing game favorites like the Borderlands series or any of the 2,200 titles in the platform’s cloud gaming library.

Vault Hunters Assemble

Gear up for a world where loot is king and chaos is always just a trigger pull away. The Borderlands series is known for its wild humor, outrageous characters and nonstop action — and now, its chaotic adventures can be streamed on GeForce NOW.

Borderlands GOTY on GeForce NOW
Welcome to Pandora.

Members revisiting the classics or jumping in for the first time can start with Borderlands Game of the Year Enhanced, the original mayhem-fueled classic now polished and packed with downloadable content. The title brings Pandora to life with a fresh coat of paint, crazy loot and the same iconic humor that started it all.

Borderlands 2 on GeForce NOW
New worlds, same chaos.

In Borderlands 2, Handsome Jack steals the show with his mix of charm and villainy. This sequel cranks up the fun and insanity with unforgettable characters and a zany storyline. For more laughs and even wilder chaos, Borderlands 3 delivers the biggest loot explosion yet, with new worlds to explore. Face off against the Calypso twins and enjoy nonstop action.

Borderlands 3 on GeForce NOW
The rise of Handsome Jack.

The adventure blasts off with Borderlands: The Pre-Sequel, revealing how Handsome Jack became so handsome. The game throws in zero gravity, moon boots and enough sarcasm to fuel a spaceship.

Jump in with GeForce NOW and get ready to laugh, loot and blast through Pandora, all from the cloud. With instant access and seamless streaming at up to 4K resolution with an Ultimate membership, enter the chaos of Borderlands anytime, anywhere. No downloads, no waiting.

Suit Up, Clean Up

FBC Firebreak on GeForce NOW
The Oldest House needs you.

Step into the shoes of the Federal Bureau of Control’s elite first responders in the highly anticipated three-player co-op first-person shooter FBC: Firebreak. Taking place six years after Control, the game is set in the Oldest House — under siege by reality-warping threats. It’s up to players to restore order before chaos wins.

Equip unique Crisis Kits packed with weapons, specialized tools and paranatural augments, like a garden gnome that summons a thunderstorm or a piggy bank that spews coins. As each mission, or “Job,” drops players into unpredictable environments with shifting objectives, bizarre crises and wacky enemies, teamwork and quick thinking are key.

Jump into the fray with friends and stream it on GeForce NOW instantly across devices. Experience the mind-bending action and stunning visuals powered by cloud streaming. Contain the chaos, save the Oldest House and enjoy a new kind of co-op adventure, all from the cloud.

No Rules Included

REMATCH on GeForce NOW
Score big laughs in the cloud.

REMATCH gives soccer a bold twist, transforming the classic sport into a fast-paced, third-person action experience where every player controls a single athlete on the field.

With no fouls, offsides or breaks, matches are nonstop and skills-based, demanding quick reflexes and seamless teamwork. Dynamic role-switching lets players jump between attack, defense and goalkeeping, while seasonal updates and various multiplayer modes keep the competition fresh and the action intense.

Where arcade flair meets tactical depth, REMATCH is football, unleashed. Get instant access to the soccer pitch by streaming the title on GeForce NOW and jump into the action wherever the match calls.

Time To Game

Genshin Impact V5.7 on GeForce NOW
Skirk has arrived.

Genshin Impact’s next major update launches this week, and members can stream the latest adventures from Teyvat at GeForce quality on any device. Version 5.7 includes the new playable characters Skirk and Dahlia — as well as fresh story quests and the launch of a Stygian Onslaught combat mode.

Look for the following games available to stream in the cloud this week:

  • REMATCH (New release on Steam, Xbox, available on PC Game Pass, June 16)
  • Broken Arrow (New release on Steam, June 19)
  • Crime Simulator (New release on Steam, June 17)
  • Date Everything! (New release on Steam, June 17)
  • FBC: Firebreak (New release on Steam, Xbox, available on PC Game Pass, June 17)
  • Lost in Random: The Eternal Die (New release on Steam, Xbox, available on PC Game Pass, June 17)
  • Architect Life: A House Design Simulator (New release on Steam, June 19)
  • Borderlands Game of the Year Enhanced (Steam)
  • Borderlands 2 (Steam, Epic Games Store)
  • Borderlands 3 (Steam, Epic Games Store)
  • Borderlands: The Pre-Sequel (Steam, Epic Games Store)
  • METAL EDEN Demo (Steam)
  • Torque Drift 2 (Epic Games Store)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

PyTorch Docathon 2025: Wrap Up

Huge congratulations and a massive thank you to all the amazing participants of the PyTorch Docathon 2025!

Over the past two weeks (June 3rd-15th), our virtual Docathon brought together more than 150 registrants who actively contributed to resolving long-standing documentation issues. We’re thrilled to announce that your efforts resulted in more than 60 merged pull requests across two PyTorch repositories!

We’d like to extend a special shout-out to our top contributors who went above and beyond during this event. Your dedication, expertise, and commitment to improving PyTorch documentation are truly inspiring. You’re the driving force behind open source projects like PyTorch, and we’re grateful for your contributions. 

First place: j-silv, kiszk, windsonsea

Second place: Rachel0619, jafraustro, loganthomas, nirajkamal, Dhia-naouali

Third place: Juliandlb, ggsmith842, ParagEkbote

PyTorch Docathon Top Community Contributors

Check out the full list of contributors here.

As we wrap up this Docathon, we encourage you to keep pushing the boundaries of what’s possible with PyTorch. Your collective efforts are revolutionizing the AI community, and we can’t wait to see what you achieve next.

Thank you again for being part of this incredible journey. Keep contributing, innovating, and inspiring others!

Team PyTorch

Read More

Meeting summarization and action item extraction with Amazon Nova

Meetings play a crucial role in decision-making, project coordination, and collaboration, and remote meetings are common across many organizations. However, capturing and structuring key takeaways from these conversations is often inefficient and inconsistent. Manually summarizing meetings or extracting action items requires significant effort and is prone to omissions or misinterpretations.

Large language models (LLMs) offer a more robust solution by transforming unstructured meeting transcripts into structured summaries and action items. This capability is especially useful for project management, customer support and sales calls, legal and compliance, and enterprise knowledge management.

In this post, we present a benchmark of different understanding models from the Amazon Nova family available on Amazon Bedrock, to provide insights on how you can choose the best model for a meeting summarization task.

LLMs to generate meeting insights

Modern LLMs are highly effective for summarization and action item extraction due to their ability to understand context, infer topic relationships, and generate structured outputs. In these use cases, prompt engineering provides a more efficient and scalable approach compared to traditional model fine-tuning or customization. Rather than modifying the underlying model architecture or training on large labeled datasets, prompt engineering uses carefully crafted input queries to guide the model’s behavior, directly influencing the output format and content. This method allows for rapid, domain-specific customization without the need for resource-intensive retraining processes. For tasks such as meeting summarization and action item extraction, prompt engineering enables precise control over the generated outputs, making sure they meet specific business requirements. It allows for the flexible adjustment of prompts to suit evolving use cases, making it an ideal solution for dynamic environments where model behaviors need to be quickly reoriented without the overhead of model fine-tuning.

Amazon Nova models and Amazon Bedrock

Amazon Nova models, unveiled at AWS re:Invent in December 2024, are built to deliver frontier intelligence at industry-leading price performance. They’re among the fastest and most cost-effective models in their respective intelligence tiers, and are optimized to power enterprise generative AI applications in a reliable, secure, and cost-effective manner.

The understanding model family has four tiers of models: Nova Micro (text-only, ultra-efficient for edge use), Nova Lite (multimodal, balanced for versatility), Nova Pro (multimodal, a balance of speed and intelligence, ideal for most enterprise needs), and Nova Premier (multimodal, the most capable Nova model for complex tasks and a teacher for model distillation). Amazon Nova models can be used for a variety of tasks, from summarization to structured text generation. With Amazon Bedrock Model Distillation, customers can also bring the intelligence of Nova Premier to a faster and more cost-effective model such as Nova Pro or Nova Lite for their use case or domain. This can be achieved through the Amazon Bedrock console and APIs such as the Converse API and Invoke API.
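For instance, a Nova model can be called through the Converse API with a few lines of boto3. The model ID, Region, and prompt below are illustrative placeholders.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="us.amazon.nova-lite-v1:0",  # illustrative; choose the Nova tier that fits your task
    system=[{"text": "You are an assistant that summarizes meeting transcripts."}],
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize the key decisions in this transcript:\n<transcript>placeholder transcript text</transcript>"}],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])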

Solution overview

This post demonstrates how to use Amazon Nova understanding models, available through Amazon Bedrock, for automated insight extraction using prompt engineering. We focus on two key outputs:

  • Meeting summarization – A high-level abstractive summary that distills key discussion points, decisions made, and critical updates from the meeting transcript
  • Action items – A structured list of actionable tasks derived from the meeting conversation that apply to the entire team or project

The following diagram illustrates the solution workflow.

Meeting summary and action item summarization pipeline

Prerequisites

To follow along with this post, familiarity with calling LLMs using Amazon Bedrock is expected. For detailed steps on using Amazon Bedrock for text summarization tasks, refer to Build an AI text summarizer app with Amazon Bedrock. For additional information about calling LLMs, refer to the Invoke API and Using the Converse API reference documentation.

Solution components

We developed the two core features of the solution—meeting summarization and action item extraction—by using popular models available through Amazon Bedrock. In the following sections, we look at the prompts that were used for these key tasks.

For the meeting summarization task, we used persona assignment, prompted the LLM to generate the summary inside <summary> tags (to reduce redundant opening and closing sentences), and used a one-shot approach, giving the LLM one example so that it consistently follows the right format for summary generation. As part of the system prompt, we give clear and concise rules emphasizing the correct tone, style, length, and faithfulness towards the provided transcript.

For the action item extraction task, we gave specific instructions on generating action items in the prompts and used chain-of-thought prompting to improve the quality of the generated action items. In the assistant message, the opening <action_items> tag is provided as a prefill to nudge the model generation in the right direction and to avoid redundant opening and closing sentences.
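As a hedged illustration (not the exact prompts used in this work), the message structure for the action item task might look like the following, with the assistant turn prefilled with the opening <action_items> tag so the model continues the list directly.

# Illustrative request structure for the Converse API; the system rules and
# user prompt are assumptions, not the prompts used in this benchmark.
system = [{
    "text": (
        "You are a meeting assistant. Think step by step about the decisions and "
        "commitments in the transcript, then list team-level action items inside "
        "<action_items> tags."
    )
}]

messages = [
    {
        "role": "user",
        "content": [{"text": "Extract the action items from this transcript:\n<transcript>placeholder transcript text</transcript>"}],
    },
    {
        # Prefilling the assistant turn with the opening tag nudges the model to
        # continue the list directly, without preamble sentences.
        "role": "assistant",
        "content": [{"text": "<action_items>"}],
    },
]

# These structures can be passed to the same bedrock.converse(...) call shown earlier.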

Different model families respond to the same prompts differently, and it’s important to follow the prompting guide defined for the particular model. For more information on best practices for Amazon Nova prompting, refer to Prompting best practices for Amazon Nova understanding models.

Dataset

To evaluate the solution, we used samples from the public QMSum dataset. The QMSum dataset is a benchmark for meeting summarization, featuring English-language transcripts from academic, business, and governance discussions with manually annotated summaries. It evaluates LLMs on generating structured, coherent summaries from complex, multi-speaker conversations, making it a valuable resource for abstractive summarization and discourse understanding. For testing, we used 30 randomly sampled meetings from the QMSum dataset. Each meeting contained 2–5 topic-wise transcripts, with approximately 8,600 tokens per transcript on average.

Evaluation framework

Achieving high-quality outputs from LLMs in meeting summarization and action item extraction can be a challenging task. Traditional evaluation metrics such as ROUGE, BLEU, and METEOR focus on surface-level similarity between generated text and reference summaries, but they often fail to capture nuances such as factual correctness, coherence, and actionability. Human evaluation is the gold standard but is expensive, time-consuming, and not scalable. To address these challenges, you can use LLM-as-a-judge, where another LLM is used to systematically assess the quality of generated outputs based on well-defined criteria. This approach offers a scalable and cost-effective way to automate evaluation while maintaining high accuracy. In this example, we used Anthropic’s Claude 3.5 Sonnet v1 as the judge model because we found it to be most aligned with human judgment. We used the LLM judge to score the generated responses on three main metrics: faithfulness, summarization, and question answering (QA).

The faithfulness score measures how faithful a generated summary is as the proportion of parsed statements in the summary that are supported by the given context (for example, a meeting transcript), relative to the total number of statements.

The summarization score is the combination of the QA score and the conciseness score with equal weight (0.5 each). The QA score measures the coverage of a generated summary of a meeting transcript: it first generates a list of question and answer pairs from the meeting transcript and then measures the portion of those questions that are answered correctly when the summary, rather than the transcript, is used as context. The QA score is complementary to the faithfulness score, because the faithfulness score doesn’t measure the coverage of a generated summary. We used the QA score only for the generated summaries, not for the action items, because action items aren’t supposed to cover all aspects of a meeting transcript. The conciseness score measures the ratio of the length of the generated summary to the length of the full meeting transcript.
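To make the weighting concrete, the following minimal sketch transcribes the score definitions above; the statement parsing and question answering themselves are performed by the judge LLM and aren’t shown here.

def faithfulness_score(supported_statements: int, total_statements: int) -> float:
    # Portion of parsed summary statements supported by the meeting transcript.
    return supported_statements / total_statements if total_statements else 0.0

def conciseness_score(summary: str, transcript: str) -> float:
    # Length ratio of the generated summary to the full transcript, as described above.
    return len(summary) / max(len(transcript), 1)

def summarization_score(qa_score: float, conciseness: float) -> float:
    # Equal-weight (0.5 / 0.5) combination of the QA score and the conciseness score.
    return 0.5 * qa_score + 0.5 * conciseness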

We used a modified version of the faithfulness score and the summarization score that had much lower latency than the original implementation.

Results

Our evaluation of Amazon Nova models across meeting summarization and action item extraction tasks revealed clear performance-latency patterns. For summarization, Nova Premier achieved the highest faithfulness score (1.0) with a processing time of 5.34s, while Nova Pro delivered 0.94 faithfulness in 2.9s. The smaller Nova Lite and Nova Micro models provided faithfulness scores of 0.86 and 0.83 respectively, with faster processing times of 2.13s and 1.52s. In action item extraction, Nova Premier again led in faithfulness (0.83) with 4.94s processing time, followed by Nova Pro (0.8 faithfulness, 2.03s). Interestingly, Nova Micro (0.7 faithfulness, 1.43s) outperformed Nova Lite (0.63 faithfulness, 1.53s) in this particular task despite its smaller size. These measurements provide valuable insights into the performance-speed characteristics across the Amazon Nova model family for text-processing applications. The following graphs show these results. The following screenshot shows a sample output for our summarization task, including the LLM-generated meeting summary and a list of action items.

results on meeting summary

faithfulness score on action item summarization

example of meeting and action item summarization

Conclusion

In this post, we showed how you can use prompting to generate meeting insights such as meeting summaries and action items using Amazon Nova models available through Amazon Bedrock. For large-scale AI-driven meeting summarization, optimizing latency, cost, and accuracy is essential. The Amazon Nova family of understanding models (Nova Micro, Nova Lite, Nova Pro, and Nova Premier) offers a practical alternative to high-end models, significantly improving inference speed while reducing operational costs. These factors make Amazon Nova an attractive choice for enterprises handling large volumes of meeting data at scale.

For more information on Amazon Bedrock and the latest Amazon Nova models, refer to the Amazon Bedrock User Guide and Amazon Nova User Guide, respectively. The AWS Generative AI Innovation Center has a group of AWS science and strategy experts with comprehensive expertise spanning the generative AI journey, helping customers prioritize use cases, build a roadmap, and move solutions into production. Check out the Generative AI Innovation Center for our latest work and customer success stories.


About the Authors

Baishali Chaudhury is an Applied Scientist at the Generative AI Innovation Center at AWS, where she focuses on advancing generative AI solutions for real-world applications. She has a strong background in computer vision, machine learning, and AI for healthcare. Baishali holds a PhD in Computer Science from the University of South Florida and completed a postdoc at Moffitt Cancer Center.

Sungmin Hong is a Senior Applied Scientist at the AWS Generative AI Innovation Center, where he helps expedite a variety of use cases for AWS customers. Before joining Amazon, Sungmin was a postdoctoral research fellow at Harvard Medical School. He holds a Ph.D. in Computer Science from New York University. Outside of work, he prides himself on keeping his indoor plants alive for 3+ years.

Mengdie (Flora) Wang is a Data Scientist at AWS Generative AI Innovation Center, where she works with customers to architect and implement scalable Generative AI solutions that address their unique business challenges. She specializes in model customization techniques and agent-based AI systems, helping organizations harness the full potential of generative AI technology. Prior to AWS, Flora earned her Master’s degree in Computer Science from the University of Minnesota, where she developed her expertise in machine learning and artificial intelligence.

Anila Joshi has more than a decade of experience building AI solutions. As an AWSI Geo Leader at the AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of what’s possible and accelerates the adoption of AWS services by helping customers ideate, identify, and implement secure generative AI solutions.

Read More