📣 Submit to Speak at PyTorch Conference + Save on Registration

Step into the Future of AI at PyTorch Conference 2025.

The Call for Proposals for PyTorch Conference 2025 is officially open!

Join us in San Francisco from October 22–23, 2025, to showcase your expertise and innovations with PyTorch—the industry-leading, open-source machine learning framework powering advances from bare-metal infrastructure to sophisticated application and agent layers. This is your opportunity to share insights, breakthroughs, and case studies with a global audience of AI and Generative AI practitioners, researchers, and developers.

Submit your proposals and prepare to engage, learn, and network alongside some of the brightest minds in the AI/ML community. We’re seeking sessions, Birds of a Feather discussions, lightning talks, and poster sessions on the following topics:

  • Core PyTorch Framework
  • PyTorch on Accelerator Hardware
  • PyTorch Ecosystem and Tools
  • AI Applications and Use Cases
  • AI in Research and Academia
  • AI in Industry and Enterprise Applications
  • AI Infrastructure and Scalability
  • Ethical AI, Governance, and Regulation
  • Training, Fine-Tuning, and Alignment
  • Inference, Deployment, and Serving
  • Performance Measurement and Benchmarking
  • Data Engineering and Management for AI
  • Generative AI and Large Language Models (LLMs)
  • Model Optimization and Efficiency
  • Open Source Collaboration, Education and Community Building
  • Edge AI and On-Device
  • DL Compilers and Kernel Authoring

Learn more and submit your talk by Sunday, June 1, at 11:59 PM PDT!


SUBMIT TO SPEAK


Save up to USD$500 with Super Early Bird Pricing!

  • Reserve your pass by 11:59 PM PDT on March 21 and score Super Early Bird pricing for just USD$499. That’s a savings of up to USD$500!
  • Student or faculty? Learn more about our discounted academic rate.
  • Need help covering travel costs? We offer discretionary travel funding for those community members who would otherwise not be able to attend. Learn more.

Become a Sponsor at PyTorch Conference 2025!

Seize your opportunity to influence the future of Generative AI and Machine Learning by sponsoring PyTorch Conference 2025. PyTorch is at the forefront of innovation—empowering rapid experimentation, flexible model development, and efficient deployment into production environments with its powerful, versatile ecosystem of tools and thriving community of dedicated users.

As a sponsor, you’ll gain more than visibility; you’ll strategically position your organization at the heart of a vibrant, global AI/ML ecosystem. Connect directly with 3,000+ expert attendees, researchers, engineers, and decision-makers, and actively shape the conversations driving the next generation of AI advancements.

For more details on CFP submissions, registration, and sponsorship, visit the PyTorch Conference Website.

Read More

Customize DeepSeek-R1 distilled models using Amazon SageMaker HyperPod recipes – Part 1

Increasingly, organizations across industries are turning to generative AI foundation models (FMs) to enhance their applications. To achieve optimal performance for specific use cases, customers are adopting and adapting these FMs to their unique domain requirements. This need for customization has become even more pronounced with the emergence of new models, such as those released by DeepSeek.

However, customizing DeepSeek models effectively while managing computational resources remains a significant challenge. Tuning the model architecture, selecting training and fine-tuning parameters, and managing distributed training infrastructure all demand deep technical expertise. This often forces companies to choose between model performance and practical implementation constraints, creating a critical need for more accessible and streamlined model customization solutions.

In this two-part series, we discuss how you can reduce the DeepSeek model customization complexity by using the pre-built fine-tuning workflows (also called “recipes”) for both the DeepSeek-R1 model and its distilled variations, released as part of Amazon SageMaker HyperPod recipes.

In this first post, we build a solution architecture for fine-tuning DeepSeek-R1 distilled models and demonstrate the approach through a step-by-step example of customizing the DeepSeek-R1 Distill Qwen 7B model using recipes, achieving an average improvement of 25% across all ROUGE scores, with a maximum of 49% on the ROUGE-2 score, using both SageMaker HyperPod and SageMaker training jobs. The second part of the series will focus on fine-tuning the DeepSeek-R1 671B model itself.

At the time of this writing, the DeepSeek-R1 model and its distilled variations for Llama and Qwen were the latest released recipes. Check out sagemaker-hyperpod-recipes on GitHub for the latest released recipes, including support for fine-tuning the DeepSeek-R1 671B parameter model.

Amazon SageMaker HyperPod recipes

At re:Invent 2024, we announced the general availability of Amazon SageMaker HyperPod recipes. SageMaker HyperPod recipes help data scientists and developers of all skill sets to get started training and fine-tuning popular publicly available generative AI models in minutes with state-of-the-art training performance. These recipes include a training stack validated by Amazon Web Services (AWS), which removes the tedious work of experimenting with different model configurations, minimizing the time it takes for iterative evaluation and testing. They automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

Recipes, paired with resilient AWS infrastructure (Amazon SageMaker HyperPod and Amazon SageMaker Model Training), provide a robust training environment for fine-tuning FMs such as DeepSeek-R1 with out-of-the-box customization.

To help customers quickly use DeepSeek’s powerful and cost-efficient models to accelerate generative AI innovation, we released new recipes to fine-tune six DeepSeek models, including DeepSeek-R1 distilled Llama and Qwen models, using the supervised fine-tuning (SFT), Quantized Low-Rank Adaptation (QLoRA), and Low-Rank Adaptation (LoRA) techniques. In this post, we introduce these new recipes and walk you through a solution to fine-tune a DeepSeek Qwen 7B model for an advanced medical reasoning use case.

Solution overview

At its core, as depicted in the following diagram, the recipe architecture implements a hierarchical workflow that begins with a recipe specification: a comprehensive configuration defining the training parameters, model architecture, and distributed training strategies. These recipes are processed through the HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture. The launcher interfaces with underlying cluster management systems such as SageMaker HyperPod (Slurm or Kubernetes) or SageMaker training jobs, which handle resource allocation and scheduling. It’s a familiar NeMo-style launcher with which you can choose a recipe and run it on your infrastructure of choice (SageMaker HyperPod or training jobs).

For example, after choosing your recipe, you can pre-train or fine-tune a model by running python3 main.py recipes=recipe-name. Alternatively, you can use a launcher script, which is a bash script that is preconfigured to run the chosen training or fine-tuning job on your cluster. You can check out main.py (NeMo style launcher) and launcher scripts for DeepSeek on the GitHub repository hosting SageMaker HyperPod recipes.

A key component of this architecture is the HyperPod training adapter for NeMo, which is built on the NVIDIA NeMo framework and the Neuronx Distributed training package. The adapter loads data, creates models, and facilitates efficient data parallelism, model parallelism, and hybrid parallelism strategies, enabling optimal utilization of computational resources across the distributed infrastructure. The architecture’s modular design allows for scalability and flexibility, making it particularly effective for training LLMs that require distributed computing capabilities.

You can run these recipes using SageMaker HyperPod or as SageMaker training jobs. For organizations that require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more details about these service features, refer to Generative AI foundation model training on Amazon SageMaker.

In the next sections, we go over the solution architecture for these services before presenting a step-by-step implementation example for each.

SageMaker HyperPod

To submit jobs using SageMaker HyperPod, you can use the HyperPod recipes launcher, which provides a straightforward mechanism to run recipes on both Slurm and Kubernetes. After you choose your orchestrator, you can choose your recipe’s launcher and have it run on your HyperPod cluster. The launcher interfaces with your cluster through Slurm or Kubernetes native constructs. For this post, we use the HyperPod recipes launcher mechanism to run the training on a Slurm cluster. The following image shows the solution architecture for SageMaker HyperPod.

SageMaker training jobs

The workflow for SageMaker training jobs begins with an API request that interfaces with the SageMaker control plane, which manages the orchestration of training resources. The system uses the training jobs launcher to efficiently run workloads on a managed cluster.

The architecture uses Amazon Elastic Container Registry (Amazon ECR) for container image management. Training jobs are executed across a distributed cluster, with seamless integration to multiple storage solutions, including Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. All of this runs under the SageMaker managed environment, providing optimal resource utilization and security.

This design simplifies the complexity of distributed training while maintaining the flexibility needed for diverse machine learning (ML) workloads, making it an ideal solution for enterprise AI development. The following image shows the solution architecture for SageMaker training jobs.

Solution walkthrough

For this solution, consider a use case for a healthcare industry startup that aims to create an accurate, medically verified chat assistant application that bridges complex medical information with patient-friendly explanations. By fine-tuning DeepSeek-R1 Distill Qwen 7B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset, you can use its medical reasoning capabilities to produce content that maintains clinical accuracy.

Prerequisites

You need to complete the following prerequisites before you can run the DeepSeek-R1 Distill Qwen 7B model fine-tuning notebook.

  1. Make the following quota increase requests for SageMaker. You need to request a minimum of one p4d.24xlarge instance (with 8 NVIDIA A100 GPUs) and a maximum of two p4d.24xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

    • P4 instances (p4d.24xlarge) for training job usage: 1–2
    • P4 instances (p4d.24xlarge) for HyperPod clusters ("ml.p4d.24xlarge for cluster usage"): 1–2
  2. If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster following the documentation at Tutorial for getting started with SageMaker HyperPod. Alternatively, you can use the AWS CloudFormation template provided in the AWS Workshop Studio at Amazon SageMaker HyperPod Own Account and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
  3. (Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
  4. Clone the GitHub repository with the assets for this deployment. This repository consists of a notebook that references training assets:
git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git 
cd 18_sagemaker_training_recipes/ft_deepseek_qwen_lora

Next, we run the model_trainer_deepseek_r1_recipe_lora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.

Prepare the dataset

To prepare the dataset, you need to load the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenize and chunk the dataset, and configure the data channels for SageMaker training on Amazon S3. Complete the following steps:

  1. Format the dataset by applying the prompt format for DeepSeek-R1 Distill Qwen 7B:
def generate_prompt(data_point):
    full_prompt = f"""
    Below is an instruction that describes a task, paired with an input that provides further context.
    Write a response that appropriately completes the request.
    Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

    ### Instruction:
    You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
    Please answer the following medical question.

    ### Question:
    {data_point["Question"]}

    ### Response:
    {data_point["Complex_CoT"]}

    """
    return {"prompt": full_prompt.strip()}
  2. Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:
# Load dataset from the hub
train_set = load_dataset(dataset_name, 'en', split="train[5%:]")
test_set = load_dataset(dataset_name, 'en', split="train[:5%]")

...

train_dataset = train_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

test_dataset = test_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)
  3. Load the DeepSeek-R1 Distill Qwen 7B tokenizer from the Hugging Face Transformers library and generate tokens for the train and validation datasets (a minimal sketch of the tokenize helper used here appears after this list):
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
max_seq_length=1024

# Initialize a tokenizer by loading a pre-trained tokenizer configuration, using the fast tokenizer implementation if available.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

...

train_dataset = train_dataset.map(tokenize, remove_columns=["prompt"])
test_dataset = test_dataset.map(tokenize, remove_columns=["prompt"])
  4. Prepare the training and validation datasets for SageMaker training by saving them as Arrow files, which is required by SageMaker HyperPod recipes, and constructing the S3 paths where these files will be uploaded:
train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test"

train_dataset.save_to_disk(train_dataset_s3_path)
val_dataset.save_to_disk(val_dataset_s3_path)

The prepared dataset will be used in the examples for both SageMaker training jobs and SageMaker HyperPod.
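
The tokenize helper used in the map calls above is elided in this excerpt. A minimal, hypothetical sketch (its truncation settings are assumptions rather than values taken from the notebook) could look like the following, reusing the tokenizer and max_seq_length defined earlier:

# Hypothetical sketch of the elided tokenize helper; settings are assumptions
def tokenize(example):
    # Convert the formatted prompt into token IDs, capped at the configured sequence length
    return tokenizer(
        example["prompt"],
        truncation=True,
        max_length=max_seq_length,
    )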

Option A: Fine-tune using SageMaker training jobs

To fine-tune the model using SageMaker training jobs with recipes, this example uses the ModelTrainer class.

The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances the user experience and supports distributed training, Build Your Own Container (BYOC), and recipes. For additional information about ModelTrainer, refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer.

To set up the fine-tuning workload, complete the following steps:

  1. Select the instance type and the container image for the training job, and define the checkpoint path where the trained model will be stored:
instance_type = "ml.p4d.24xlarge"

image_uri = (
    f"658645717510.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)

checkpoint_s3_path = f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints"
  2. Create the ModelTrainer function to encapsulate the training setup from a selected recipe:
from sagemaker.modules.configs import CheckpointConfig, Compute, InputData, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

instance_count = 1

# Working override for custom dataset
recipe_overrides = {
    ...
    "trainer": {
        "num_nodes": instance_count,
        ...
    },
    ...
    "use_smp_model": False, # Required for PEFT
    "model": {
        "hf_model_name_or_path": model_id,
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/test",
        },
    },
}

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

model_trainer = ModelTrainer.from_recipe(
    training_image=image_uri,
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora",
    recipe_overrides=recipe_overrides,
    requirements="./requirements.txt",
    compute=compute_configs,
    ...
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}"
    ),
)

You can point to the specific recipe with the training_recipe argument and override the recipe arguments by providing a dictionary as the recipe_overrides argument. In the previous example:

  • num_nodes: Indicates the number of instances that will be used for the fine-tuning execution
  • checkpoint_dir: Location in the container where the job will save model checkpoints

The ModelTrainer class simplifies the experience by encapsulating code and training setup directly from the selected recipe. In this example:

  • training_recipe: hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora defines the fine-tuning setup for the LoRA technique
  3. Set up the input channels for ModelTrainer by creating InputData objects from the S3 bucket paths for the training and validation (test) datasets, as sketched after this list.
  4. Submit the training job:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)
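
For reference, the input channels mentioned in step 3 could be constructed as in the following minimal sketch; the channel names are assumptions chosen to match the train_dir and val_dir paths set in recipe_overrides:

# Minimal sketch of step 3 (channel names are assumptions matching the recipe's
# train_dir /opt/ml/input/data/train and val_dir /opt/ml/input/data/test settings)
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path,
)

test_input = InputData(
    channel_name="test",
    data_source=val_dataset_s3_path,
)

data = [train_input, test_input]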

Option B: Fine-tune using SageMaker HyperPod with Slurm

To fine-tune the model using HyperPod, make sure your cluster is up and ready by following the prerequisites. To access the login or head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at Log in to your cluster in the Amazon SageMaker HyperPod workshop.

Alternatively, you can also use AWS Systems Manager and run a command like the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.

aws ssm start-session --target sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id] --region region_name
  1. In the cluster’s login or head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID to access the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.
# create a virtual environment 
python3 -m venv ${PWD}/venv
source venv/bin/activate

# clone the recipes repository and set up the environment
git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
  2. Create a squash file using Enroot to run the job on the cluster. Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with high performance computing (HPC) environments, making it ideal for running our workflows securely.
# create a squash file using Enroot
REGION=<region>
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  3. After you’ve created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:
...

cluster_type: slurm 
...

instance_type: p4d.24xlarge
...

container: /fsx/<path-to-smdistributed-modelparallel>.sqsh
...
  4. Download the prepared dataset that you uploaded to S3 into the FSx for Lustre volume attached to the cluster. Run the following commands to download the files from Amazon S3:
aws s3 cp s3://{bucket_name}/{input_path}/train /fsx/ubuntu/deepseek/data/train --recursive
aws s3 cp s3://{bucket_name}/{input_path}/test /fsx/ubuntu/deepseek/data/test --recursive
  5. Update the launcher script for fine-tuning the DeepSeek-R1 Distill Qwen 7B model. The launcher scripts serve as convenient wrappers for executing the training script (main.py), which streamlines the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek-R1 Qwen 7B model, you can find the specific script at:
launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh
  6. Before running the script, you need to modify the location of the training and validation files and update the Hugging Face model ID and, optionally, the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you’re using a multi-node cluster):
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="hf_xxxx" # Optional HuggingFace access token

TRAIN_DIR="/fsx/ubuntu/deepseek/data/train" # Location of training dataset 
VAL_DIR="/fsx/ubuntu/deepseek/data/test" # Location of validation dataset

EXP_DIR="/fsx/ubuntu/deepseek/results" # Location to save experiment info including logging, checkpoints, etc

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-deepseek-r1-distilled-qwen-7b-fine-tuning" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN"

You can view the recipe for this fine-tuning task at the following path, overriding any additional parameters as needed:

recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning.yaml
  7. Submit the job by running the launcher script:
bash launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh

You can monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs. After the job is complete, the trained model will also be available in the results folder, as shown in the following code:

cd results
 ls -R
.:
checkpoints  experiment

./checkpoints:
full

./checkpoints/full:
steps_50

./checkpoints/full/steps_50:
config.json  pytorch_model.bin

./experiment:
...
  8. Upload the fine-tuned model checkpoint to Amazon S3 for evaluating the model using the validation data:
aws s3 cp /fsx/<path_to_checkpoint> s3://{bucket_name}/{model_prefix}/qwen7b --recursive

Evaluate the fine-tuned model

To objectively evaluate your fine-tuned model, you can run an evaluation job on the validation portion of the dataset.

You can run a SageMaker training job and use ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text. The SageMaker training job will compute ROUGE metrics for both the base DeepSeek-R1 Distill Qwen 7B model and the fine-tuned one. You can access the code sample for ROUGE evaluation in the sagemaker-distributed-training-workshop on GitHub. Refer to this notebook for details.
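
As a minimal illustration of how ROUGE works (this is not the evaluation script from the workshop), you could score a generated answer against a reference answer with the Hugging Face evaluate library; the example strings below are placeholders:

import evaluate

# Load the ROUGE metric and compare a placeholder prediction with a placeholder reference
rouge = evaluate.load("rouge")
predictions = ["The patient most likely has iron deficiency anemia."]
references = ["The patient likely suffers from iron deficiency anemia."]
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # returns rouge1, rouge2, rougeL, and rougeLsum scores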

Complete the following steps:

  1. Define the S3 path where the fine-tuned checkpoints are stored, the instance_type, and the image URI to use in the training job:
trained_model = <S3_PATH>
instance_type = "ml.p4d.24xlarge"

image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.4",
    instance_type=instance_type,
    image_scope="training"
)
#763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311
  2. Create the ModelTrainer function to encapsulate the evaluation script and define the input data:
from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="evaluate_recipe.py",
)

# Define the compute
...

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    compute=compute_configs,
    ...
    hyperparameters={
        "model_id": model_id,  # Hugging Face model id
        "dataset_name": dataset_name
    }
)

# Pass the input data
train_input = InputData(
   channel_name="adapterdir",
   data_source=trained_model,
)

test_input = InputData(
   channel_name="testdata",
   data_source=test_dataset_s3_path, # S3 path where training data is stored
)

# Check input channels configured
data = [train_input, test_input]
  3. Submit the training job:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

The following table shows the task output for the fine-tuned model and the base model.

Model         ROUGE-1   ROUGE-2   ROUGE-L   ROUGE-L-Sum
Base          0.36362   0.08739   0.16345   0.3204
Fine-tuned    0.44232   0.13022   0.17769   0.38989
% Difference  21.64207  49.01703  8.7121    21.68871

Our fine-tuned model demonstrates remarkable efficiency, achieving about a 22% overall improvement on the reasoning task after only one training epoch. The most significant gain appears in the ROUGE-2 score—which measures bigram overlap—with an increase of about 49%, indicating better alignment between generated and reference summaries.

Notably, preliminary experiments suggest these results could be further enhanced by extending the training duration. Increasing the number of epochs shows promising potential for additional performance gains while maintaining computational efficiency.
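
For clarity, the percentage differences in the table are relative improvements of the fine-tuned model over the base model, as in this small check on the ROUGE-1 column (tiny deviations from the tabled value come from rounding of the reported scores):

# Relative improvement of the fine-tuned model over the base model (ROUGE-1 column)
base_rouge1 = 0.36362
fine_tuned_rouge1 = 0.44232
improvement_pct = (fine_tuned_rouge1 - base_rouge1) / base_rouge1 * 100
print(f"ROUGE-1 improvement: {improvement_pct:.2f}%")  # ~21.64%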

Clean up

To clean up your resources and avoid incurring additional charges, follow these steps:

  1. Delete any unused SageMaker Studio resources
  2. (Optional) Delete the SageMaker Studio domain
  3. Verify that your training job isn’t running anymore. To do so, on your SageMaker console, choose Training and check Training jobs.
  4. If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.

Conclusion

In the first post of this two-part DeepSeek-R1 series, we discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs) including DeepSeek. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes.

We recommend starting your LLM customization journey by exploring our sample recipes in the Amazon SageMaker HyperPod documentation. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.

To begin using the SageMaker HyperPod recipes, visit the sagemaker-hyperpod-recipes repo on GitHub for comprehensive documentation and example implementations. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.

In our second post, we discuss how these recipes could further be used to fine-tune the DeepSeek-R1 671B model. Stay tuned!


About the Authors

 Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them enhance the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.

 Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and Machine Learning solutions that take full advantage of the AWS cloud and Amazon Machine Learning stack. His expertise includes end-to-end machine learning, model customization, and generative AI. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

 Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

 Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.

 Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.

Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker AI Training team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.

Read More

Survey Shows How AI Is Reshaping Healthcare and Life Sciences, From Lab to Bedside

From research and discovery to patient care and administrative tasks, AI is showing transformative potential across nearly every part of healthcare and life sciences.

For example, generative AI can be used to help automate repetitive, time-consuming tasks such as summarizing and creating documents and extracting and analyzing data from reports. It can also aid in drug discovery by finding new protein structures and offer assistance to patients through chatbots and AI assistants, easing the burden on clinical and administrative staff.

This wide range of applications was among key insights of NVIDIA’s inaugural “State of AI in Healthcare and Life Sciences” survey.

The survey — which polled more than 600 professionals across the globe from fields spanning digital healthcare, medical tools and technologies, pharmaceutical and biotech, and payers and practitioners — revealed robust AI adoption in the industry, with about two-thirds of respondents saying their companies are actively using the technology.

AI is also having a tangible impact on the industry’s bottom line, with 81% of respondents saying AI has helped increase revenue, and 45% realizing these benefits in less than a year after implementation.

Here are some of the key insights and use cases from the survey:

  • 83% of overall respondents agreed with the statement that “AI will revolutionize healthcare and life sciences in the next three to five years”
  • 73% said AI is helping to reduce operational costs
  • 58% cited data analytics as the top AI workload, with generative AI second at 54%, and large language models third at 53%
  • 59% of respondents from pharmaceutical and biotech companies cited drug discovery and development among their top AI use cases

Business Impact of AI in Healthcare and Life Sciences

The healthcare and life sciences industry is seeing how AI can help increase annual revenue and reduce operational costs. Forty-one percent of respondents indicated that the acceleration of research and development has had a positive impact. Thirty-six percent of respondents said AI has helped create a competitive advantage. And 35% said it has helped reduce project cycles, deliver better clinical or research insights, and enhance precision and accuracy.

Given the positive results across a broad range of AI use cases, it comes as no surprise that 78% of respondents said they intend to increase their budget for AI infrastructure this year. In addition, more than a third of respondents noted their investments in AI will increase by more than 10%.

The survey also revealed the top three spending priorities: identifying additional AI use cases (47%), optimizing workflow and production cycles (34%) and hiring more AI experts (26%).

AI Applied Across Healthcare

Each industry segment in the survey had differing priorities in AI implementation. For instance, in the payers and providers industry segment, which includes health insurance companies, hospitals, clinical services and home healthcare, 48% of respondents said their top AI use case was administrative tasks and workflow optimization.

For the medical tools and technologies field, 71% of respondents said their top AI use case was medical imaging and diagnostics, such as using AI to analyze MRI or CAT scans. And for digital healthcare, 54% of respondents said their top use case was clinical decision support, while 54% from the pharmaceutical and biotech fields prioritized drug discovery and development.

AI use cases expected to have the most significant impact in healthcare and life sciences in the next five years include advanced medical imaging and diagnostics (51%), virtual healthcare assistants (34%) and precision medicine, or treatment tailored to individual patient characteristics (29%).

A Growing Dose of Generative AI

Overall, 54% of survey respondents said they’re using generative AI. Of these users, 63% said they’re actively using it, with another 36% assessing the technology through pilots or trials.

Digital healthcare was the leader in generative AI use, according to 71% of respondents from the field. Second was pharmaceutical and biotech at 69%, then medical technologies at 60%, and payers and providers at 44%.

Among all generative AI use cases, coding and document summarization — specific to clinical notes — was the top use case, at 55%. Medical chatbots and AI agents were second, at 53%, and literature analysis was third, at 45%. One notable exception was within the pharmaceutical and biotech industry segment, in which respondents stated that drug discovery was the top generative AI use case, at 62%.

Download the “State of AI in Healthcare and Life Sciences: 2025 Trends” report for in-depth results and insights.

Explore NVIDIA’s AI technologies and platforms for healthcare, and sign up for NVIDIA’s healthcare newsletter to stay up to date.

Read More

Reduce conversational AI response time through inference at the edge with AWS Local Zones

Recent advances in generative AI have led to the proliferation of a new generation of conversational AI assistants powered by foundation models (FMs). These latency-sensitive applications enable real-time text and voice interactions, responding naturally to human conversations. Their applications span a variety of sectors, including customer service, healthcare, education, personal and business productivity, and many others.

Conversational AI assistants are typically deployed directly on users’ devices, such as smartphones, tablets, or desktop computers, enabling quick, local processing of voice or text input. However, the FM that powers the assistant’s natural language understanding and response generation is usually cloud-hosted, running on powerful GPUs. When a user interacts with the AI assistant, their device first processes the input locally, including speech-to-text (STT) conversion for voice agents, and compiles a prompt. This prompt is then securely transmitted to the cloud-based FM over the network. The FM analyzes the prompt and begins generating an appropriate response, streaming it back to the user’s device. The device further processes this response, including text-to-speech (TTS) conversion for voice agents, before presenting it to the user. This efficient workflow strikes a balance between the powerful capabilities of cloud-based FMs and the convenience and responsiveness of local device interaction, as illustrated in the following figure.

Request flow for a conversational AI assistant

A critical challenge in developing such applications is reducing response latency to enable real-time, natural interactions. Response latency refers to the time between the user finishing their speech and beginning to hear the AI assistant’s response. This delay typically comprises two primary components:

  • On-device processing latency – This encompasses the time required for local processing, including TTS and STT operations.
  • Time to first token (TTFT) – This measures the interval between the device sending a prompt to the cloud and receiving the first token of the response. TTFT consists of two components. First is the network latency, which is the round-trip time for data transmission between the device and the cloud. Second is the first token generation time, which is the period between the FM receiving a complete prompt and generating the first output token. TTFT is crucial for user experience in conversational AI interfaces that use response streaming with FMs. With response streaming, users start receiving the response while it’s still being generated, significantly improving perceived latency.

The ideal response latency for humanlike conversation flow is generally considered to be in the 200–500 milliseconds (ms) range, closely mimicking natural pauses in human conversation. Given the additional on-device processing latency, achieving this target requires a TTFT well below 200 ms.

Although many customers focus on optimizing the technology stack behind the FM inference endpoint through techniques such as model optimization, hardware acceleration, and semantic caching to reduce the TTFT, they often overlook the significant impact of network latency. This latency can vary considerably due to geographic distance between users and cloud services, as well as the diverse quality of internet connectivity.

Hybrid architecture with AWS Local Zones

To minimize the impact of network latency on TTFT for users regardless of their locations, a hybrid architecture can be implemented by extending AWS services from commercial Regions to edge locations closer to end users. This approach involves deploying additional inference endpoints on AWS edge services and using Amazon Route 53 to implement dynamic routing policies, such as geolocation routing, geoproximity routing, or latency-based routing. These strategies dynamically distribute traffic between edge locations and commercial Regions, providing fast response times based on real-time network conditions and user locations.

AWS Local Zones are a type of edge infrastructure deployment that places select AWS services close to large population and industry centers. They enable applications requiring very low latency or local data processing using familiar APIs and tool sets. Each Local Zone is a logical extension of a corresponding parent AWS Region, which means customers can extend their Amazon Virtual Private Cloud (Amazon VPC) connections by creating a new subnet with a Local Zone assignment.
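
For example, after the Local Zone has been opted in for your account, a subnet can be created in it with a call such as the following sketch (the VPC ID and CIDR block are placeholders):

import boto3

# Create a subnet in the Los Angeles Local Zone; VPC ID and CIDR block are placeholders
ec2 = boto3.client("ec2", region_name="us-west-2")  # parent Region of the Local Zone

response = ec2.create_subnet(
    VpcId="vpc-0123456789abcdef0",
    CidrBlock="10.0.128.0/24",
    AvailabilityZone="us-west-2-lax-1a",  # Los Angeles Local Zone
)
print(response["Subnet"]["SubnetId"])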

This guide demonstrates how to deploy an open source FM from Hugging Face on Amazon Elastic Compute Cloud (Amazon EC2) instances across three locations: a commercial AWS Region and two AWS Local Zones. Through comparative benchmarking tests, we illustrate how deploying FMs in Local Zones closer to end users can significantly reduce latency—a critical factor for real-time applications such as conversational AI assistants.

Prerequisites

To run this demo, complete the following prerequisites:

Solution walkthrough

This section walks you through the steps to launch an Amazon EC2 G4dn instance and deploy an FM for inference in the Los Angeles Local Zone. The instructions are also applicable for deployments in the parent Region, US West (Oregon), and the Honolulu Local Zone.

We use Meta’s open source Llama 3.2-3B as the FM for this demonstration. This is a lightweight FM from the Llama 3.2 family, classified as a small language model (SLM) due to its small number of parameters. Compared to large language models (LLMs), SLMs are more efficient and cost-effective to train and deploy, excel when fine-tuned for specific tasks, offer faster inference times, and have lower resource requirements. These characteristics make SLMs particularly well-suited for deployment on edge services such as AWS Local Zones.

To launch an EC2 instance in the Los Angeles Local Zone subnet, follow these steps:

  1. On the Amazon EC2 console dashboard, in the Launch instance box, choose Launch instance.
  2. Under Name and tags, enter a descriptive name for the instance (for example, la-local-zone-instance).
  3. Under Application and OS Images (Amazon Machine Image), select an AWS Deep Learning AMI that comes preconfigured with NVIDIA OSS driver and PyTorch. For our deployment, we used Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Amazon Linux 2).
  4. Under Instance type, from the Instance type list, select the hardware configuration for your instance that’s supported in a Local Zone. We selected G4dn.2xlarge for this solution. This instance is equipped with one NVIDIA T4 Tensor Core GPU and 16 GB of GPU memory, which makes it ideal for high performance and cost-effective inference of SLMs on the edge. Available instance types for each Local Zone can be found at AWS Local Zones features. Review the hardware requirements for your FM to select the appropriate instance.
  5. Under Key pair (login), choose an existing key pair or create a new one.
  6. Next to Network settings, choose Edit, and then:
    1. Select your VPC.
    2. Select your Local Zone subnet.
    3. Create a security group or select an existing one. Configure the security group’s inbound rules to allow traffic only from your client’s IP address on port 8080.
  7. You can keep the default selections for the other configuration settings for your instance. To determine the storage types that are supported, refer to the Compute and storage section in AWS Local Zones features.
  8. Review the summary of your instance configuration in the Summary panel and, when you’re ready, choose Launch instance.
  9. A confirmation page lets you know that your instance is launching. Choose View all instances to close the confirmation page and return to the console.

Next, complete the following steps to deploy Llama 3.2-3B using the Hugging Face Text Generation Inference (TGI) as the model server:

  1. Connect to the instance by using Secure Shell (SSH).
  2. Start the Docker service using the following command. Docker comes preinstalled with the AMI we selected.
sudo service docker start
  3. Run the following command to download and run the Docker image for the TGI server as well as the Llama 3.2-3B model. In our deployment, we used Docker image version 2.4.0, but results might vary based on your selected version. The full list of models supported by TGI can be found at Hugging Face Supported Models. For more details about the deployment and optimization of TGI, refer to the text-generation-inference GitHub page.
model=meta-llama/Llama-3.2-3B
volume=$PWD/data
token=<ENTER YOUR HUGGING FACE TOKEN>

sudo docker run -d --gpus all \
    --shm-size 1g \
    -e HF_TOKEN=$token \
    -p 8080:80 \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.4.0 \
    --model-id $model
  4. After the TGI container is running, you can test your endpoint by running the following command from your local environment:
curl <REPLACE WITH YOUR EC2 PUBLIC IP>:8080/generate -X POST \
    -d '{"inputs":"What is deep learning?","parameters":{"max_new_tokens":200, "temperature":0.2, "top_p":0.9}}' \
    -H 'Content-Type: application/json'

Performance evaluation

To demonstrate TTFT improvements with FM inference on Local Zones, we followed the steps in the previous section to deploy Llama 3.2-3B in three locations: in the us-west-2-c Availability Zone in the parent Region, US West (Oregon); in the us-west-2-lax-1a Local Zone in Los Angeles; and in the us-west-2-hnl-1a Local Zone in Honolulu. This is illustrated in the following figure. Notice that the architecture provided in this post is meant to be used for performance evaluation in a development environment. Before migrating any of the provided architecture to production, we recommend following the AWS Well-Architected Framework.

We conducted two separate test scenarios to evaluate TTFT:

Los Angeles test scenario:

  • Test user’s location – Los Angeles metropolitan area
  • Test A – 150 requests sent to FM deployed in Los Angeles Local Zone
  • Test B – 150 requests sent to FM deployed in US West (Oregon)

Honolulu test scenario:

  • Test user’s location – Honolulu metropolitan area
  • Test C – 150 requests sent to FM deployed in Honolulu Local Zone
  • Test D – 150 requests sent to FM deployed in US West (Oregon)

Architecture diagram for the deployment of FM inference endpoints

Evaluation setup

To conduct TTFT measurements, we use the load testing capabilities of the open source project LLMPerf. This tool launches multiple requests from the test user’s client to the FM endpoint and measures various performance metrics, including TTFT. Each request contains a random prompt with a mean token count of 250 tokens. Although a single prompt for short-form conversations typically consists of 50 tokens, we set the mean input token size to 250 tokens to account for multi-turn conversation history, system prompts, and contextual information that better represents real-world usage patterns.

Detailed instructions for installing LLMPerf and executing the load testing are available in the project’s documentation. Additionally, because we are using the Hugging Face TGI as the inference server, we follow the corresponding instructions from LLMPerf to perform the load testing. The following is the example command to initiate the load testing from the command line:

export HUGGINGFACE_API_BASE="http://<REPLACE WITH YOUR EC2 PUBLIC IP>:8080"
export HUGGINGFACE_API_KEY=""

python token_benchmark_ray.py \
    --model "huggingface/meta-llama/Llama-3.2-3B" \
    --mean-input-tokens 250 \
    --stddev-input-tokens 50 \
    --mean-output-tokens 100 \
    --stddev-output-tokens 20 \
    --max-num-completed-requests 150 \
    --timeout 600 \
    --num-concurrent-requests 1 \
    --results-dir "result_outputs" \
    --llm-api "litellm" \
    --additional-sampling-params '{}'

Each test scenario compares the TTFT latency between Local Zone and the parent Region endpoints to assess the impact of geographical distance. Latency results might vary based on several factors, including:

  • Test parameters and configuration
  • Time of day and network traffic
  • Internet service provider
  • Specific client location within the test Region
  • Current server load

Results

The following tables present TTFT measurements in milliseconds (ms) for two distinct test scenarios. The results demonstrate significant TTFT reductions when using a Local Zone compared to the parent Region for both the Los Angeles and the Honolulu test scenarios. The observed differences in TTFT are solely attributed to network latency because identical FM inference configurations were employed in both the Local Zone and the parent Region.

User location: Los Angeles Metropolitan Area

LLM inference endpoint           Mean (ms)  Min (ms)  P25 (ms)  P50 (ms)  P75 (ms)  P95 (ms)  P99 (ms)  Max (ms)
Parent Region: US West (Oregon)  135        118       125       130       139       165       197       288
Local Zone: Los Angeles          80         50        72        75        86        116       141       232

The user in Los Angeles achieved a mean TTFT of 80 ms when calling the FM endpoint in the Los Angeles Local Zone, compared to 135 ms for the endpoint in the US West (Oregon) Region. This represents a 55 ms (about 41%) reduction in latency.

User location: Honolulu Metropolitan Area

LLM inference endpoint           Mean (ms)  Min (ms)  P25 (ms)  P50 (ms)  P75 (ms)  P95 (ms)  P99 (ms)  Max (ms)
Parent Region: US West (Oregon)  197        172       180       183       187       243       472       683
Local Zone: Honolulu             114        58        70        85        164       209       273       369

The user in Honolulu achieved a mean TTFT of 114 ms when calling the FM endpoint in the Honolulu Local Zone, compared to 197 ms for the endpoint in the US West (Oregon) Region. This represents an 83 ms (about 42%) reduction in latency.

Moreover, the TTFT reduction achieved by Local Zone deployments is consistent across all metrics in both test scenarios, from minimum to maximum values and throughout all percentiles (P25–P99), indicating an improvement across all requests.

Finally, remember that TTFT is just one component of overall response latency, alongside on-device processing latency. By reducing TTFT using Local Zones, you create additional margin for on-device processing latency, making it easier to achieve the target response latency range needed for humanlike conversation.
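
As a back-of-the-envelope illustration using the mean values measured above, the Los Angeles Local Zone deployment frees up the following additional budget for on-device processing:

# Mean TTFT values from the Los Angeles test scenario above (milliseconds)
ttft_parent_region_ms = 135
ttft_local_zone_ms = 80

# The difference is extra time that can be spent on on-device STT/TTS processing
extra_on_device_budget_ms = ttft_parent_region_ms - ttft_local_zone_ms
print(f"Additional on-device processing budget: {extra_on_device_budget_ms} ms")  # 55 ms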

Cleanup

In this post, we enabled Local Zones and created subnets, security groups, and EC2 instances. To avoid incurring additional charges, it’s crucial to properly clean up these resources when they’re no longer needed. To do so, follow these steps:

  1. Terminate the EC2 instances and delete their associated Amazon Elastic Block Store (Amazon EBS) volumes.
  2. Delete the security groups and subnets.
  3. Disable the Local Zones.

Conclusion

In conclusion, this post highlights how edge computing services, such as AWS Local Zones, play a crucial role in reducing FM inference latency for conversational AI applications. Our test deployments of Meta’s Llama 3.2-3B demonstrated that placing FM inference endpoints closer to end users through Local Zones dramatically reduces TTFT compared to traditional Regional deployments. This TTFT reduction plays a critical role in optimizing the overall response latency, helping achieve the target response times essential for natural, humanlike interactions regardless of user location.

To use these benefits for your own applications, we encourage you to explore the AWS Local Zones documentation. There, you’ll find information on available locations and supported AWS services so you can bring the power of edge computing to your conversational AI solutions.


About the Authors

Nima Seifi is a Solutions Architect at AWS, based in Southern California, where he specializes in SaaS and LLMOps. He serves as a technical advisor to startups building on AWS. Prior to AWS, he worked as a DevOps architect in the e-commerce industry for over 5 years, following a decade of R&D work in mobile internet technologies. Nima has authored 20+ technical publications and holds 7 U.S. patents. Outside of work, he enjoys reading, watching documentaries, and taking beach walks.

Nelson Ong is a Solutions Architect at Amazon Web Services. He works with early stage startups across industries to accelerate their cloud adoption.

Read More

Pixtral-12B-2409 is now available on Amazon Bedrock Marketplace

Today, we are excited to announce that Pixtral 12B (pixtral-12b-2409), a state-of-the-art 12 billion parameter vision language model (VLM) from Mistral AI that excels in both text-only and multimodal tasks, is available for customers through Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that enables developers to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. You can also use this model with Amazon SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference.

In this post, we walk through how to discover, deploy, and use the Pixtral 12B model for a variety of real-world vision use cases.

Overview of Pixtral 12B

Pixtral 12B, Mistral’s inaugural VLM, delivers robust performance across a range of benchmarks, surpassing other open models and rivaling larger counterparts, according to Mistral’s evaluation. Designed for both image and document comprehension, Pixtral demonstrates advanced capabilities in vision-related tasks, including chart and figure interpretation, document question answering, multimodal reasoning, and instruction following—several of which are illustrated with examples later in this post. The model processes images at their native resolution and aspect ratio, providing high-fidelity input handling. Unlike many open source alternatives, Pixtral 12B achieves strong results in text-based benchmarks—such as instruction following, coding, and mathematical reasoning—without sacrificing its proficiency in multimodal tasks.

Mistral developed a novel architecture for Pixtral 12B, optimized for both computational efficiency and performance. The model consists of two main components: a 400-million-parameter vision encoder, responsible for tokenizing images, and a 12-billion-parameter multimodal transformer decoder, which predicts the next text token based on a sequence of text and images. The vision encoder was specifically trained to natively handle variable image sizes, enabling Pixtral to accurately interpret high-resolution diagrams, charts, and documents while maintaining fast inference speeds for smaller images such as icons, clipart, and equations. This architecture supports processing an arbitrary number of images of varying sizes within a large context window of 128k tokens.

License agreements are a critical decision factor when using open-weights models. Similar to other Mistral models, such as Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, and Mistral Nemo 12B, Pixtral 12B is released under the commercially permissive Apache 2.0, providing enterprise and startup customers with a high-performing VLM option to build complex multimodal applications.

Performance metrics and benchmarks

Pixtral 12B is trained to understand both natural images and documents, achieving 52.5% on the Massive Multitask Language Understanding (MMLU) reasoning benchmark, surpassing a number of larger models according to Mistral. The MMLU benchmark is a test that evaluates a language model’s ability to understand and use language across a variety of subjects. The MMLU consists of over 10,000 multiple-choice questions spanning a variety of academic subjects, including mathematics, philosophy, law, and medicine. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process multiple images in its long context window of 128,000 tokens. Unlike previous open source models, Pixtral doesn’t compromise on text benchmark performance to excel in multimodal tasks, according to Mistral.

You can review the benchmarks published by Mistral for more details.

Prerequisites

To try out Pixtral 12B in Amazon Bedrock Marketplace, you will need the following prerequisites:

Deploy Pixtral 12B in Amazon Bedrock Marketplace

On the Amazon Bedrock console, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).

To access Pixtral 12B in Amazon Bedrock Marketplace, follow these steps:

  1. On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
  2. Filter for Hugging Face as a provider and choose the Pixtral 12B model, or search for Pixtral in the Filter for a model input box.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Pixtral 12B in your applications.

  3. To begin using Pixtral 12B, choose Deploy.

You will be prompted to configure the deployment details for Pixtral 12B. The model ID will be prepopulated.

  4. Read carefully and accept the End User License Agreement (EULA).
  5. For Endpoint name, keep the automatically populated name or enter a custom name.
  6. For Number of instances, enter a number of instances (between 1 and 100).
  7. For Instance type, choose your instance type. For optimal performance with Pixtral 12B, a GPU-based instance type such as ml.g6.12xlarge is recommended.

Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.

  8. Choose Deploy to begin using the model.

When the deployment is complete, Endpoint status should change to In Service. After the endpoint is in service, you can test Pixtral 12B capabilities directly in the Amazon Bedrock playground.

  9. Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with Amazon Bedrock APIs, you need to use the endpoint ARN as the model ID (modelId) in the Amazon Bedrock SDK.
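
For example, with the AWS SDK for Python (Boto3), a minimal call could look like the following sketch. The endpoint ARN is a placeholder; use the ARN shown on your deployment’s details page, and make sure your client is configured for the same AWS Region as the deployment.

import boto3

# The client uses the AWS Region configured for your environment
bedrock_runtime = boto3.client('bedrock-runtime')

# Placeholder: copy the endpoint ARN from your Marketplace deployment's details page
endpoint_arn = '<your-endpoint-arn>'

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{'role': 'user', 'content': [{'text': 'Describe what a vision language model can do.'}]}],
    inferenceConfig={'maxTokens': 512, 'temperature': 0.6, 'topP': 0.9}
)
print(response['output']['message']['content'][0]['text'])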

Pixtral 12B use cases

In this section, we provide example use cases of Pixtral 12B using sample prompts. We have defined helper functions to invoke the Pixtral 12B model using Amazon Bedrock Converse APIs:

import boto3
from PIL import Image

# Create the Amazon Bedrock runtime client; it uses the AWS Region configured
# for your environment, which should match the Region of your deployment
bedrock_runtime = boto3.client('bedrock-runtime')

def get_image_format(image_path):
    with Image.open(image_path) as img:
        # Normalize the format to a known valid one
        fmt = img.format.lower() if img.format else 'jpeg'
        # The Converse API expects 'jpeg' rather than 'jpg'
        if fmt == 'jpg':
            fmt = 'jpeg'
    return fmt

def call_bedrock_model(model_id=None, prompt="", image_paths=None, system_prompt="", temperature=0.6, top_p=0.9, max_tokens=3000):
    # Accept either a single image path or a list of paths
    if isinstance(image_paths, str):
        image_paths = [image_paths]
    if image_paths is None:
        image_paths = []

    # Start building the content array for the user message
    content_blocks = []

    # Include a text block if a prompt is provided
    if prompt.strip():
        content_blocks.append({"text": prompt})

    # Add images as raw bytes (the Converse API does not require base64 encoding)
    for img_path in image_paths:
        fmt = get_image_format(img_path)
        with open(img_path, 'rb') as f:
            image_raw_bytes = f.read()

        content_blocks.append({
            "image": {
                "format": fmt,
                "source": {
                    "bytes": image_raw_bytes
                }
            }
        })

    # Construct the messages structure
    messages = [
        {
            "role": "user",
            "content": content_blocks
        }
    ]

    # Build the arguments for the `converse` call
    converse_kwargs = {
        "modelId": model_id,
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p
        }
    }

    # Pass a system prompt only if one is provided
    if system_prompt.strip():
        converse_kwargs["system"] = [{"text": system_prompt}]

    # Call the Converse API and parse the assistant response
    try:
        response = bedrock_runtime.converse(**converse_kwargs)
        assistant_message = response.get('output', {}).get('message', {})
        assistant_content = assistant_message.get('content', [])
        result_text = "".join(block.get('text', '') for block in assistant_content)
    except Exception as e:
        result_text = f"Error message: {e}"
    return result_text

Visual logical reasoning

One of the interesting use cases for vision models is solving logical reasoning problems or visual puzzles. Pixtral 12B is highly capable of solving logical reasoning questions. Let’s explore an example.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='You are solving logical reasoning problems.'
task = 'Which of these figures differ from the other four?'
image_path = './Pixtral_data/logical_reasoning.jpg'

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

The image shows five figures labeled A, B, C, D, and E. Each figure consists of a head, body, and legs. The task is to identify which figure differs from the others.

**Reasoning:**
- Figure A: The head is connected directly to the body, and the legs are straight.
- Figure B: The head is connected directly to the body, and the legs are straight.
- Figure C: The head is connected directly to the body, and the legs are straight.
- Figure D: The head is connected directly to the body, and the legs are straight.
- Figure E: The head is connected directly to the body, but the legs are bent at the knees.

**Conclusion:** Figure E is the one that differs from the others.

Structured product information

Extracting product information is crucial for the retail industry, especially on sites that host third-party sellers, where product images are the most accessible resource. Accurately capturing relevant details from these images is vital for a product’s success in ecommerce. For instance, using advanced visual models like Pixtral 12B, retailers can efficiently extract key attributes from clothing product images, such as color, style, and patterns. This capability not only streamlines inventory management but also enhances customer experiences by providing essential information that aids in informed purchasing decisions.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='''You are a catalog manager for an ecommerce portal. You have an image of a product. 
Think very carefully in a step-by-step manner to extract product information. 
Always provide the response in the prescribed JSON format.
'''
image_path = './Pixtral_data/cap.png'
task = '''
Organize product information in JSON format to store in a database. 

Output json schema:

{
"product_name": "",
"description": "",
"category": "",
"sub_category": "",
"color": "",
"size": "",
"brand": "",
"material": "",
"features": [],
"image_alt_text": ""
}
'''

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

Certainly! Based on the image provided, here is the extracted product information in JSON format:

```json
{
  "product_name": "Nike Blue Cap",
  "description": "A stylish blue cap from Nike with a sleek design and adjustable fit.",
  "category": "Apparel",
  "sub_category": "Headwear",
  "color": "Blue",
  "size": "Adjustable",
  "brand": "Nike",
  "material": "Nylon",
  "features": [
    "Adjustable fit",
    "Lightweight",
    "Breathable material",
    "Embroidered Nike logo"
  ],
  "image_alt_text": "A blue Nike cap with a white Nike swoosh logo on the front."
}
```
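
In a pipeline, you would typically parse the JSON out of the model’s reply before writing it to your catalog database, because the model wraps the object in a Markdown fence and a short preamble. The following is a minimal sketch; the extract_json helper is an illustrative addition of ours, not part of the Pixtral model or the Amazon Bedrock APIs:

import json
import re

def extract_json(model_response):
    # Prefer the contents of a ```json ... ``` fence if one is present,
    # otherwise fall back to the outermost braces in the reply
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", model_response, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    else:
        candidate = model_response[model_response.find('{'):model_response.rfind('}') + 1]
    return json.loads(candidate)

product = extract_json(response)
print(product['product_name'], '-', product['color'])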

Vehicle damage assessment

In the insurance industry, image analysis plays a crucial role in claims processing. For vehicle damage assessment, vision models like Pixtral 12B can be used to compare images taken at policy issuance with those submitted during a claim. This approach can streamline the evaluation process, potentially reducing loss adjustment expenses and expediting claim resolution. By automating the identification and characterization of automobile damage, insurers can enhance efficiency, improve accuracy, and ultimately provide a better experience for policyholders.

We use the following input images.

Our prompt and input payload are as follows:

image_paths = [
    "Pixtral_data/car_image_before.png",
    "Pixtral_data/car_image_after.png"
]

system_prompt='''
You are a helpful AI assistant for an insurance agent. 
The insurance agent has received a claim for vehicle damage. 
'''

task = '''This claim includes two images. 
One of the images was taken before the incident and the other was taken after the incident.

Analyze these images and answer the questions below:
1. Describe if there is any damage to the vehicle
2. Should the insurance agent accept or reject the claim

'''

print('Input Images:\n\n')
Image.open(image_paths[0]).show()
Image.open(image_paths[1]).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_paths)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

### Analysis of Vehicle Damage

#### Image 1 (Before Incident)
- **Vehicle Condition**: The vehicle appears to be in good condition with no visible damage.
- **Paint and Body**: The paint is uniform, and there are no visible scratches, dents, or other forms of damage.
- **Tires and Wheels**: The tires and wheels look intact and in good condition.

#### Image 2 (After Incident)
- **Vehicle Condition**: The vehicle shows noticeable damage.
- **Paint and Body**: There is a visible dent or scratch on the rear side of the vehicle.
- **Tires and Wheels**: The tires and wheels appear to be intact, but the focus is on the body damage.

### Questions and Answers

1. **Describe if there is any damage to the vehicle:**
   - **Before Incident**: No visible damage.
   - **After Incident**: There is visible damage on the rear side of the vehicle, likely a dent or scratch.

2. **Should the insurance agent accept or reject the claim:**
   - **Recommendation**: The insurance agent should accept the claim. The images clearly show that the vehicle sustained damage after the incident. The before and after images provide sufficient evidence of the damage, which was likely caused by an incident.

### Conclusion
Based on the analysis of the images, the insurance agent should accept the claim as the damage to the vehicle is evident and appears to have occurred after the incident.

Handwriting recognition

Another strength of vision language models is their ability to recognize handwriting and extract handwritten text. Pixtral 12B performs well at extracting content from complex or poorly legible handwritten notes.

We use the following input image.


Our prompt and input payload are as follows:

system_prompt='You are a graphologist.'
task = '''
Analyze the image and transcribe any handwritten text present. 
Convert the handwriting into a single, continuous string of text. 
Maintain the original spelling, punctuation, and capitalization as written. Ignore any printed text, drawings, or other non-handwritten elements in the image.
'''

image_path = './Pixtral_data/a01-000u-04.png'


print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

put down a resolution on the subject

Reasoning of complex figures

VLMs excel at interpreting and reasoning about complex figures, charts, and diagrams. In this particular use case, we use Pixtral 12B to analyze an intricate image containing GDP data. Pixtral 12B’s advanced capabilities in document understanding and complex figure analysis make it well-suited for extracting insights from visual representations of economic data. By processing both the visual elements and accompanying text, Pixtral 12B can provide detailed interpretations and reasoned analysis of the GDP figures presented in the image.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='You are a Global Economist.'
task = 'List the top 5 countries in Europe with the highest GDP'
image_path = './Pixtral_data/gdp.png'

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

The top 5 countries in Europe with the highest GDP, based on the provided diagram, are:

1. **Germany**
   - GDP: $3.99 trillion
   - GDP Percentage: 4.65%

2. **United Kingdom**
   - GDP: $2.82 trillion
   - GDP Percentage: 3.29%

3. **France**
   - GDP: $2.78 trillion
   - GDP Percentage: 3.24%

4. **Italy**
   - GDP: $2.07 trillion
   - GDP Percentage: 2.42%

5. **Spain**
   - GDP: $1.43 trillion
   - GDP Percentage: 1.66%

These countries are highlighted in green on the diagram.

Clean up

To avoid unwanted charges, clean up your resources. If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

Delete the Amazon Bedrock Marketplace deployment

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  4. Select the endpoint, and choose Delete.
  5. Choose Delete to delete the endpoint.
  6. In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.

Conclusion

In this post, we showed you how to get started with the Pixtral 12B model in Amazon Bedrock and deploy the model for inference. The Pixtral 12B vision model enables you to solve multiple use cases, including document understanding, logical reasoning, handwriting recognition, image comparison, entity extraction, extraction of structured data from scanned images, and caption generation. These capabilities can drive productivity across a range of enterprise use cases, including ecommerce (retail), marketing, financial services (FSI), and more.

For more Mistral resources on AWS, check out the GitHub repo. The complete code for the samples featured in this post is available on GitHub. Pixtral 12B is also available in Amazon SageMaker JumpStart; refer to Pixtral 12B is now available on Amazon SageMaker JumpStart for details.


About the Authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he assists financial services and fintech clients in scaling and optimizing their applications on the AWS platform. He specializes in core machine learning and generative AI. Outside of work, Deepesh enjoys spending time with his family and experimenting with various cuisines.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services including model offerings from top tier foundation model providers.

John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.

Read More

Animals Crossing: AI Helps Protect Wildlife Across the Globe

From Seattle, Washington, to Cape Town, South Africa — and everywhere around and between — AI is helping conserve the wild plants and animals that make up the intricate web of life on Earth.

It’s critical work that sustains ecosystems and supports biodiversity at a time when the United Nations estimates over 1 million species are threatened with extinction.

World Wildlife Day, a UN initiative, is celebrated every March 3 to recognize the unique contributions wild animals and plants have on people and the planet — and vice versa.

“Our own survival depends on wildlife,” the above video on this year’s celebration says, “just as much as their survival depends on us.”

Learn more about some of the leading nonprofits and startups using NVIDIA AI and accelerated computing to protect wildlife and natural habitats, today and every day:

Ai2’s EarthRanger Offers World’s Largest Elephant Database

Seattle-based nonprofit AI research institute Ai2 offers EarthRanger, a software platform that helps protected-area managers, ecologists and wildlife biologists make more informed operational decisions for wildlife conservation in real time, whether preventing poaching, spotting ill or injured animals, or studying animal behavior.

Among Ai2’s efforts with EarthRanger is the planned development of a machine learning model — trained using NVIDIA Hopper GPUs in the cloud — that predicts the movement of elephants in areas close to human-wildlife boundaries where elephants could raid crops and potentially prompt humans to retaliate.

With access to the world’s largest repository of elephant movement data, made possible by EarthRanger users who’ve shared their data, the AI model could help predict elephant behaviors, then alert area managers to safely guide the elephants away from risky situations that could arise for them or for people in the vicinity. Area managers or rangers typically use helicopters, other vehicles and chili bombs to safely reroute elephants.

An elephant named Hugo wears a monitoring device that helps keep him safe. Image courtesy of the Mara Elephant Project.

Beyond elephants, EarthRanger collects, integrates and displays data on a slew of wildlife — aggregated from over 100 data sources, including camera traps, acoustic sensors, satellites, radios and more. Then, the platform combines the data with field reports to provide a unified view of collared wildlife, rangers, enforcement assets and infrastructure within a protected area.

EarthRanger platform interface.

“Name a country, species or an environmental cause and we’re probably supporting a field organization’s conservation efforts there,” said Jes Lefcourt, director of EarthRanger at Ai2.

It’s deployed by governments and conservation organizations in 76 countries and 650 protected areas, including nearly every national park in Africa, about a dozen state fishing and wildlife departments in the U.S., as well as many other users across Latin America and Asia.

Four of these partners — Rouxcel Technology, OroraTech, Wildlife Protection Solutions and Conservation X Labs — are highlighted below.

Rouxcel Technology Saves Rhinos With AI

South African startup Rouxcel Technology’s AI-based RhinoWatches, tapping into EarthRanger, learn endangered black and white rhinos’ behaviors, then alert authorities in real time of any detected abnormalities. These abnormalities can include straying from typical habitats, territorial fighting with other animals and other potentially life-threatening situations.

It’s critical work, as only about 28,000 rhinos are left in the world, down from 500,000 at the beginning of the 20th century.

A white rhino sports a Rouxcel RhinoWatch. Image courtesy of Hannah Rippon.

Rouxcel, based in Cape Town, has deployed over 1,200 RhinoWatches — trained and optimized using NVIDIA accelerated computing — across more than 40 South African reserves. The startup, which uses the Ai2 EarthRanger platform, protects more than 1.2 million acres of rhino habitats, and has recently expanded to help conservation efforts in Kenya and Namibia.

Looking forward, Rouxcel is developing AI models to help prevent poaching and human-wildlife conflict for more species, including pangolins, a critically endangered species.

OroraTech Monitors Wildfires and Poaching With NVIDIA CUDA, Jetson

OroraTech — a member of the NVIDIA Inception program for cutting-edge startups — uses the EarthRanger platform to protect wildlife in a different way, offering a wildfire detection and monitoring service that fuses satellite imagery and AI to safeguard the environment and prevent poaching.

Combining data from satellites, ground-based cameras, aerial observations and local weather information, OroraTech detects threats to natural habitats and alerts users in real time. The company’s technologies monitor more than 30 million hectares of land that directly impact wildlife in Africa and Australia. That’s nearly the size of the Great Barrier Reef.

OroraTech detects an early bushfire near Expedition National Park in Australia.

OroraTech flies an NVIDIA Jetson module for edge AI and data processing onboard all of its satellite payloads — the instruments, equipment and systems on a satellite designed for performing specific tasks. Through GPU-accelerated image processing, OroraTech achieves exceptional latency, delivering fire notifications to users on the ground as fast as five minutes after image acquisition.

The AI-based fire-detection pipeline uses the NVIDIA cuDNN library of deep neural network primitives and the NVIDIA TensorRT software development kit for thermal anomaly detection and cloud masking in space, leading to high-precision fire detections.

Wildlife Protection Solutions Helps Preserve Endangered Species

International nonprofit Wildlife Protection Solutions (WPS) supports more than 250 conservation projects in 50+ countries. Its roughly 3,000 remote cameras deployed across the globe use AI models to provide real-time monitoring of animals and poachers, alerting rangers so they can intervene before wildlife is harmed.

A lion detected with WPS technologies.

WPS — which also taps into the EarthRanger platform — harnesses NVIDIA accelerated computing to optimize training and inference of its AI models, which process and analyze 65,000 photos per day.

The WPS tool is free and available on any mobile, tablet or desktop browser, enabling remote monitoring, early alerting and proactive, automated deterrence of wildlife or humans in sensitive areas.

Conservation X Labs Identifies Species From Crowdsourced Images

Seattle-based Conservation X Labs — which is on a mission to prevent the sixth mass extinction, or the dying out of a high percentage of the world’s biodiversity due to natural phenomena and human activity — also uses EarthRanger, including for its Wild Me solution: open-source AI software for the conservation research community.

Wild Me supports over 2,000 researchers across the globe running AI-enabled wildlife population studies for marine and terrestrial species.

In the below video, Wild Me helps researchers classify whale sharks using computer vision:

The crowdsourced database — which currently comprises 14 million photos — lets anyone upload imagery of species. Then, AI foundation models trained using NVIDIA accelerated computing help identify species to ease and accelerate animal population assessments and other research that supports the fight against species extinction.

In addition, Conservation X Labs’s Sentinel technology transforms traditional wildlife monitoring tools — like trail cameras and acoustic recorders — with AI, processing environmental data as it’s collected and providing conservationists with real-time, data-driven insights through satellite and cellular networks.

To date, Sentinel devices have delivered about 100,000 actionable insights for 80 different species. For example, see how the technology flags a limping panther, so wildlife protectors could rapidly step in to offer aid:

Learn more about how NVIDIA technologies bolster conservation and environmental initiatives at NVIDIA GTC, a global AI conference running March 17-21 in San Jose, California, including at sessions on how AI is supercharging Antarctic flora monitoring, enhancing a digital twin of the Great Barrier Reef and helping mitigate urban climate change.

Featured video courtesy of Conservation X Labs.

Read More