March 2025 – Page 10

How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B (text in/text out and code out). The Qwen 2.5 fine tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models, and many of the publicly available chat models based on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection can support over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used when using AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMS, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face along with the Optimum Neuron cache will transparently supply a compiled model when available. If you’re using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.

You can deploy TGI as a docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.

Create a .env file with the following content:

MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel' 
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

Create a file named docker-compose.yaml with the following content:

version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} #only needed for gated models
    volumes:
      - $PWD:/data #can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"

Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up

To confirm that the model deployed correctly, send a test prompt to the model:

curl 127.0.0.1:8081/generate 
    -X POST 
    -d '{
  "inputs":"Tell me about AWS.",
  "parameters":{
    "max_new_tokens":60
  }
}' 
    -H 'Content-Type: application/json'

To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:

#"Tell me how to open an AWS account"
curl 127.0.0.1:8081/generate 
    -X POST 
    -d '{
  "inputs":"告诉我如何开设 AWS 账户。", 
  "parameters":{
    "max_new_tokens":60
  }
}' 
    -H 'Content-Type: application/json'

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker using instructions on the Hugging Face Model Hub.

From the Qwen 2.5 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

Copy the example code into a SageMaker notebook, then choose Run.
The notebook you copied will look like the following:

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.

Terminate a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.

About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AIML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solution architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.

Revolutionizing customer service: MaestroQA’s integration with Amazon Bedrock for actionable insight

This post is cowritten with Harrison Hunter is the CTO and co-founder of MaestroQA.

MaestroQA augments call center operations by empowering the quality assurance (QA) process and customer feedback analysis to increase customer satisfaction and drive operational efficiencies. They assist with operations such as QA reporting, coaching, workflow automations, and root cause analysis.

In this post, we dive deeper into one of MaestroQA’s key features—conversation analytics, which helps support teams uncover customer concerns, address points of friction, adapt support workflows, and identify areas for coaching through the use of Amazon Bedrock. We discuss the unique challenges MaestroQA overcame and how they use AWS to build new features, drive customer insights, and improve operational inefficiencies.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The opportunity for open-ended conversation analysis at enterprise scale

MaestroQA serves a diverse clientele across various industries, including ecommerce, marketplaces, healthcare, talent acquisition, insurance, and fintech. All of these customers have a common challenge: the need to analyze a high volume of interactions with their customers. Analyzing these customer interactions is crucial to improving their product, improving their customer support, providing customer satisfaction, and identifying key industry signals. However, customer interaction data such as call center recordings, chat messages, and emails are highly unstructured and require advanced processing techniques in order to accurately and automatically extract insights.

When customers receive incoming calls at their call centers, MaestroQA employs its proprietary transcription technology, built by enhancing open source transcription models, to transcribe the conversations. After the data is transcribed, MaestroQA uses technology they have developed in combination with AWS services such as Amazon Comprehend to run various types of analysis on the customer interaction data. For example, MaestroQA offers sentiment analysis for customers to identify the sentiment of their end customer during the support interaction, enabling MaestroQA’s customers to sort their interactions and manually inspect the best or worst interactions. MaestroQA also offers a logic/keyword-based rules engine for classifying customer interactions based on other factors such as timing or process steps including metrics like Average Handle Time (AHT), compliance or process checks, and SLA adherence.

MaestroQA’s customers love these analysis features because they allow them to continuously improve the quality of their support and identify areas where they can improve their product to better satisfy their end customers. However, they were also interested in more advanced analysis, such as asking open-ended questions like “How many times did the customer ask for an escalation?” MaestroQA’s existing rules engine couldn’t always answer these types of queries because end-users could ask for the same outcome in many different ways. For example, “Can I speak to your manager?” and “I would like to speak to someone higher up” don’t share the same keywords, but are both asking for an escalation. MaestroQA needed a way to accurately classify customer interactions based on open-ended questions.

MaestroQA faced an additional hurdle: the immense scale of customer interactions their clients manage. With clients handling anywhere from thousands to millions of customer engagements monthly, there was a pressing need for comprehensive analysis of support team performance across this vast volume of interactions. Consequently, MaestroQA had to develop a solution capable of scaling to meet their clients’ extensive needs.

To start developing this product, MaestroQA first rolled out a product called AskAI. AskAI allowed customers to run open-ended questions on a targeted list of up to 1,000 conversations. For example, a customer might use MaestroQA’s filters to find customer interactions in Oregon within the past two months and then run a root cause analysis query such as “What are customers frustrated about in Oregon?” to find churn risk anecdotes. Their customers really liked this feature and surprised MaestroQA with the breadth of use cases they covered, including analyzing marketing campaigns, service issues, and product opportunities. Customers started to request the ability to run this type of analysis across all of their transcripts, which could number in the millions, so they could quantify the impact of what they were seeing and find instances of important issues.

Solution overview

MaestroQA decided to use Amazon Bedrock to address their customers’ need for advanced analysis of customer interaction transcripts. Amazon Bedrock’s broad choice of FMs from leading AI companies, along with its scalability and security features, made it an ideal solution for MaestroQA.

MaestroQA integrated Amazon Bedrock into their existing architecture using Amazon Elastic Container Service (Amazon ECS). The customer interaction transcripts are stored in an Amazon Simple Storage Service (Amazon S3) bucket.

The following architecture diagram demonstrates the request flow for AskAI. When a customer submits an analysis request through MaestroQA’s web application, an ECS cluster retrieves the relevant transcripts from Amazon S3, cleans and formats the prompt, sends them to Amazon Bedrock for analysis using the customer’s selected FM, and stores the results in a database hosted in Amazon Elastic Compute Cloud (Amazon EC2), where they can be retrieved by MaestroQA’s frontend web application.

MaestroQA offers their customers the flexibility to choose from multiple FMs available through Amazon Bedrock, including Anthropic’s Claude 3.5 Sonnet, Anthropic’s Claude 3 Haiku, Mistral 7b/8x7b, Cohere’s Command R and R+, and Meta’s Llama 3.1 models. This allows customers to select the model that best suits their specific use case and requirements.

The following screenshot shows how the AskAI feature allows MaestroQA’s customers to use the wide variety of FMs available on Amazon Bedrock to ask open-ended questions such as “What are some of the common issues in these tickets?” and generate useful insights from customer service interactions.

To handle the high volume of customer interaction transcripts and provide low-latency responses, MaestroQA takes advantage of the cross-Region inference capabilities of Amazon Bedrock. Originally, they were doing the load balancing themselves, distributing requests between available AWS US Regions (us-east-1, us-west-2, and so on) and available EU Regions (eu-west-3, eu-central-1, and so on) for their North American and European customers, respectively. Now, the cross-Region inference capability of Amazon Bedrock enables MaestroQA to achieve twice the throughput compared to single-Region inference, a critical factor in scaling their solution to accommodate more customers. MaestroQA’s team no longer has to spend time and effort to predict their demand fluctuations, which is especially key when usage increases for their ecommerce customers around the holiday season. Cross-Region inference dynamically routes traffic across multiple Regions, providing optimal availability for each request and smoother performance during these high-usage periods. MaestroQA monitors this setup’s performance and reliability using Amazon CloudWatch.

Benefits: How Amazon Bedrock added value

Amazon Bedrock has enabled MaestroQA to innovate faster and gain a competitive advantage by offering their customers powerful generative AI features for analyzing customer interaction transcripts. With Amazon Bedrock, MaestroQA can now provide their customers with the ability to run open-ended queries across millions of transcripts, unlocking valuable insights that were previously inaccessible.

The broad choice of FMs available through Amazon Bedrock allows MaestroQA to cater to their customers’ diverse needs and preferences. Customers can select the model that best aligns with their specific use case, finding the right balance between performance and price.

The scalability and cross-Region inference capabilities of Amazon Bedrock enable MaestroQA to handle high volumes of customer interaction transcripts while maintaining low latency, regardless of their customers’ geographical locations.

MaestroQA takes advantage of the robust security features and ethical AI practices of Amazon Bedrock to bolster customer confidence. These measures make sure that client data remains secure during processing and isn’t used for model training by third-party providers. Additionally, Amazon Bedrock availability in Europe, coupled with its geographic control capabilities, allows MaestroQA to seamlessly extend AI services to European customers. This expansion is achieved without introducing additional complexities, thereby maintaining operational efficiency while adhering to Regional data regulations.

The adoption of Amazon Bedrock proved to be a game changer for MaestroQA’s compact development team. Its serverless architecture allowed the team to rapidly prototype and refine their application without the burden of managing complex hardware infrastructure. This shift enabled MaestroQA to channel their efforts into optimizing application performance rather than grappling with resource allocation. Moreover, Amazon Bedrock offers seamless compatibility with their existing AWS environment, allowing for a smooth integration process and further streamlining their development workflow. MaestroQA was able to use their existing authentication process with AWS Identity and Access Management (IAM) to securely authenticate their application to invoke large language models (LLMs) within Amazon Bedrock. They were also able to use the familiar AWS SDK to quickly and effortlessly integrate Amazon Bedrock into their application.

Overall, by using Amazon Bedrock, MaestroQA is able to provide their customers with a powerful and flexible solution for extracting valuable insights from their customer interaction data, driving continuous improvement in their products and support processes.

Success metrics

The early results have been remarkable.

A lending company uses MaestroQA to detect compliance risks on 100% of their conversations. Before, agents would raise internal escalations if a consumer complained about the loan or expressed being in a vulnerable state. However, this process was manual and error prone, and the lending company would miss many of these risks. Now, they are able to detect compliance risks with almost 100% accuracy.

A medical device company, who is required to report device issues to the FDA, no longer relies solely on agents to report internally customer-reported issues, but uses this service to analyze all of their conversations to make sure all complaints are flagged.

An education company has been able to replace their manual survey scores with an automated customer sentiment score that increased their sample size from 15% to 100% of conversations.

The best is yet to come.

Conclusion

Using AWS, MaestroQA was able to innovate faster and gain a competitive advantage. Companies from different industries such as financial services, healthcare and life sciences, and EdTech all share the common desire to provide better customer services for their clients. MaestroQA was able to enable them to do that by quickly pivoting to offer powerful generative AI features that solved tangible business problems and enhanced overall compliance.

Check out MaestroQA’s feature AskAI and their LLM-powered AI Classifiers if you’re interested in better understanding your customer conversations and survey scores. For more about Amazon Bedrock, see Get started with Amazon Bedrock and learn about features such as cross-Region inference to help scale your generative AI features globally.

About the Authors

Carole Suarez is a Senior Solutions Architect at AWS, where she helps guide startups through their cloud journey. Carole specializes in data engineering and holds an array of AWS certifications on a variety of topics including analytics, AI, and security. She is passionate about learning languages and is fluent in English, French, and Tagalog.

Ben Gruher is a Generative AI Solutions Architect at AWS, focusing on startup customers. Ben graduated from Seattle University where he obtained bachelor’s and master’s degrees in Computer Science and Data Science.

Harrison Hunter is the CTO and co-founder of MaestroQA where he leads the engineer and product teams. Prior to MaestroQA, Harrison studied computer science and AI at MIT.

Optimize hosting DeepSeek-R1 distilled models with Hugging Face TGI on Amazon SageMaker AI

DeepSeek-R1, developed by AI startup DeepSeek AI, is an advanced large language model (LLM) distinguished by its innovative, multi-stage training process. Instead of relying solely on traditional pre-training and fine-tuning, DeepSeek-R1 integrates reinforcement learning to achieve more refined outputs. The model employs a chain-of-thought (CoT) approach that systematically breaks down complex queries into clear, logical steps. Additionally, it uses NVIDIA’s parallel thread execution (PTX) constructs to boost training efficiency, and a combined framework of supervised fine-tuning (SFT) and group robust policy optimization (GRPO) makes sure its results are both transparent and interpretable.

In this post, we demonstrate how to optimize hosting DeepSeek-R1 distilled models with Hugging Face Text Generation Inference (TGI) on Amazon SageMaker AI.

Model Variants

The current DeepSeek model collection consists of the following models:

DeepSeek-V3 – An LLM that uses a Mixture-of-Experts (MoE) architecture. MoE models like DeepSeek-V3 and Mixtral replace the standard feed-forward neural network in transformers with a set of parallel sub-networks called experts. These experts are selectively activated for each input, allowing the model to efficiently scale to a much larger size without a corresponding increase in computational cost. For example, DeepSeek-V3 is a 671-billion-parameter model, but only 37 billion parameters (approximately 5%) are activated during the output of each token. DeepSeek-V3-Base is the base model from which the R1 variants are derived.
DeepSeek-R1-Zero – A fine-tuned variant of DeepSeek-V3 based on using reinforcement learning to guide CoT reasoning capabilities, without any SFT done prior. According to the DeepSeek R1 paper, DeepSeek-R1-Zero excelled at reasoning behaviors but encountered challenges with readability and language mixing.
DeepSeek-R1 – Another fine-tuned variant of DeepSeek-V3-Base, built similarly to DeepSeek-R1-Zero, but with a multi-step training pipeline. DeepSeek-R1 starts with a small amount of cold-start data prior to the GRPO process. It also incorporates SFT data through rejection sampling, combined with supervised data generated from DeepSeek-V3 to retrain DeepSeek-V3-base. After this, the retrained model goes through another round of RL, resulting in the DeepSeek-R1 model checkpoint.
DeepSeek-R1-Distill – Variants of Hugging Face’s Qwen and Meta’s Llama based on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. The distilled model variants are a result of fine-tuning Qwen or Llama models through knowledge distillation, where DeepSeek-R1 acts as the teacher and Qwen or Llama as the student. These models retain their existing architecture while gaining additional reasoning capabilities through a distillation process. They are exclusively fine-tuned using SFT and don’t incorporate any RL techniques.

The following figure illustrates the performance of DeepSeek-R1 compared to other state-of-the-art models on standard benchmark tests, such as MATH-500, MMLU, and more.

Hugging Face Text Generation Inference (TGI)

Hugging Face Text Generation Inference (TGI) is a high-performance, production-ready inference framework optimized for deploying and serving large language models (LLMs) efficiently. It is designed to handle the demanding computational and latency requirements of state-of-the-art transformer models, including Llama, Falcon, Mistral, Mixtral, and GPT variants – for a full list of TGI supported models refer to supported models.

Amazon SageMaker AI provides a managed way to deploy TGI-optimized models, offering deep integration with Hugging Face’s inference stack for scalable and cost-efficient LLM deployment. To learn more about Hugging Face TGI support on Amazon SageMaker AI, refer to this announcement post and this documentation on deploy models to Amazon SageMaker AI.

Key Optimizations in Hugging Face TGI

Hugging Face TGI is built to address the challenges associated with deploying large-scale text generation models, such as inference latency, token throughput, and memory constraints.

The benefits of using TGI include:

Tensor parallelism – Splits large models across multiple GPUs for efficient memory utilization and computation
Continuous batching – Dynamically batches multiple inference requests to maximize token throughput and reduce latency
Quantization – Lowers memory usage and computational cost by converting model weights to INT8 or FP16
Speculative decoding – Uses a smaller draft model to speed up token prediction while maintaining accuracy
Key-value cache optimization – Reduces redundant computations for faster response times in long-form text generation
Token streaming – Streams tokens in real time for low-latency applications like chatbots and virtual assistants

Runtime TGI Arguments

TGI containers support runtime configurations that provide greater control over LLM deployments. These configurations allow you to adjust settings such as quantization, model parallel size (tensor parallel size), maximum tokens, data type (dtype), and more using container environment variables.

Notable runtime parameters influencing your model deployment include:

HF_MODEL_ID : This parameter specifies the identifier of the model to load, which can be a model ID from the Hugging Face Hub (e.g., meta-llama/Llama-3.2-11B-Vision-Instruct) or Simple Storage Service (S3) URI containing the model files.
HF_TOKEN : This parameter variable provides the access token required to download gated models from the Hugging Face Hub, such as Llama or Mistral.
SM_NUM_GPUS : This parameter specifies the number of GPUs to use for model inference, allowing the model to be sharded across multiple GPUs for improved performance.
MAX_CONCURRENT_REQUESTS : This parameter controls the maximum number of concurrent requests that the server can handle, effectively managing the load and ensuring optimal performance.
DTYPE : This parameter sets the data type for the model weights during loading, with options like float16 or bfloat16, influencing the model’s memory consumption and computational performance.

There are additional optional runtime parameters that are already pre-optimized in TGI containers to maximize performance on host hardware. However, you can modify them to exercise greater control over your LLM inference performance:

MAX_TOTAL_TOKENS: This parameter sets the upper limit on the combined number of input and output tokens a deployment can handle per request, effectively defining the “memory budget” for client interactions.
MAX_INPUT_TOKENS : This parameter specifies the maximum number of tokens allowed in the input prompt of each request, controlling the length of user inputs to manage memory usage and ensure efficient processing.
MAX_BATCH_PREFILL_TOKENS : This parameter caps the total number of tokens processed during the prefill stage across all batched requests, a phase that is both memory-intensive and compute-bound, thereby optimizing resource utilization and preventing out-of-memory errors.

For a complete list of runtime configurations, please refer to text-generation-launcher arguments.

DeepSeek Deployment Patterns with TGI on Amazon SageMaker AI

Amazon SageMaker AI offers a simple and streamlined approach to deploy DeepSeek-R1 models with just a few lines of code. Additionally, SageMaker endpoints support automatic load balancing and autoscaling, enabling your LLM deployment to scale dynamically based on incoming requests. During non-peak hours, the endpoint can scale down to zero, optimizing resource usage and cost efficiency.

The table below summarizes all DeepSeek-R1 models available on the Hugging Face Hub, as uploaded by the original model provider, DeepSeek.

Model	# Total Params	# Activated Params	Context Length	Download
DeepSeek-R1-Zero	671B	37B	128K	deepseek-ai/DeepSeek-R1-Zero
DeepSeek-R1	671B	37B	128K	deepseek-ai/DeepSeek-R1

DeepSeek AI also offers distilled versions of its DeepSeek-R1 model to offer more efficient alternatives for various applications.

Model	Base Model	Download
DeepSeek-R1-Distill-Qwen-1.5B	Qwen2.5-Math-1.5B	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
DeepSeek-R1-Distill-Qwen-7B	Qwen2.5-Math-7B	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Llama-8B	Llama-3.1-8B	deepseek-ai/DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Qwen-14B	Qwen2.5-14B	deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Qwen-32B	Qwen2.5-32B	deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
DeepSeek-R1-Distill-Llama-70B	Llama-3.3-70B-Instruct	deepseek-ai/DeepSeek-R1-Distill-Llama-70B

There are two ways to deploy LLMs, such as DeepSeek-R1 and its distilled variants, on Amazon SageMaker:

Option 1: Direct Deployment from Hugging Face Hub

The easiest way to host DeepSeek-R1 in your AWS account is by deploying it (along with its distilled variants) using TGI containers. These containers simplify deployment through straightforward runtime environment specifications. The architecture diagram below shows a direct download from the Hugging Face Hub, ensuring seamless integration with Amazon SageMaker.

The following code shows how to deploy the DeepSeek-R1-Distill-Llama-8B model to a SageMaker endpoint, directly from the Hugging Face Hub.

import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)

role = sagemaker.get_execution_role()
session = sagemaker.Session()

# select the latest 3+ version container 
deploy_image_uri = get_huggingface_llm_image_uri(
     "huggingface", 
     version="3.0.1" 
)

deepseek_tgi_model = HuggingFaceModel(
            image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        ...
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model" # optional
)

pretrained_tgi_predictor = deepseek_tgi_model.deploy(
   endpoint_name="deepseek-r1-llama-8b-endpoint", # optional 
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=False, # set to true to wait for endpoint InService
)

To deploy other distilled models, simply update the HF_MODEL_ID to any of the DeepSeek distilled model variants, such as deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, or deepseek-ai/DeepSeek-R1-Distill-Llama-70B.

Option 2: Deployment from a Private S3 Bucket

To deploy models privately within your AWS account, upload the DeepSeek-R1 model weights to a S3 bucket and set HF_MODEL_ID to the corresponding S3 bucket prefix. TGI will then retrieve and deploy the model weights from S3, eliminating the need for internet downloads during each deployment. This approach reduces model loading latency by keeping the weights closer to your SageMaker endpoints and enable your organization’s security teams to perform vulnerability scans before deployment. SageMaker endpoints also support auto-scaling, allowing DeepSeek-R1 to scale horizontally based on incoming request volume while seamlessly integrating with elastic load balancing.

Deploying a DeepSeek-R1 distilled model variant from S3 follows the same process as option 1, with one key difference: HF_MODEL_ID points to the S3 bucket prefix instead of the Hugging Face Hub. Before deployment, you must first download the model weights from the Hugging Face Hub and upload them to your S3 bucket.

deepseek_tgi_model = HuggingFaceModel(
            image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "s3://my-model-bucket/path/to/model",
        ...
    }, 
          vpc_config={ 
                 "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
    "SecurityGroupIds": ["sg-zzzzzzzz"] 
     },
    role=role, 
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model-s3" # optional
)

Deployment Best Practices

The following are some best practices to consider when deploying DeepSeek-R1 models on SageMaker AI:

Deploy within a private VPC – It’s recommended to deploy your LLM endpoints inside a virtual private cloud (VPC) and behind a private subnet, preferably with no egress. See the following code:

deepseek_tgi_model = HuggingFaceModel(
            image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        ...
    },
    role=role,
          vpc_config={ 
                 "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
                 "SecurityGroupIds": ["sg-zzzzzzzz"] 
     },
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model" # optional
)

Implement guardrails for safety and compliance – Always apply guardrails to validate incoming and outgoing model responses for safety, bias, and toxicity. You can use Amazon Bedrock Guardrails to enforce these protections on your SageMaker endpoint responses.

Inference Performance Evaluation

This section presents examples of the inference performance of DeepSeek-R1 distilled variants on Amazon SageMaker AI. Evaluating LLM performance across key metrics—end-to-end latency, throughput, and resource efficiency—is crucial for ensuring responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly enhances user experience, system reliability, and deployment feasibility at scale.

All DeepSeek-R1 Qwen (1.5B, 7B, 14B, 32B) and Llama (8B, 70B) variants are evaluated against four key performance metrics:

End-to-End Latency
Throughput (Tokens per Second)
Time to First Token
Inter-Token Latency

Please note that the main purpose of this performance evaluation is to give you an indication about relative performance of distilled R1 models on different hardware. We didn’t try to optimize the performance for each model/hardware/use case combination. These results should not be treated like a best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets and input/output sequence length.

If you are interested in running this evaluation job inside your own account, refer to our code on GitHub.

Scenarios

We tested the following scenarios:

Tokens – Two input token lengths were used to evaluate the performance of DeepSeek-R1 distilled variants hosted on SageMaker endpoints. Each test was executed 100 times, with concurrency set to 1, and the average values across key performance metrics were recorded. All models were run with dtype=bfloat16.
- Short-length test – 512 input tokens, 256 output tokens.
- Medium-length test – 3,072 input tokens, 256 output tokens.
Hardware – Several instance families were tested, including p4d (NVIDIA A100), g5 (NVIDIA A10G), g6 (NVIDIA L4), and g6e (NVIDIA L40s), each equipped with 1, 4, or 8 GPUs per instance. For additional details regarding pricing and instance specifications, refer to Amazon SageMaker AI pricing. In the following table, a green cell indicates a model was tested on a specific instance type, and a red cell indicates the model wasn’t tested because the instance type or size was excessive for the model or unable to run because of insufficient GPU memory.

Box Plots

In the following sections we use a box plot to visualize model performance. A box is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness.

DeepSeek-R1-Distill-Qwen-1.5B

A single GPU instance is sufficient to host one or multiple concurrent DeepSeek-R1-Distill-Qwen-1.5B serving workers on TGI. In this test, a single worker was deployed, and performance was evaluated across the four outlined metrics. Results show that the ml.g5.xlarge outperforms the ml.g6.xlarge across all metrics.

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge. The ml.g6e.xlarge instance performed the best, followed by ml.g5.2xlarge, ml.g5.xlarge, and the ml.g6 instances.

DeepSeek-R1-Distill-Llama-8B

Similar to the 7B variant, DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge, with ml.g6e.xlarge demonstrating the highest performance among all instances.

DeepSeek-R1-Distill-Qwen-14B

DeepSeek-R1-Distill-Qwen-14B was tested on ml.g5.12xlarge, ml.g6.12xlarge, ml.g6e.2xlarge, and ml.g6e.xlarge. The ml.g5.12xlarge exhibited the highest performance, followed by ml.g6.12xlarge. Although at a lower performance profile, DeepSeek-R1-14B can also be deployed on the single GPU g6e instances due to their larger memory footprint.

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Qwen-32B requires more than 48GB of memory, making ml.g5.12xlarge, ml.g6.12xlarge, and ml.g6e.12xlarge suitable for performance comparison. In this test, ml.g6e.12xlarge delivered the highest performance, followed by ml.g5.12xlarge, with ml.g6.12xlarge ranking third.

DeepSeek-R1-Distill-Llama-70B

DeepSeek-R1-Distill-Llama-70B was tested on ml.g5.48xlarge, ml.g6.48xlarge, ml.g6e.12xlarge, ml.g6e.48xlarge, and ml.p4dn.24xlarge. The best performance was observed on ml.p4dn.24xlarge, followed by ml.g6e.48xlarge, ml.g6e.12xlarge, ml.g5.48xlarge, and finally ml.g6.48xlarge.

Clean Up

To avoid incurring cost after completing your evaluation, ensure you delete the endpoints you created earlier.

import boto3

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=<region>)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

Conclusion

In this blog, you learned about the current versions of DeepSeek models and how to use the Hugging Face TGI containers to simplify the deployment of DeepSeek-R1 distilled models (or any other LLM) on Amazon SageMaker AI to just a few lines of code. You also learned how to deploy models directly from the Hugging Face Hub for quick experimentation and from a private S3 bucket to provide enhanced security and model deployment performance.

Finally, you saw an extensive evaluation of all DeepSeek-R1 distilled models across four key inference performance metrics using 13 different NVIDIA accelerator instance types. This analysis provides valuable insights to help you select the optimal instance type for your DeepSeek-R1 deployment. All code used to analyze DeepSeek-R1 distilled model variants are available on GitHub.

About the Authors

Pranav Murthy is a Worldwide Technical Lead and Sr. GenAI Data Scientist at AWS. He helps customers build, train, deploy, evaluate, and monitor Machine Learning (ML), Deep Learning (DL), and Generative AI (GenAI) workloads on Amazon SageMaker. Pranav specializes in multimodal architectures, with deep expertise in computer vision (CV) and natural language processing (NLP). Previously, he worked in the semiconductor industry, developing AI/ML models to optimize semiconductor processes using state-of-the-art techniques. In his free time, he enjoys playing chess, training models to play chess, and traveling. You can find Pranav on LinkedIn.

Simon Pagezy is a Cloud Partnership Manager at Hugging Face, dedicated to making cutting-edge machine learning accessible through open source and open science. With a background in AI/ML consulting at AWS, he helps organizations leverage the Hugging Face ecosystem on their platform of choice.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.

Drop It Like It’s Mod: Breathing New Life Into Classic Games With AI in NVIDIA RTX Remix

PC game modding is massive, with over 5 billion mods downloaded annually. Mods push graphics forward with each GPU generation, extend a game’s lifespan with new content and attract new players.

NVIDIA RTX Remix is a modding platform for RTX AI PCs that lets modders capture game assets, automatically enhance materials with generative AI tools and create stunning RTX remasters with full ray tracing. Today, RTX Remix exited beta and fully launched with new NVIDIA GeForce RTX 50 Series neural rendering technology and many community-requested upgrades.

Since its initial beta release, RTX Remix has been experimented with by over 30,000 modders, bringing ray-traced mods of hundreds of classic titles to over 1 million gamers.

RTX Remix supports a host of AI tools, including NVIDIA DLSS 4, RTX Neural Radiance Cache and the community-published AI model PBRFusion 3.

Modders can build 4K physically based rendering (PBR) assets by hand or use generative AI to accelerate their workflows. And with a few additional clicks, RTX Remix mods support DLSS 4 with Multi Frame Generation. DLSS’ new transformer model and the first neural shader, Neural Radiance Cache, provide enhanced neural rendering performance, meaning classic games look and play better than ever.

Generative AI Texture Tools

RTX Remix’s built-in generative AI texture tools analyze low-resolution textures from classic games, generate physically accurate materials — including normal and roughness maps — and upscale the resolution by up to 4x. Many RTX Remix mods have been created incorporating generative AI.

Earlier this month, RTX Remix modder NightRaven published PBRFusion 3 — a new AI model that upscales textures and generates high-quality normal, roughness and height maps for physically-based materials.

PBRFusion 3 consists of two custom-trained models: a PBR model and a diffusion-based upscaler. PBRFusion 3 can also use the RTX Remix application programming interface to connect with ComfyUI in an integrated flow. NightRaven has packaged all the relevant pieces to make it easy to get started.

The PBRFusion3 page features a plug-and-play package that includes the relevant ComfyUI graphs and nodes. Once installed, remastering is easy. Select a number of textures in RTX Remix’s Viewport and hit process in ComfyUI. This integrated flow enables extensive remasters of popular games to be completed by small hobbyist mod teams.

RTX Remix and REST API

RTX Remix Toolkit capabilities are accessible via REST API, allowing modders to livelink RTX Remix to digital content creation tools such as Blender, modding tools such as Hammer and generative AI apps such as ComfyUI.

For example, through REST API integration, modders can seamlessly export all game textures captured in RTX Remix to ComfyUI and enhance them in one big batch before automatically bringing them back into the game. ComfyUI is RTX-accelerated and includes thousands of generative AI models to try, helping reduce the time to remaster a game scene and providing many ways to process textures.

Modders have many super resolution and PBR models to choose from, including ones that feature metallic and height maps — unlocking 8x or more resolution increases. Additionally, ComfyUI enables modders to use text prompts to generate new details in textures, or make grand stylistic departures by changing an entire scene’s look with a single text prompt.

‘Half-Life 2 RTX’ Demo

Half-Life 2 owners can download a free Half-Life 2 RTX demo from Steam, built with RTX Remix, starting March 18. The demo showcases Orbifold Studios’ work in Ravenholm and Nova Prospekt ahead of the full game’s release at a later date.

Half-Life 2 RTX showcases the expansive capabilities of RTX Remix and NVIDIA’s neural rendering technologies. DLSS 4 with Multi Frame Generation multiplies frame rates by up to 10x at 4K. Neural Radiance Cache further accelerates ray-traced lighting. RTX Skin enhances Father Grigori, headcrabs and zombies with one of the first implementations of subsurface scattering in ray-traced gaming. RTX Volumetrics add realistic smoke effects and fog. And everything interplays and interacts with the fully ray-traced lighting.

What’s Next in AI Starts Here

From the keynote by NVIDIA founder and CEO Jensen Huang on Tuesday, March 18, to over 1,000 inspiring sessions, 300+ exhibits, technical hands-on training and tons of unique networking events — NVIDIA’s own GTC is set to put a spotlight on AI and all its benefits.

Experts from across the AI ecosystem will share insights on deploying AI locally, optimizing models and harnessing cutting-edge hardware and software to enhance AI workloads — highlighting key advancements in RTX AI PCs and workstations. RTX AI Garage will be there to share highlights of the latest advancements coming to the RTX AI platform.

Follow NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter

Gaming Goodness: NVIDIA Reveals Latest Neural Rendering and AI Advancements Supercharging Game Development at GDC 2025

AI is leveling up the world’s most beloved games, as the latest advancements in neural rendering, NVIDIA RTX and digital human technologies equip game developers to take innovative leaps in their work.

At this year’s GDC conference, running March 17-21 in San Francisco, NVIDIA is revealing new AI tools and technologies to supercharge the next era of graphics in games.

Key announcements include new neural rendering advancements with Unreal Engine 5 and Microsoft DirectX; NVIDIA DLSS 4 now available in over 100 games and apps, making it the most rapidly adopted NVIDIA game technology of all time; and a Half-Life 2 RTX demo coming Tuesday, March 18.

Plus, the open-source NVIDIA RTX Remix modding platform has now been released, and NVIDIA ACE technology enhancements are bringing to life next-generation digital humans and AI agents for games.

Neural Shaders Enable Photorealistic, Living Worlds With AI

The next era of computer graphics will be based on NVIDIA RTX Neural Shaders, which allow the training and deployment of tiny neural networks from within shaders to generate textures, materials, lighting, volumes and more. This results in dramatic improvements in game performance, image quality and interactivity, delivering new levels of immersion for players.

At the CES trade show earlier this year, NVIDIA introduced RTX Kit, a comprehensive suite of neural rendering technologies for building AI-enhanced, ray-traced games with massive geometric complexity and photorealistic characters.

Now, at GDC, NVIDIA is expanding its powerful lineup of neural rendering technologies, including with Microsoft DirectX support and plug-ins for Unreal Engine 5.

NVIDIA is partnering with Microsoft to bring neural shading support to the DirectX 12 Agility software development kit preview in April, providing game developers with access to RTX Tensor Cores to accelerate the performance of applications powered by RTX Neural Shaders.

Plus, Unreal Engine developers will be able to get started with RTX Kit features such as RTX Mega Geometry and RTX Hair through the experimental NVIDIA RTX branch of Unreal Engine 5. These enable the rendering of assets with dramatic detail and fidelity, bringing cinematic-quality visuals to real-time experiences.

Now available, NVIDIA’s “Zorah” technology demo has been updated with new incredibly detailed scenes filled with millions of triangles, complex hair systems and cinematic lighting in real time — all by tapping into the latest technologies powering neural rendering, including:

ReSTIR Path Tracing
ReSTIR Direct Illumination
RTX Mega Geometry
RTX Hair

And the first neural shader, Neural Radiance Cache, is now available in RTX Remix.

Over 100 DLSS 4 Games and Apps Out Now

DLSS 4 debuted with the release of GeForce RTX 50 Series GPUs. Over 100 games and apps now feature support for DLSS 4. This milestone has been reached two years quicker than with DLSS 3, making DLSS 4 the most rapidly adopted NVIDIA game technology of all time.

DLSS 4 introduced Multi Frame Generation, which uses AI to generate up to three additional frames per traditionally rendered frame, working with the complete suite of DLSS technologies to multiply frame rates by up to 8x over traditional brute-force rendering.

This massive performance improvement on GeForce RTX 50 Series graphics cards and laptops enables gamers to max out visuals at the highest resolutions and play at incredible frame rates.

In addition, Lost Soul Aside, Mecha BREAK, Phantom Blade Zero, Stellar Blade, Tides of Annihilation and Wild Assault will launch with DLSS 4, giving GeForce RTX gamers the definitive PC experience in each title. Learn more.

Developers can get started with DLSS 4 through the DLSS 4 Unreal Engine plug-in.

‘Half-Life 2 RTX’ Demo Launch, RTX Remix Official Release

Half-Life 2 RTX is a community-made remaster of the iconic first-person shooter Half-Life 2.

A playable Half-Life 2 RTX demo will be available on Tuesday, March 18, for free download from Steam for Half-Life 2 owners. The demo showcases Orbifold Studios’ work in the eerily sensational maps of Ravenholm and Nova Prospekt, with significantly improved assets and textures, full ray tracing, DLSS 4 with Multi Frame Generation and RTX neural rendering technologies.

Half-Life 2 RTX was made possible by NVIDIA RTX Remix, an open-source platform officially released today for modders to create stunning RTX remasters of classic games.

Use the platform now to join the 30,000+ modders who’ve experimented with enhancing hundreds of classic titles since its beta release last year, enabling over 1 million gamers to experience astonishing ray-traced mods.

NVIDIA ACE Technologies Enhance Game Characters With AI

The NVIDIA ACE suite of RTX-accelerated digital human technologies brings game characters to life with generative AI.

NVIDIA ACE autonomous game characters add autonomous teammates, nonplayer characters (NPCs) and self-learning enemies to games, creating new narrative possibilities and enhancing player immersion.

ACE autonomous game characters are debuting in two titles this month:

In inZOI, “Smart Zoi” NPCs will respond more realistically and intelligently to their environment based on their personalities. The game launches with NVIDIA ACE-based characters on Friday, March 28.

And in NARAKA: BLADEPOINT MOBILE PC VERSION, on-device NVIDIA ACE-powered teammates will help players battle enemies, hunt for loot and fight for victory starting Thursday, March 27.

Developers can start building with ACE today.

Join NVIDIA at GDC.

See notice regarding software product information.

Relive the Magic as GeForce NOW Brings More Blizzard Gaming to the Cloud

Bundle up — GeForce NOW is bringing a flurry of Blizzard titles to its ever-expanding library.

Prepare to weather epic gameplay in the cloud, tackling the genres of real-time strategy (RTS), multiplayer online battle arena (MOBA) and more. Classic Blizzard titles join GeForce NOW, including Heroes of the Storm, Warcraft Rumble and three titles from the Warcraft: Remastered series.

They’re all part of 11 games joining the cloud this week, atop the latest update for hit game Zenless Zone Zero from miHoYo.

Blizzard Heats Things Up

Heroes of the Storm on GeForce NOW — *Heroes (and save data) never die in the cloud.*

Heroes of the Storm, Blizzard’s unique take on the MOBA genre, offers fast-paced team battles across diverse battlegrounds. The game features a roster of iconic Blizzard franchise characters, each with customizable talents and abilities. Heroes of the Storm emphasizes team-based gameplay with shared experiences and objectives, making it more accessible to newcomers while providing depth for experienced players.

Warcraft Rumble on GeForce NOW — *The cloud is rumbling.*

In Warcraft Rumble, a mobile action-strategy game set in the Warcraft universe, players collect and deploy miniature versions of the series’ beloved characters. The game offers a blend of tower defense and RTS elements as players battle across various modes, including a single-player campaign, player vs. player matches and cooperative dungeons.

Warcraft Remastered on GeForce NOW — *Old-school cool, new-school graphics.*

The Warcraft Remastered collection gives the classic RTS titles a modern twist with updated visuals and quality-of-life improvements. Warcraft: Remastered and Warcraft II: Remastered offer enhanced graphics while maintaining the original gameplay, allowing players to toggle between classic and updated visuals. Warcraft III: Reforged includes new graphics options and multiplayer features. Both these remasters provide nostalgia for long-time fans and an ideal opportunity for new players to experience the iconic strategy games that shaped the genre.

New Games, No Wait

Zenless Zone Zero update 1.6 Among the Forgotten Ruins on GeForce NOW — *New agents, new adventures.*

The popular Zenless Zone Zero gets its 1.6 update, “Among the Forgotten Ruins,” now available for members to stream without waiting around for updates or downloads. This latest update brings three new playable agents: Soldier 0-Anby, Pulchra and Trigger. Players can explore two new areas, Port Elpis and Reverb Arena, as well as try out the “Hollow Zero-Lost Void” mode. The update also introduces a revamped Decibel system for more strategic gameplay.

Look for the following games available to stream in the cloud this week:

Citizen Sleeper 2: Starward Vector (Xbox, available on PC Game Pass)
City Transport Simulator: Tram (Steam)
Dave the Diver (Steam)
Heroes of the Storm (Battle.net)
Microtopia (Steam)
Orcs Must Die Deathtrap (Xbox, available on PC Game Pass)
Potion Craft: Alchemist Simulator (Steam)
Warcraft I Remastered (Battle.net)
Warcraft II Remastered (Battle.net)
Warcraft III: Reforged (Battle.net)
Warcraft Rumble (Battle.net)

What are you planning to play this weekend? Let us know on X or in the comments below.

You wake up in the last game you played – how are things going?

— NVIDIA GeForce NOW (@NVIDIAGFN) March 11, 2025

Introducing the New PyTorch Landscape: Your Guide to the PyTorch Ecosystem

We’re excited to reveal our brand new PyTorch Landscape. The PyTorch Landscape helps researchers, developers, and organizations easily locate useful, curated, community-built tools that augment the PyTorch core framework.

What the Landscape Offers

The Landscape visually organizes projects into three categories—Modeling, Training, and Optimizations—making finding relevant frameworks, libraries, and projects easy. Users can quickly locate curated, valuable tools for a variety of use cases that complement the PyTorch framework. Each tool that is part of the Landscape has been reviewed and vetted by PyTorch project experts. The projects in the Landscape are considered to be mature and healthy and provide valuable capabilities that complement the PyTorch framework in their respective domains.

Explore the AI Landscape

The Explore page presents platforms, tools, and libraries, each with a logo, description, and links to GitHub and further details. This categorized, visual approach simplifies discovery and provides quick access to essential technologies.

Guide Page: A Closer Look

For deeper insights, the Guide page expands on each project, highlighting methodologies and trends shaping AI development, from adversarial robustness to self-supervised learning. There are also project statistics provided for each project, including metrics such as number of stars, contributors, commit history, languages used, license, and other valuable metrics that provide an in-depth understanding of the project and how it may be used.

Tracking AI’s Growth: The Stats Page

The Stats page provides insights into AI development trends, tracking repository activity, programming languages, and industry funding data.

Repositories: 117 repositories, 20.5k contributors, and 797.2k stars across 815MB of source code.
Development Trends: Weekly commit activity over the last year.
Licensing Breakdown: Repositories are categorized by license type.
Funding & Acquisitions: Insights into investment trends, including funding rounds and acquisitions.

Why Use the PyTorch Landscape?

Finding useful and high quality open source projects that complement the PyTorch core system can be overwhelming. The PyTorch Landscape offers a clear, accessible way to explore the ecosystem of community-built tools, whether you’re researching, building models, or making strategic decisions.

Stay ahead with the PyTorch Landscape — your guide to the PyTorch Ecosystem.

Want to Contribute a Project to the PyTorch Landscape?

Have you built a useful open source tool that you would like to share with the PyTorch community? Then help us grow the Ecosystem by contributing your tool! You can find the instructions to apply here. We welcome all contributions from the community!

Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

In this paper, we propose a new task – generating speech from videos of people and their transcripts (VTTS) – to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos, and is also more complicated than the task of generating generic audio clips (e.g., dog barking) from videos and text. Multilingual versions of the task could lead to new techniques for cross-lingual dubbing. We also present a decoder-only multimodal model for this task, which we call Visatronic. This model embeds vision, text and speech directly…Apple Machine Learning Research

Exploring creative possibilities: A visual guide to Amazon Nova Canvas

Compelling AI-generated images start with well-crafted prompts. In this follow-up to our Amazon Nova Canvas Prompt Engineering Guide, we showcase a curated gallery of visuals generated by Nova Canvas—categorized by real-world use cases—from marketing and product visualization to concept art and design exploration.

Each image is paired with the prompt and parameters that generated it, providing a practical starting point for your own AI-driven creativity. Whether you’re crafting specific types of images, optimizing workflows, or simply seeking inspiration, this guide will help you unlock the full potential of Amazon Nova Canvas.

Solution overview

Getting started with Nova Canvas is straightforward. You can access the model through the Image Playground on the AWS Management Console for Amazon Bedrock, or through APIs. For detailed setup instructions, including account requirements and necessary permissions, visit our documentation on Creative content generation with Amazon Nova. Our previous post on prompt engineering best practices provides comprehensive guidance on crafting effective prompts.

A visual guide to Amazon Nova Canvas

In this gallery, we showcase a diverse range of images and the prompts used to generate them, highlighting how Amazon Nova Canvas adapts to various use cases—from marketing and product design to storytelling and concept art.

All images that follow were generated using Nova Canvas at a 1280x720px resolution with a CFG scale of 6.5, seed of 0, and the Premium setting for image quality. This resolution also matches the image dimensions expected by Nova Reel, allowing you to take these images into Amazon Nova Reel to experiment with video generation.

Landscapes


Overhead perspective of winding river delta, capturing intricate branching waterways and sediment patterns. Soft morning light revealing subtle color gradations between water and land. Revealing landscape’s hidden fluid dynamics from bird’s-eye view.		Sparse arctic tundra landscape at twilight, expansive white terrain with isolated rock formations silhouetted against a deep blue sky. Low-contrast black and white composition capturing the infinite horizon, with subtle purple hues in the shadows. Ultra-wide-angle perspective emphasizing the vastness of negative space and geological simplicity.

Wide-angle aerial shot of patchwork agricultural terrain at golden hour, with long shadows accentuating the texture and topography of the land. Emphasis on the interplay of light and shadow across the geometric field divisions.		Dynamic drone perspective of a dramatic shoreline at golden hour, capturing long shadows cast by towering sea stacks and coastal cliffs. Hyper-detailed imagery showcasing the interplay of warm sunlight on rocky textures and the cool, foamy edges of incoming tides.

Dramatic wide-angle shot of a rugged mountain range at sunset, with a lone tree silhouetted in the foreground, creating a striking focal point.		Wide-angle capture of a hidden beach cove, surrounded by towering cliffs, with a shipwreck partially visible in the shallow waters.

Character portraits


A profile view of a weathered fisherman, silhouetted against a pastel dawn sky. The rim lighting outlines the shape of his beard and the texture of his knit cap. Rendered with high contrast to emphasize the rugged contours of his face and the determined set of his jaw.

A weathered fisherman with a thick gray beard and a knit cap, framed against the backdrop of a misty harbor at dawn. The image captures him in a medium shot, revealing more of his rugged attire. Cool, blue tones dominate the scene, contrasting with the warm highlights on his face.		An intimate portrait of a seasoned fisherman, his face filling the frame. His thick gray beard is flecked with sea spray, and his knit cap is pulled low over his brow. The warm glow of sunset bathes his weathered features in golden light, softening the lines of his face while still preserving the character earned through years at sea. His eyes reflect the calm waters of the harbor behind him.

A seaside cafe at sunrise, with a seasoned barista’s silhouette visible through the window. Their kind smile is illuminated by the warm glow of the rising sun, creating a serene atmosphere. The image has a dreamy, soft-focus quality with pastel hues.

A dynamic profile shot of a barista in motion, captured mid-conversation with a customer. Their smile is genuine and inviting, with laugh lines accentuating their seasoned experience. The cafe’s interior is rendered in soft bokeh, maintaining the cinematic feel with a shallow depth of field.		A front-facing portrait of an experienced barista, their welcoming smile framed by the sleek espresso machine. The background bustles with blurred cafe activity, while the focus remains sharp on the barista’s friendly demeanor. The lighting is contrasty, enhancing the cinematic mood.

Fashion photography


A model with sharp cheekbones and platinum pixie cut in a distressed leather bomber jacket stands amid red smoke in an abandoned subway tunnel. Wide-angle lens, emphasizing tunnel’s converging lines, strobed lighting creating a sense of motion.

A model with sharp cheekbones and platinum pixie cut wears a distressed leather bomber jacket, posed against a stark white cyclorama. Low-key lighting creates deep shadows, emphasizing the contours of her face. Shot from a slightly lower angle with a medium format camera, highlighting the jacket’s texture.		Close-up portrait of a model with defined cheekbones and a platinum pixie cut, emerging from an infinity pool while wearing a wet distressed leather bomber jacket. Shot from a low angle with a tilt-shift lens, blurring the background for a dreamy fashion magazine aesthetic.

A model with sharp cheekbones and platinum pixie cut is wearing a distressed leather bomber jacket, caught mid-laugh at a backstage fashion show. Black and white photojournalistic style, natural lighting.		Side profile of a model with defined cheekbones and a platinum pixie cut, standing still amidst the chaos of Chinatown at midnight. The distressed leather bomber jacket contrasts with the blurred neon lights in the background, creating a sense of urban solitude.

Product photography


A flat lay featuring a premium matte metal water bottle with bamboo accents, placed on a textured linen cloth. Eco-friendly items like a cork notebook, a sprig of eucalyptus, and a reusable straw are arranged around it. Soft, natural lighting casts gentle shadows, emphasizing the bottle’s matte finish and bamboo details. The background is an earthy tone like beige or light gray, creating a harmonious and sustainable composition.		Angled perspective of the premium water bottle with bamboo elements, positioned on a natural jute rug. Surrounding it are earth-friendly items: a canvas tote bag, a stack of recycled paper notebooks, and a terracotta planter with air-purifying plants. Warm, golden hour lighting casts long shadows, emphasizing textures and creating a cozy, sustainable atmosphere. The scene evokes a sense of eco-conscious home or office living.

An overhead view of the water bottle’s bamboo cap, partially unscrewed to reveal the threaded metal neck. Soft, even lighting illuminates the entire scene, showcasing the natural variations in the bamboo’s color and grain. The bottle’s matte metal body extends out of frame, creating a minimalist composition that draws attention to the sustainable materials and precision engineering.		An angled view of a premium matte metal water bottle with bamboo accents, showcasing its sleek profile. The background features a soft blur of a serene mountain lake. Golden hour sunlight casts a warm glow on the bottle’s surface, highlighting its texture. Captured with a shallow depth of field for product emphasis.

A pair of premium over-ear headphones with a matte black finish and gold accents, arranged in a flat lay on a clean white background. Organic leaves for accents. small notepad, pencils, and a carrying case are neatly placed beside the headphones, creating a symmetrical and balanced composition. Bright, diffused lighting eliminates shadows, emphasizing the sleek design without distractions. A shadowless, crisp aesthetic.

An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting accentuates the curves and edges, casting subtle shadows that highlight the product’s premium build quality.		An extreme macro shot focusing on the junction where the leather ear cushion meets the metallic housing of premium over-ear headphones. Sharp details reveal the precise stitching and material textures, while selective focus isolates this area against a softly blurred, dark background, showcasing the product’s premium construction.

An overhead shot of premium over-ear headphones resting on a reflective surface, showcasing the symmetry of the design. Dramatic side lighting casts long shadows, accentuating the curves of the headband and the depth of the ear cups against a minimalist white background.		A dynamic composition of premium over-ear headphones floating in space, with the headband and ear cups slightly separated to showcase individual components. Rim lighting outlines each piece, while a gradient background adds depth and sophistication.

A smiling student holding up her smartphone, displaying a green matte screen for easy image replacement, in a classroom setting.		Overhead view of a young man typing on a laptop with a green matte screen, surrounded by work materials on a wooden table.

Food photography


Monochromatic macarons arranged in precise geometric pattern. Strong shadow play. Architectural lighting. Minimal composition.

A pyramid of macarons in ombre pastels, arranged on a matte black slate surface. Dramatic side lighting from left. Close-up view highlighting texture of macaron shells. Garnished with edible gold leaf accents. Shot at f/2 aperture for shallow depth of field.		Disassembled macaron parts in zero-g chamber. Textured cookie halves, viscous filling streams, and scattered almond slivers drifting. High-contrast lighting with subtle shadows on off-white. Wide-angle shot showcasing full dispersal pattern.

Architectural design


A white cubic house with floor-to-ceiling windows, interior view from living room. Double-height space, floating steel staircase, polished concrete floors. Late afternoon sunbeams streaming across minimal furnishings. Ultra-wide architectural lens.		A white cubic house with floor-to-ceiling windows, kitchen and dining space. Monolithic marble island, integrated appliances, dramatic shadows from skylight above. Shot from a low angle with a wide-angle lens, emphasizing the height and openness of the space, late afternoon golden hour light streaming in.

An angular white modernist house featuring expansive glass walls, photographed for Architectural Digest’s cover. Misty morning atmosphere, elongated infinity pool creating a mirror image, three-quarter aerial view, lush coastal vegetation framing the scene.

A white cubic house with floor-to-ceiling windows presented as detailed architectural blueprints. Site plan view showing landscaping and property boundaries, technical annotations, blue background with white lines, precise measurements and zoning specifications visible.		A white cubic house with floor-to-ceiling windows in precise isometric projection. X-ray style rendering revealing internal framework, electrical wiring, and plumbing systems. Technical cross-hatching on load-bearing elements and foundation.

Concept art


A stylized digital painting of a bustling plaza in a futuristic eco-city, with soft impressionistic brushstrokes. Crystalline towers frame the scene, while suspended gardens create a canopy overhead. Holographic displays and eco-friendly vehicles add life to the foreground. Dreamlike and atmospheric, with glowing highlights in sapphire and rose gold.		A stylized digital painting of an elevated park in a futuristic eco-city, viewed from a high angle, with soft impressionistic brushstrokes. Crystalline towers peek through a canopy of trees, while winding elevated walkways connect floating garden platforms. People relax in harmony with nature. Dreamlike and atmospheric, with glowing highlights in jade and amber.

Concept art of a floating garden platform in a futuristic city, viewed from below. Translucent roots and hanging vines intertwine with advanced technology, creating a mesmerizing canopy. Soft bioluminescent lights pulse through the vegetation, casting ethereal patterns on the ocean’s surface. A gradient of deep purples and blues dominates the twilight sky.

An enchanted castle atop a misty cliff at sunrise, warm golden light bathing the ivy-covered spires. A wide-angle view capturing a flock of birds soaring past the tallest tower, set against a dramatic sky with streaks of orange and pink. Mystical ambiance and dynamic composition.		A magical castle rising from morning fog on a rugged cliff face, bathed in cool blue twilight. A low-angle shot showcasing the castle’s imposing silhouette against a star-filled sky, with a crescent moon peeking through wispy clouds. Mysterious mood and vertical composition emphasizing height.

An enchanted fortress clinging to a mist-shrouded cliff, caught in the moment between night and day. A panoramic view from below, revealing the castle’s reflection in a tranquil lake at the base of the cliff. Ethereal pink and purple hues in the sky, with a V-formation of birds flying towards the castle. Serene atmosphere and balanced symmetry.

Illustration


Japanese ink wash painting of a cute baby dragon with pearlescent mint-green scales and tiny wings curled up in a nest made of cherry blossom petals. Delicate brushstrokes, emphasis on negative space.		Art nouveau-inspired composition centered on an endearing dragon hatchling with gleaming mint-green scales. Sinuous morning glory stems and blossoms intertwine around the subject, creating a harmonious balance. Soft, dreamy pastels and characteristic decorative elements frame the scene.

Watercolor scene of a cute baby dragon with pearlescent mint-green scales crouched at the edge of a garden puddle, tiny wings raised. Soft pastel flowers and foliage frame the composition. Loose, wet-on-wet technique for a dreamy atmosphere, with sunlight glinting off ripples in the puddle.

A playful, hand-sculpted claymation-style baby dragon with pearlescent mint scales and tiny wings, sitting on a puffy marshmallow cloud. Its soft, rounded features and expressive googly eyes give it a lively, mischievous personality as it giggles and flaps its stubby wings, trying to take flight in a candy-colored sky.		A whimsical, animated-style render of a baby dragon with pearlescent mint scales nestled in a bed of oversized, bioluminescent flowers. The floating island garden is bathed in the warm glow of sunset, with fireflies twinkling like stars. Dynamic lighting accentuates the dragon’s playful expression.

Graphic design


A set of minimalist icons for a health tracking app. Dual-line design with 1.5px stroke weight on solid backgrounds. Each icon uses teal for the primary line and a lighter shade for the secondary line, with ample negative space. Icons maintain consistent 64x64px dimensions with centered compositions. Clean, professional aesthetic suitable for both light and dark modes.		Stylized art deco icons for fitness tracking. Geometric abstractions of health symbols with gold accents. Balanced designs incorporating circles, triangles, and zigzag motifs. Clean and sophisticated.

Set of charming wellness icons for digital health tracker. Organic, hand-drawn aesthetic with soft, curvy lines. Uplifting color combination of lemon yellow and fuchsia pink. Subtle size variations among icons for a dynamic, handcrafted feel.

Lush greenery tapestry in 16:9 panoramic view. Detailed monstera leaves overlap in foreground, giving way to intricate ferns and tendrils. Emerald and sage watercolor washes create atmospheric depth. Foliage density decreases towards center, suggesting an enchanted forest clearing.		Modern botanical line drawing in 16:9 widescreen. Forest green single-weight outlines of stylized foliage. Negative space concentrated in the center for optimal text placement. Geometric simplification of natural elements with a focus on curves and arcs.

3D sculptural typography spelling out “BRAVE” with each letter made from a different material, arranged in a dynamic composition.		Experimental typographic interpretation of “BRAVE” using abstract, interconnected geometric shapes that flow and blend organically. Hyper-detailed textures reminiscent of fractals and natural patterns create a mesmerizing, otherworldly appearance with sharp contrast.

A dreamy photograph overlaid with delicate pen-and-ink drawings, blending reality and fantasy to reveal hidden magic in ordinary moments.		Surreal digital collage blending organic and technological elements in a futuristic style.

Abstract figures emerging from digital screens, gradient color transitions, mixed textures, dynamic composition, conceptual narrative style.		Abstract humanoid forms materializing from multiple digital displays, vibrant color gradients flowing between screens, contrasting smooth and pixelated textures, asymmetrical layout with visual tension, surreal storytelling aesthetic.

Abstract figures emerging from digital screens, glitch art aesthetic with RGB color shifts, fragmented pixel clusters, high contrast scanlines, deep shadows cast by volumetric lighting.

Conclusion

The examples showcased here are just the beginning of what’s possible with Amazon Nova Canvas. For even greater control, you can guide generations with reference images, use custom color palettes, or make precise edits—such as swapping backgrounds or refining details— with simple inputs. Plus, with built-in safeguards such as watermarking and content moderation, Nova Canvas offers a responsible and secure creative experience. Whether you’re a professional creator, a marketing team, or an innovator with a vision, Nova Canvas provides the tools to bring your ideas to life.

We invite you to explore these possibilities yourself and discover how Nova Canvas can transform your creative process. Stay tuned for our next installment, where we’ll dive into the exciting world of video generation with Amazon Nova Reel.

Ready to start creating? Visit the Amazon Bedrock console today and bring your ideas to life with Nova Canvas. For more information about features, specifications, and additional examples, explore our documentation on creative content generation with Amazon Nova.

About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Kris Schultz has spent over 25 years bringing engaging user experiences to life by combining emerging technologies with world class design. As Sr. Solutions Architect within Amazon AGI, he influences the development of Amazon’s first-party generative AI models. Kris is passionate about empowering users and creators of all types with generative AI tools and knowledge.

Sanju Sunny is a Generative AI Design Technologist with AWS Prototyping & Cloud Engineering (PACE), specializing in strategy, engineering, and customer experience. He collaborates with customers across diverse industries, leveraging Amazon’s customer-obsessed innovation mechanisms to rapidly conceptualize, validate, and prototype innovative products, services, and experiences.

Nitin Eusebius is a Sr. Enterprise Solutions Architect at AWS, experienced in Software Engineering, Enterprise Architecture, and AI/ML. He is deeply passionate about exploring the possibilities of generative AI. He collaborates with customers to help them build well-architected applications on the AWS platform, and is dedicated to solving technology challenges and assisting with their cloud journey.

Gemini Robotics brings AI into the physical world

Introducing Gemini Robotics and Gemini Robotics-ER, AI models designed for robots to understand, act and react to the physical world.Read More

Preparation

Option 1: Deploy TGI on Amazon EC2 Inf2

Option 2: Deploy TGI on SageMaker

Clean Up

Conclusion

About the Authors

The opportunity for open-ended conversation analysis at enterprise scale

Solution overview

Benefits: How Amazon Bedrock added value

Success metrics

Conclusion

About the Authors

Model Variants

Hugging Face Text Generation Inference (TGI)

Key Optimizations in Hugging Face TGI

Runtime TGI Arguments

DeepSeek Deployment Patterns with TGI on Amazon SageMaker AI

Option 1: Direct Deployment from Hugging Face Hub

Option 2: Deployment from a Private S3 Bucket

Deployment Best Practices

Inference Performance Evaluation

Scenarios

DeepSeek-R1-Distill-Qwen-1.5B

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Llama-8B

DeepSeek-R1-Distill-Qwen-14B

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Llama-70B

Clean Up

Conclusion

About the Authors

Generative AI Texture Tools

RTX Remix and REST API

‘Half-Life 2 RTX’ Demo

What’s Next in AI Starts Here

Neural Shaders Enable Photorealistic, Living Worlds With AI

Over 100 DLSS 4 Games and Apps Out Now

‘Half-Life 2 RTX’ Demo Launch, RTX Remix Official Release

NVIDIA ACE Technologies Enhance Game Characters With AI

Blizzard Heats Things Up

New Games, No Wait

What the Landscape Offers

Explore the AI Landscape

Guide Page: A Closer Look

Tracking AI’s Growth: The Stats Page

Why Use the PyTorch Landscape?

Want to Contribute a Project to the PyTorch Landscape?

Solution overview

A visual guide to Amazon Nova Canvas

Landscapes

Character portraits

Fashion photography

Product photography

Food photography

Architectural design

Concept art

Illustration

Graphic design

Conclusion

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.