GTC 2025 – Announcements and Live Updates

What’s next in AI is at GTC 2025. Not only the technology, but the people and ideas that are pushing AI forward — creating new opportunities, novel solutions and whole new ways of thinking. For all of that, this is the place.

Here’s where to find the news, hear the discussions, see the robots and ponder the just-plain mind-blowing. From the keynote to the final session, check back for live coverage kicking off when the doors open on Monday, March 17, in San Jose, California.

The Future Rolls Into San Jose

Anyone who’s been in downtown San Jose lately has seen it happening. The banners are up. The streets are shifting. The whole city is getting a fresh coat of NVIDIA green.

From March 17-21, San Jose will become a crossroads for the thinkers, tinkerers and true enthusiasts of AI, robotics and accelerated computing. The conversations will be sharp, fast-moving and sometimes improbable — but that’s the point.

At the center of it all? NVIDIA founder and CEO Jensen Huang’s keynote, offering a glimpse into the future. It’ll take place at the SAP Center on Tuesday, March 18, at 10 a.m. PT. Expect big ideas, a few surprises, some roars of laughter and the occasional moment that leaves the room silent.

But GTC isn’t just what happens on stage. It’s a conference that refuses to stay inside its walls. It spills out into sessions at McEnery Convention Center, hands-on demos at the Tech Interactive Museum, late-night conversations at the Plaza de César Chávez night market and more. San Jose isn’t just hosting GTC. It’s becoming it.

The speakers are a mix of visionaries and builders — the kind of people who make you rethink what’s possible:

🧠Yann LeCun – chief AI scientist at Meta, professor, New York University
🏆Frances Arnold – Nobel Laureate, Caltech
🚗RJ Scaringe – founder and CEO of Rivian
🤖Pieter Abbeel – robotics pioneer, UC Berkeley
🌍Arthur Mensch – CEO of Mistral AI
🌮Joe Park – chief digital and technology officer of Yum! Brands
♟Noam Brown – research scientist at OpenAI

Some are pushing the limits of AI itself; others are weaving it into the world around us.

📢 Want in? Register now.

Check back here for what to watch, read and play — and what it all means. Tune in to all the big moments, the small surprises and the ideas that’ll stick for years to come.

See you in San Jose. #GTC25

What’s new in TensorFlow 2.19

Posted by the TensorFlow team

TensorFlow 2.19 has been released! Highlights of this release include changes to the C++ API in LiteRT, bfloat16 support for TFLite casting, and the discontinuation of libtensorflow package releases. Learn more by reading the full release notes.

Note: Release updates on the new multi-backend Keras will be published on keras.io, starting with Keras 3.0. For more information, please see https://keras.io/keras_3/.

TensorFlow Core

LiteRT

The public constants tflite::Interpreter::kTensorsReservedCapacity and tflite::Interpreter::kTensorsCapacityHeadroom are now const references, rather than constexpr compile-time constants. (This is to enable better API compatibility for TFLite in Play services while preserving the implementation flexibility to change the values of these constants in the future.)

TF-Lite

The tfl.Cast op now supports bfloat16 in the runtime kernel. tf.lite.Interpreter now gives a deprecation warning redirecting to its new location at ai_edge_litert.interpreter, as the tf.lite.Interpreter API will be deleted in TF 2.20. See the migration guide for details.
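As a minimal migration sketch (assuming the ai-edge-litert package is installed; the model path here is a placeholder):

# Before (deprecated in TF 2.19, to be removed in TF 2.20):
# import tensorflow as tf
# interpreter = tf.lite.Interpreter(model_path="model.tflite")

# After: use the interpreter from its new location in ai_edge_litert.
from ai_edge_litert.interpreter import Interpreter

interpreter = Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()
print(interpreter.get_input_details())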

Libtensorflow

We have stopped publishing standalone libtensorflow packages, but the libraries can still be extracted from the PyPI package.

Benchmarking customized models on Amazon Bedrock using LLMPerf and LiteLLM

Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning them for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant portion of the effort, often requiring 30% of project time, because engineers must optimize instance types and configure serving parameters through careful testing. This process can be complex and time-consuming, requiring specialized knowledge and iteration to achieve the desired performance.

Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment, making sure that deployments are performant and cost-effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero: when there are no invocations for 5 minutes, it scales to zero, and you pay only for what you use in 5-minute increments. It also handles scaling up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations looking to use custom models on Amazon Bedrock, providing simplicity and cost-efficiency.

Before deploying these models in production, it’s crucial to evaluate their performance using benchmarking tools. These tools help to proactively detect potential production issues such as throttling and verify that deployments can handle expected production loads.

This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using popular open source tools: LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions to deploy a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply to any other model supported by Amazon Bedrock Custom Model Import.

Prerequisites

This post requires an Amazon Bedrock custom model. If you don’t have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.

Using open source tools LLMPerf and LiteLLM for performance benchmarking

To conduct performance benchmarking, you will use LLMPerf, a popular open-source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray Clients and analyzing their responses. A key advantage of LLMPerf is its wide support of foundation model APIs, including LiteLLM, which supports all models available on Amazon Bedrock.

Setting up your custom model invocation with LiteLLM

LiteLLM is a versatile open source tool that can be used both as a Python SDK and a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider’s specific endpoint requirements. It supports Amazon Bedrock APIs, including InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.

To invoke a custom model with LiteLLM, you use the model parameter (see Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.

The provider_route indicates the LiteLLM implementation of request/response specification to use. DeepSeek R1 models can be invoked using their custom chat template using the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.

The model_arn is the model Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request.
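To retrieve the ARN programmatically, a minimal boto3 sketch might look like the following (the response field names follow the ListImportedModels API; adjust as needed for your SDK version):

import boto3

# The "bedrock" control-plane client exposes the ListImportedModels API.
bedrock = boto3.client("bedrock")

response = bedrock.list_imported_models()
for model in response["modelSummaries"]:
    print(model["modelName"], model["modelArn"])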

For example, the following script invokes the custom model using the DeepSeek R1 chat template.

import time
from litellm import completion

# model_id is the ARN of the imported custom model (see the previous section).
while True:
    try:
        response = completion(
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
        - Company A's revenue grew from $10M to $15M in 2023
        - Operating costs increased by 20%
        - Initial operating costs were $7M
        
        Calculate the company's operating margin for 2023. Please reason step by step."""},
                      {"role": "assistant", "content": "<think>"}],
            max_tokens=4096,
        )
        print(response['choices'][0]['message']['content'])
        break
    except Exception:
        # The model may still be scaling up from zero; wait and retry.
        time.sleep(60)

After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.

Configuring a token benchmark test with LLMPerf

To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns multiple remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray Clients, and allows for simulation of various load scenarios and concurrent request patterns. At the same time, each client will collect performance metrics during the requests, including latency, throughput, and error rates.

Two critical metrics for performance include latency and throughput:

  • Latency refers to the time it takes for a single request to be processed.
  • Throughput measures the number of tokens that are generated per second.

Selecting the right configuration to serve FMs typically involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it’s still crucial to verify your deployment’s latency and throughput.

Start by configuring token_benchmark_ray.py, a sample script that facilitates the configuration of a benchmarking test. In the script, you can define parameters such as:

  • LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
  • Model: Define the route, API, and model ARN to invoke similarly to the previous section.
  • Mean/standard deviation of input tokens: Parameters to use in the probability distribution from which the number of input tokens will be sampled.
  • Mean/standard deviation of output tokens: Parameters to use in the probability distribution from which the number of output tokens will be sampled.
  • Number of concurrent requests: The number of users that the application is likely to support when in use.
  • Number of completed requests: The total number of requests to send to the LLM API in the test.

The following script shows an example of how to invoke the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.

python3 ${LLM_PERF_SCRIPT_DIR}/token_benchmark_ray.py \
--model "bedrock/llama/{model_id}" \
--mean-input-tokens {mean_input_tokens} \
--stddev-input-tokens {stddev_input_tokens} \
--mean-output-tokens {mean_output_tokens} \
--stddev-output-tokens {stddev_output_tokens} \
--max-num-completed-requests ${LLM_PERF_MAX_REQUESTS} \
--timeout 1800 \
--num-concurrent-requests ${LLM_PERF_CONCURRENT} \
--results-dir "${LLM_PERF_OUTPUT}" \
--llm-api litellm \
--additional-sampling-params '{}'

At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
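As a rough sketch of reading the aggregate file (the file name below is an assumption; LLMPerf derives it from the model name and writes it to the directory passed via --results-dir):

import json

# Assumed path; look for the *_summary.json file in your results directory.
with open("results/summary.json") as f:
    summary = json.load(f)

# Print any time-to-first-token aggregate metrics found in the summary.
for key, value in summary.items():
    if "ttft" in key:
        print(key, value)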

Scale to zero and cold-start latency

Keep in mind that because Amazon Bedrock Custom Model Import scales down to zero when the model is unused, you need to first make a request to make sure that there is at least one active model copy. If you get an error indicating that the model isn't ready, wait approximately 10 seconds, and up to 1 minute, for Amazon Bedrock to prepare at least one active model copy. When it's ready, run a test invocation again and proceed with benchmarking.

Example scenario for DeepSeek-R1-Distill-Llama-8B

Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust parameters for token count for prompts and completions. For example:

  • Number of clients: 2
  • Mean input token count: 500
  • Standard deviation input token count: 25
  • Mean output token count: 1000
  • Standard deviation output token count: 100
  • Number of requests per client: 50

This illustrative test takes approximately 8 minutes. At the end of the test, you will obtain a summary of results of aggregate metrics:

inter_token_latency_s
    p25 = 0.010615988283217918
    p50 = 0.010694698716183695
    p75 = 0.010779359342088015
    p90 = 0.010945443657517748
    p95 = 0.01100556307365132
    p99 = 0.011071086908721675
    mean = 0.010710014800224604
    min = 0.010364670612635254
    max = 0.011485444453299149
    stddev = 0.0001658793389904756
ttft_s
    p25 = 0.3356793452499005
    p50 = 0.3783651359990472
    p75 = 0.41098671700046907
    p90 = 0.46655246950049334
    p95 = 0.4846706690498647
    p99 = 0.6790834719300077
    mean = 0.3837810468001226
    min = 0.1878921090010408
    max = 0.7590946710006392
    stddev = 0.0828713133225014
end_to_end_latency_s
    p25 = 9.885957818500174
    p50 = 10.561580732000039
    p75 = 11.271923759749825
    p90 = 11.87688222009965
    p95 = 12.139972019549713
    p99 = 12.6071144856102
    mean = 10.406450886010116
    min = 2.6196457750011177
    max = 12.626598834998731
    stddev = 1.4681851822617253
request_output_throughput_token_per_s
    p25 = 104.68609252502657
    p50 = 107.24619111072519
    p75 = 108.62997591951486
    p90 = 110.90675007239598
    p95 = 113.3896235445618
    p99 = 116.6688412475626
    mean = 107.12082450567561
    min = 97.0053466021563
    max = 129.40680882698936
    stddev = 3.9748004356837137
number_input_tokens
    p25 = 484.0
    p50 = 500.0
    p75 = 514.0
    p90 = 531.2
    p95 = 543.1
    p99 = 569.1200000000001
    mean = 499.06
    min = 433
    max = 581
    stddev = 26.549294727074212
number_output_tokens
    p25 = 1050.75
    p50 = 1128.5
    p75 = 1214.25
    p90 = 1276.1000000000001
    p95 = 1323.75
    p99 = 1372.2
    mean = 1113.51
    min = 339
    max = 1392
    stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034

In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports like the following histograms for time to first token and token throughput.
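A minimal plotting sketch, assuming the per-request file has been loaded as a list of dictionaries with a ttft_s field (the file name is an assumption; the field name matches the aggregate output above):

import json
import matplotlib.pyplot as plt

# Assumed path for the per-request results file written by LLMPerf.
with open("results/individual_responses.json") as f:
    requests = json.load(f)

# Histogram of time to first token across all completed requests.
ttft_values = [r["ttft_s"] for r in requests if "ttft_s" in r]
plt.hist(ttft_values, bins=20)
plt.xlabel("Time to first token (s)")
plt.ylabel("Number of requests")
plt.title("TTFT distribution")
plt.show()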

Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch

LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end user experience of your application.

In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot for the ModelCopy metric is shown in the figure below. This data will assist in estimating costs, because billing is based on the number of active model copies at a given time.
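As an illustrative sketch of pulling that metric with boto3 (the dimension name below is an assumption; confirm it in the CloudWatch console for your imported model):

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Maximum number of active model copies over the last hour of the load test.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ImportedModelArn", "Value": "<your-imported-model-arn>"}],  # assumed dimension name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])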

Conclusion

While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance, and compare models across key metrics such as cost, latency, and throughput.

To learn more, try the example notebook with your custom model.


About the Authors

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.

Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He’s passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master’s degree from NYU Tandon School of Engineering. While not at work, he enjoys traveling and playing with his dogs.

Creating asynchronous AI agents with Amazon Bedrock

The integration of generative AI agents into business processes is poised to accelerate as organizations recognize the untapped potential of these technologies. Advancements in multimodal artificial intelligence (AI), where agents can understand and generate not just text but also images, audio, and video, will further broaden their applications. This post discusses agentic AI-driven architecture and ways of implementing it.

The emergence of generative AI agents in recent years has contributed to the transformation of the AI landscape, driven by advances in large language models (LLMs) and natural language processing (NLP). Companies like Anthropic, Cohere, and Amazon have made significant strides in developing powerful language models capable of understanding and generating human-like content across multiple modalities, revolutionizing how businesses integrate and utilize artificial intelligence in their processes.

These AI agents have demonstrated remarkable versatility, being able to perform tasks ranging from creative writing and code generation to data analysis and decision support. Their ability to engage in intelligent conversations, provide context-aware responses, and adapt to diverse domains has revolutionized how businesses approach problem-solving, customer service, and knowledge dissemination.

One of the most significant impacts of generative AI agents has been their potential to augment human capabilities through both synchronous and asynchronous patterns. In synchronous orchestration, just like in traditional process automation, a supervisor agent orchestrates the multi-agent collaboration, maintaining a high-level view of the entire process while actively directing the flow of information and tasks. This approach allows businesses to offload repetitive and time-consuming tasks in a controlled, predictable manner.

Alternatively, asynchronous choreography follows an event-driven pattern where agents operate autonomously, triggered by events or state changes in the system. In this model, agents publish events or messages that other agents can subscribe to, creating a workflow that emerges from their collective behavior. These patterns have proven particularly valuable in enhancing customer experiences, where agents can provide round-the-clock support, resolve issues promptly, and deliver personalized recommendations through either orchestrated or event-driven interactions, leading to increased customer satisfaction and loyalty.

Agentic AI architecture

Agentic AI architecture is a shift in process automation that brings the capabilities of AI to autonomous agents, with the purpose of imitating cognitive abilities and enhancing the actions of traditional autonomous agents. This architecture can enable businesses to streamline operations, enhance decision-making processes, and automate complex tasks in new ways.

Much like traditional business process automation through technology, agentic AI architecture is the design of AI systems intended to resolve complex problems with limited or indirect human intervention. These systems are composed of multiple AI agents that converse with each other or execute complex tasks through a series of choreographed or orchestrated processes. This approach empowers AI systems to exhibit goal-directed behavior, learn from experience, and adapt to changing environments.

The difference between a single agent invocation and a multi-agent collaboration lies in the complexity and the number of agents involved in the process.

When you interact with a digital assistant like Alexa, you’re typically engaging with a single agent, also known as a conversational agent. This agent processes your request, such as setting a timer or checking the weather, and provides a response without needing to consult other agents.

Now, imagine expanding this interaction to include multiple agents working together. Let’s start with a simple travel booking scenario:

Your interaction begins with telling a travel planning agent about your desired trip. In this first step, the AI model, in this case an LLM, is acting as an interpreter and user experience interface between your natural language input and the structured information needed by the travel planning system. It’s processing your request, which might be a complex statement like “I want to plan a week-long beach vacation in Hawaii for my family of four next month,” and extracting key details such as the destination, duration, number of travelers, and approximate dates.

The LLM is also likely to infer additional relevant information that wasn’t explicitly stated, such as the need for family-friendly accommodations or activities. It might ask follow-up questions to clarify ambiguous points or gather more specific preferences. Essentially, the LLM is transforming your casual, conversational input into a structured set of travel requirements that can be used by the specialized booking agents in the subsequent steps of the workflow.

This initial interaction sets the foundation for the entire multi-agent workflow, making sure that the travel planning agent has a clear understanding of your needs before engaging other specialized agents.

By adding another agent, the flight booking agent, the travel planning agent can call upon it to find suitable flights. The travel planning agent needs to provide the flight booking agent with relevant information (dates, destinations), then wait for and process the flight booking agent’s response to incorporate the flight options into its overall plan.

Now, let’s add another agent to the workflow: a hotel booking agent to support finding accommodations. With this addition, the travel planning agent must also communicate with the hotel booking agent, which needs to make sure that the hotel dates align with the flight dates and provide the information back to the overall plan so that it includes both flight and hotel options.

As we continue to add agents, such as a car rental agent or a local activities agent, each new addition receives relevant information from the travel planning agent, performs its specific task, and returns its results to be incorporated into the overall plan. The travel planning agent acts not only as the user experience interface, but also as a coordinator, deciding when to involve each specialized agent and how to combine their inputs into a cohesive travel plan.

This multi-agent workflow allows for more complex tasks to be accomplished by taking advantage of the specific capabilities of each agent. The system remains flexible, because agents can be added or removed based on the specific needs of each request, without requiring significant changes to the existing agents and minimal change to the overall workflow.

For more on the benefits of breaking tasks into agents, see How task decomposition and smaller LLMs can make AI more affordable.

Process automation with agentic AI architecture

The preceding scenario, just like in traditional process automation, is a common orchestration pattern, where the multi-agent collaboration is orchestrated by a supervisor agent. The supervisor agent acts like a conductor leading an orchestra, telling each instrument when to play and how to harmonize with others. For this approach, Amazon Bedrock Agents enables generative AI applications to execute multi-step tasks orchestrated by an agent and to create multi-agent collaborations that solve complex tasks. This is done by designating an Amazon Bedrock agent as a supervisor agent and associating one or more collaborator agents with it. For more details, read about creating and configuring Amazon Bedrock Agents and Use multi-agent collaboration with Amazon Bedrock Agents.

The following diagram illustrates the supervisor agent methodology.

Supervisor agent methodology

Following traditional process automation patterns, the other end of the spectrum to synchronous orchestration would be asynchronous choreography: an asynchronous event-driven multi-agent workflow. In this approach, there would be no central orchestrating agent (supervisor). Agents operate autonomously where actions are triggered by events or changes in a system’s state and agents publish events or messages that other agents can subscribe to. In this approach, the workflow emerges from the collective behavior of the agents reacting to events asynchronously. It’s more like a jazz improvisation, where each musician responds to what others are playing without a conductor. The following diagram illustrates this event-driven workflow.

Event-driven workflow methodology

The event-driven pattern in asynchronous systems operates without predefined workflows, creating a dynamic and potentially chaotic processing environment. While agents subscribe to and publish messages through a central event hub, the flow of processing is determined organically by the message requirements and the available subscribed agents. Although the resulting pattern may resemble a structured workflow when visualized, it’s important to understand that this is emergent behavior rather than orchestrated design. The absence of centralized workflow definitions means that message processing occurs naturally based on publication timing and agent availability, creating a fluid and adaptable system that can evolve with changing requirements.

The choice between synchronous orchestration and asynchronous event-driven patterns fundamentally shapes how agentic AI systems operate and scale. Synchronous orchestration, with its supervisor agent approach, provides precise control and predictability, making it ideal for complex processes requiring strict oversight and sequential execution. This pattern excels in scenarios where the workflow needs to be tightly managed, audited, and debugged. However, it can create bottlenecks as all operations must pass through the supervisor agent. Conversely, asynchronous event-driven systems offer greater flexibility and scalability through their distributed nature. By allowing agents to operate independently and react to events in real-time, these systems can handle dynamic scenarios and adapt to changing requirements more readily. While this approach may introduce more complexity in tracking and debugging workflows, it excels in scenarios requiring high scalability, fault tolerance, and adaptive behavior. The decision between these patterns often depends on the specific requirements of the system, balancing the need for control and predictability against the benefits of flexibility and scalability.

Getting the best of both patterns

You can use a single agent to route messages to other agents based on the context of the event data (message) at runtime, with no prior knowledge of the downstream agents and without having to rely on each agent subscribing to an event hub. This is traditionally known as the message broker or event broker pattern, which for the purposes of this post we will call an agent broker pattern, to represent brokering of messages to AI agents. The agent broker pattern is a hybrid approach that combines elements of both centralized synchronous orchestration and distributed asynchronous event-driven systems.

The key to this pattern is that a single agent acts as a central hub for message distribution but doesn’t control the entire workflow. The broker agent determines where to send each message based on its content or metadata, making routing decisions at runtime. The processing agents are decoupled from each other and from the message source, only interacting with the broker to receive messages. The agent broker pattern is different from the supervisor pattern because it routes a message to an agent without awaiting a response, rather than waiting on responses from collaborating agents. The following diagram illustrates the agent broker methodology.

Agent broker methodology

Following an agent broker pattern, the system is still fundamentally event-driven, with actions triggered by the arrival of messages. New agents can be added to handle specific types of messages without changing the overall system architecture. How to implement this type of pattern is explained later in this post.

This pattern is often used in enterprise messaging systems, microservices architectures, and complex event processing systems. It provides a balance between the structure of orchestrated workflows and the flexibility of pure event-driven systems.

Agentic architecture with the Amazon Bedrock Converse API

Traditionally, we might have had to sacrifice some flexibility in the broker pattern by having to update the routing logic in the broker when adding additional processes (agents) to the architecture. This is, however, not the case when using the Amazon Bedrock Converse API. With the Converse API, we can call a tool to complete an Amazon Bedrock model response. The only change is the additional agent added to the collaboration stored as configuration outside of the broker.

To let a model use a tool to complete a response for a message, the message and the definitions for one or more tools (agents) are sent to the model. If the model determines that one of the tools can help generate a response, it returns a request to use the tool.

AWS AppConfig, a capability of AWS Systems Manager, is used to store each agent’s tool context data as a single configuration in a managed data store, to be sent with the Converse API tool request. AWS Lambda acts as the message broker, receiving all messages and sending requests to the Converse API with the tool context stored in AWS AppConfig. This allows additional agents to be added to the system without updating the routing logic: agents are ‘registered’ as tool context in the AWS AppConfig configuration, which Lambda reads at runtime when an event message is received. For more information about when to use AWS AppConfig, see AWS AppConfig use cases.
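A minimal sketch of such a brokering Lambda function follows; the AppConfig identifiers, model ID, and tool-context schema are illustrative assumptions, not the exact implementation described in this post.

import json
import boto3

appconfig = boto3.client("appconfigdata")
bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    # 1. Read the registered agents' tool context from AWS AppConfig.
    #    (A production broker would cache the session and rotate tokens.)
    token = appconfig.start_configuration_session(
        ApplicationIdentifier="agent-broker",          # assumed names
        EnvironmentIdentifier="prod",
        ConfigurationProfileIdentifier="agent-tool-context",
    )["InitialConfigurationToken"]
    config = appconfig.get_latest_configuration(ConfigurationToken=token)
    tool_context = json.loads(config["Configuration"].read())

    # 2. Ask a tools-compatible model which agent(s) should handle the message.
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # any tools-compatible model
        messages=[{"role": "user", "content": [{"text": json.dumps(event["detail"])}]}],
        toolConfig={"tools": tool_context["tools"]},
    )

    # 3. Hand the event message to the ingress of each recommended tool (agent).
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            send_to_agent_ingress(block["toolUse"]["name"], event["detail"])

def send_to_agent_ingress(agent_name, message):
    # Placeholder: look up the agent's ingress (for example, an SQS queue URL)
    # from the tool context and forward the message without awaiting a response.
    print(f"Routing message to {agent_name}")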

Implementing the agent broker pattern

The following diagram demonstrates how Amazon EventBridge and Lambda act as a central message broker, with the Amazon Bedrock Converse API to let a model use a tool in a conversation to dynamically route messages to appropriate AI agents.

Agent broker architecture

Messages sent to EventBridge are routed through an EventBridge rule to Lambda. There are three tasks the EventBridge Lambda function performs as the agent broker:

  1. Query AWS AppConfig for all agents’ tool context. An agent tool context is a description of the agent’s capability along with the Amazon Resource Name (ARN) or URL of the agent’s message ingress.
  2. Provide the agent tool context along with the inbound event message to the Amazon Bedrock LLM through the Converse API (in this example, an Amazon Bedrock tools-compatible LLM). The LLM compares the event message context with the agent tool context and returns a response to the requesting Lambda function containing the recommended tool or tools that should be used to process the message.
  3. Receive the response from the Converse API request containing one or more tools that should be called to process the event message, and hand the event message to the ingress of the recommended tools.

In this example, the architecture demonstrates brokering messages asynchronously to an Amazon SageMaker based agent, an Amazon Bedrock agent, and an external third-party agent, all from the same agent broker.

Although the brokering Lambda function could connect directly to the SageMaker or Amazon Bedrock agent API, the architecture provides for adaptability and scalability in message throughput by allowing messages from the agent broker to be queued, in this example with Amazon Simple Queue Service (Amazon SQS), and processed according to the capability of the receiving agent. For adaptability, the Lambda function subscribed to the agent ingress queue provides additional system prompts (pre-prompting of the LLM for specific tool context), message formatting, and the functions required for the expected input and output of the agent request.

To add new agents to the system, the only integration requirements are to update AWS AppConfig with the new agent tool context (a description of the agent’s capability and its ingress endpoint) and to make sure the brokering Lambda function has permissions to write to the agent ingress endpoint.

Agents can be added to the system without rewriting the Lambda function or performing integration work that requires downtime; the new agent can be used on the next instantiation of the brokering Lambda function.

Implementing the supervisor pattern with an agent broker

Building upon the agent broker pattern, the architecture can be extended to handle more complex, stateful interactions. Although the broker pattern effectively uses AWS AppConfig and Amazon Bedrock’s Converse API tool use capability for dynamic routing, its unidirectional nature has limitations. Events flow in and are distributed to agents, but complex scenarios like travel booking require maintaining context across multiple agent interactions. This is where the supervisor pattern provides additional capabilities without compromising the flexible routing we achieved with the broker pattern.

Consider the travel booking example again: there is the broker agent and several task-based agents that events will be pushed to. When processing a request like “Book a 3-night trip to Sydney from Melbourne during the first week of September for 2 people”, we encounter several challenges. Although this statement contains clear intent, it lacks critical details that the agents might need, such as:

  1. Specific travel dates
  2. Accommodation preferences and room configurations

The broker pattern alone can’t effectively manage these information gaps while maintaining context between agent interactions. This is where adding the capability of a supervisor to the broker agent provides:

  • Contextual awareness between events and agent invocations
  • Bi-directional information flow capabilities

The following diagram illustrates the supervisor pattern workflow.

Supervisor pattern architecture

When a new event enters the system, the workflow initiates the following steps:

  1. The event is assigned a unique identifier for tracking
  2. The supervisor performs the following actions:
    • Evaluates which agents to invoke (brokering)
    • Creates a new state record with the identifier and timestamp
    • Provides this contextual information to the selected agents along with their invocation parameters
  3. Agents process their tasks and emit ‘task completion’ events back to EventBridge
  4. The supervisor performs the following actions:
    • Collects and processes completed events
    • Evaluates the combined results and context
    • Determines if additional agent invocations are needed
    • Continues this cycle until all necessary actions are completed

This pattern handles scenarios where agents might return varying results or request additional information. The supervisor can either:

  • Derive missing information from other agent responses
  • Request additional information from the source
  • Coordinate with other agents to resolve information gaps

To handle information gaps without architectural modifications, we can introduce an answers agent to the existing system. This agent operates within the same framework as other agents, but specializes in context resolution. When agents report incomplete information or require clarification, the answers agent can:

  • Process queries about missing information
  • Emit task completion events with enhanced context
  • Allow the supervisor to resume workflow execution with newly available information, the same way that it would after another agent emits its task-completion event.

This enhancement enables complex, multi-step workflows while maintaining the system’s scalability and flexibility. The supervisor can manage dependencies between agents, handle partial completions, and make sure that the necessary information is gathered before finalizing tasks.

Implementation considerations:

Implementing the supervisor pattern on top of the existing broker agent architecture provides the advantages of both the broker pattern and the complex state management of orchestration. The state management can be handled through Amazon DynamoDB, while maintaining the use of EventBridge for event routing and AWS AppConfig for agent configuration. The Amazon Bedrock Converse API continues to play a crucial role in agent selection, but now with added context from the supervisor’s state management. This allows you to preserve the dynamic routing capabilities established with the broker pattern while adding the sophisticated workflow management needed for complex, multi-step processes.
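As a minimal sketch of that state management (the DynamoDB table name and attribute names are assumptions for illustration):

import uuid
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table; partition key assumed to be "event_id".
state_table = dynamodb.Table("agent-supervisor-state")

def create_state_record(invoked_agents):
    """Create the tracking record the supervisor writes when it brokers a new event."""
    event_id = str(uuid.uuid4())
    state_table.put_item(
        Item={
            "event_id": event_id,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "pending_agents": invoked_agents,   # agents still working
            "completed_agents": [],             # filled in as task-completion events arrive
            "status": "IN_PROGRESS",
        }
    )
    return event_id

def record_task_completion(event_id, agent_name, result):
    """Move an agent from pending to completed and store its result."""
    record = state_table.get_item(Key={"event_id": event_id})["Item"]
    record["pending_agents"].remove(agent_name)
    record["completed_agents"].append({"agent": agent_name, "result": result})
    if not record["pending_agents"]:
        record["status"] = "COMPLETE"
    state_table.put_item(Item=record)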

Conclusion

Agentic AI architecture, powered by Amazon Bedrock and AWS services, represents a leap forward in the evolution of automated AI systems. By combining the flexibility of event-driven systems with the power of generative AI, this architecture enables businesses to create more adaptive, scalable, and intelligent automated processes. The agent broker pattern offers a robust solution for dynamically routing complex tasks to specialized AI agents, and the agent supervisor pattern extends these capabilities to handle sophisticated, context-aware workflows.

These patterns take advantage of the strengths of the Amazon Bedrock Converse API, Lambda, EventBridge, and AWS AppConfig to create a flexible and extensible system. The broker pattern excels at dynamic routing and seamless agent integration, while the supervisor pattern adds crucial state management and contextual awareness for complex, multi-step processes. Together, they provide a comprehensive framework for building sophisticated AI systems that can handle both simple routing and complex, stateful interactions.

This architecture not only streamlines operations, but also opens new possibilities for innovation and efficiency across various industries. Whether implementing simple task routing or orchestrating complex workflows requiring maintained context, organizations can build scalable, maintainable AI systems that evolve with their needs while maintaining operational stability.

To get started with an agentic AI architecture, consider the following next steps:

  • Explore Amazon Bedrock – If you haven’t already, sign up for Amazon Bedrock and experiment with its powerful generative AI models and APIs. Familiarize yourself with the Converse API and its tool use capabilities.
  • Prototype your own agent broker – Use the architecture outlined in this post as a starting point to build a proof-of-concept agent broker system tailored to your organization’s needs. Start small with a few specialized agents and gradually expand.
  • Identify use cases – Analyze your current business processes to identify areas where an agentic AI architecture could drive significant improvements. Consider complex, multi-step tasks that could benefit from AI assistance.
  • Stay informed – Keep up with the latest developments in AI and cloud technologies. AWS regularly updates its offerings, so stay tuned for new features that could enhance your agentic AI systems.
  • Collaborate and share – Join AI and cloud computing communities to share your experiences and learn from others. Consider contributing to open-source projects or writing about your implementation to help advance the field.
  • Invest in training – Make sure your team has the necessary skills to work with these advanced AI technologies. Consider AWS training and certification programs to build expertise in your organization.

By embracing an agentic AI architecture, you’re not just optimizing your current processes – you’re positioning your organization at the forefront of the AI revolution. Start your journey today and unlock the full potential of AI-driven automation for your business.


About the Authors

Aaron Sempf is Next Gen Tech Lead for the AWS Partner Organization in Asia-Pacific and Japan. With over 20 years in distributed system engineering design and development, he focuses on solving for large scale complex integration and event driven systems. In his spare time, he can be found coding prototypes for autonomous robots, IoT devices, distributed solutions, and designing agentic architecture patterns for generative AI assisted business automation.

Joshua Toth is a Senior Prototyping Engineer with over a decade of experience in software engineering and distributed systems. He specializes in solving complex business challenges through technical prototypes, demonstrating the art of the possible. With deep expertise in proof of concept development, he focuses on bridging the gap between emerging technologies and practical business applications. In his spare time, he can be found developing next-generation interactive demonstrations and exploring cutting-edge technological innovations.

Sara van de Moosdijk, simply known as Moose, is an AI/ML Specialist Solution Architect at AWS. She helps AWS customers and partners build and scale AI/ML solutions through technical enablement, support, and architectural guidance. Moose spends her free time figuring out how to fit more books in her overflowing bookcase.

How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many of the publicly available chat models, based on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection can support over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and on Amazon SageMaker, using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face along with the Optimum Neuron cache will transparently supply a compiled model when available. If you’re using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
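If compilation is needed, a rough sketch using the Optimum Neuron library might look like the following; the batch size, sequence length, core count, and cast type are illustrative and should match how you plan to serve the model.

from optimum.neuron import NeuronModelForCausalLM

# Compile (export) the model for Neuron; the arguments below are illustrative.
compiler_args = {"num_cores": 2, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 4, "sequence_length": 4096}

model = NeuronModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    export=True,
    **compiler_args,
    **input_shapes,
)

# Save the compiled artifacts so TGI can load them from a local path.
model.save_pretrained("./exportedmodel")

The compiled artifacts can then be served from a local path, similar to the commented-out /data/exportedmodel entry in the .env file shown in Option 1 below.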

You can deploy TGI as a docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.

  1. Create a .env file with the following content:
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel' 
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

  2. Create a file named docker-compose.yaml with the following content:
version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} #only needed for gated models
    volumes:
      - $PWD:/data #can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"
  3. Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up

  4. To confirm that the model deployed correctly, send a test prompt to the model:
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"Tell me about AWS.",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'
  5. To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
#"Tell me how to open an AWS account"
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"告诉我如何开设 AWS 账户。",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker using instructions on the Hugging Face Model Hub.

  1. From the Qwen 2.5 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

How to deploy the model on Amazon SageMaker

How to find the code you'll need to deploy the model using AWS Inferentia and Trainium

  2. Copy the example code into a SageMaker notebook, then choose Run.
  3. The notebook you copied will look like the following:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.
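Alternatively, a minimal boto3 sketch for terminating the instance (the instance ID is a placeholder):

import boto3

ec2 = boto3.client("ec2")
# Replace with the ID of the Inf2 instance you launched for this walkthrough.
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])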

Terminate a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solution architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.

Revolutionizing customer service: MaestroQA’s integration with Amazon Bedrock for actionable insight

This post is cowritten with Harrison Hunter, CTO and co-founder of MaestroQA.

MaestroQA augments call center operations by empowering the quality assurance (QA) process and customer feedback analysis to increase customer satisfaction and drive operational efficiencies. They assist with operations such as QA reporting, coaching, workflow automations, and root cause analysis.

In this post, we dive deeper into one of MaestroQA’s key features—conversation analytics, which helps support teams uncover customer concerns, address points of friction, adapt support workflows, and identify areas for coaching through the use of Amazon Bedrock. We discuss the unique challenges MaestroQA overcame and how they use AWS to build new features, drive customer insights, and improve operational inefficiencies.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies, such as AI21 Labs, Anthropic, Cohere, Meta, Mistral, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

The opportunity for open-ended conversation analysis at enterprise scale

MaestroQA serves a diverse clientele across various industries, including ecommerce, marketplaces, healthcare, talent acquisition, insurance, and fintech. All of these customers have a common challenge: the need to analyze a high volume of interactions with their customers. Analyzing these customer interactions is crucial to improving their products and customer support, increasing customer satisfaction, and identifying key industry signals. However, customer interaction data such as call center recordings, chat messages, and emails is highly unstructured and requires advanced processing techniques to accurately and automatically extract insights.

When customers receive incoming calls at their call centers, MaestroQA employs its proprietary transcription technology, built by enhancing open source transcription models, to transcribe the conversations. After the data is transcribed, MaestroQA uses technology they have developed in combination with AWS services such as Amazon Comprehend to run various types of analysis on the customer interaction data. For example, MaestroQA offers sentiment analysis for customers to identify the sentiment of their end customer during the support interaction, enabling MaestroQA’s customers to sort their interactions and manually inspect the best or worst interactions. MaestroQA also offers a logic/keyword-based rules engine for classifying customer interactions based on other factors such as timing or process steps including metrics like Average Handle Time (AHT), compliance or process checks, and SLA adherence.

MaestroQA’s customers love these analysis features because they allow them to continuously improve the quality of their support and identify areas where they can improve their product to better satisfy their end customers. However, they were also interested in more advanced analysis, such as asking open-ended questions like “How many times did the customer ask for an escalation?” MaestroQA’s existing rules engine couldn’t always answer these types of queries because end-users could ask for the same outcome in many different ways. For example, “Can I speak to your manager?” and “I would like to speak to someone higher up” don’t share the same keywords, but are both asking for an escalation. MaestroQA needed a way to accurately classify customer interactions based on open-ended questions.

MaestroQA faced an additional hurdle: the immense scale of customer interactions their clients manage. With clients handling anywhere from thousands to millions of customer engagements monthly, there was a pressing need for comprehensive analysis of support team performance across this vast volume of interactions. Consequently, MaestroQA had to develop a solution capable of scaling to meet their clients’ extensive needs.

To start developing this product, MaestroQA first rolled out a product called AskAI. AskAI allowed customers to run open-ended questions on a targeted list of up to 1,000 conversations. For example, a customer might use MaestroQA’s filters to find customer interactions in Oregon within the past two months and then run a root cause analysis query such as “What are customers frustrated about in Oregon?” to find churn risk anecdotes. Their customers really liked this feature and surprised MaestroQA with the breadth of use cases they covered, including analyzing marketing campaigns, service issues, and product opportunities. Customers started to request the ability to run this type of analysis across all of their transcripts, which could number in the millions, so they could quantify the impact of what they were seeing and find instances of important issues.

Solution overview

MaestroQA decided to use Amazon Bedrock to address their customers’ need for advanced analysis of customer interaction transcripts. Amazon Bedrock’s broad choice of FMs from leading AI companies, along with its scalability and security features, made it an ideal solution for MaestroQA.

MaestroQA integrated Amazon Bedrock into their existing architecture using Amazon Elastic Container Service (Amazon ECS). The customer interaction transcripts are stored in an Amazon Simple Storage Service (Amazon S3) bucket.

The following architecture diagram demonstrates the request flow for AskAI. When a customer submits an analysis request through MaestroQA’s web application, an ECS cluster retrieves the relevant transcripts from Amazon S3, cleans and formats them into a prompt, sends the prompt to Amazon Bedrock for analysis using the customer’s selected FM, and stores the results in a database hosted on Amazon Elastic Compute Cloud (Amazon EC2), where they can be retrieved by MaestroQA’s frontend web application.

Solution architecture
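
The sketch below illustrates the Bedrock call at the center of this flow. It is a simplified stand-in for MaestroQA’s service code, assuming the Amazon Bedrock Converse API, a placeholder model ID, and transcripts already loaded from S3.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Transcripts would be fetched from the S3 bucket by the ECS task; shortened here for illustration
transcripts = ["Customer: Can I speak to your manager? Agent: ..."]
question = "How many times did the customer ask for an escalation?"

prompt = f"{question}\n\nTranscripts:\n" + "\n---\n".join(transcripts)

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder; customers pick their preferred FM
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])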

MaestroQA offers their customers the flexibility to choose from multiple FMs available through Amazon Bedrock, including Anthropic’s Claude 3.5 Sonnet, Anthropic’s Claude 3 Haiku, Mistral 7B and Mixtral 8x7B, Cohere’s Command R and R+, and Meta’s Llama 3.1 models. This allows customers to select the model that best suits their specific use case and requirements.

The following screenshot shows how the AskAI feature allows MaestroQA’s customers to use the wide variety of FMs available on Amazon Bedrock to ask open-ended questions such as “What are some of the common issues in these tickets?” and generate useful insights from customer service interactions.

Product screenshot

To handle the high volume of customer interaction transcripts and provide low-latency responses, MaestroQA takes advantage of the cross-Region inference capabilities of Amazon Bedrock. Originally, they were doing the load balancing themselves, distributing requests between available AWS US Regions (us-east-1, us-west-2, and so on) and available EU Regions (eu-west-3, eu-central-1, and so on) for their North American and European customers, respectively. Now, the cross-Region inference capability of Amazon Bedrock enables MaestroQA to achieve twice the throughput compared to single-Region inference, a critical factor in scaling their solution to accommodate more customers. MaestroQA’s team no longer has to spend time and effort to predict their demand fluctuations, which is especially key when usage increases for their ecommerce customers around the holiday season. Cross-Region inference dynamically routes traffic across multiple Regions, providing optimal availability for each request and smoother performance during these high-usage periods. MaestroQA monitors this setup’s performance and reliability using Amazon CloudWatch.
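
Switching from single-Region to cross-Region inference is largely a matter of invoking a geography-prefixed inference profile instead of a base model ID. The snippet below is a sketch, not MaestroQA’s code, and the profile ID shown is an assumption based on Amazon Bedrock’s naming convention.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# A cross-Region inference profile (geography prefix such as "us.") lets Bedrock
# route each request to an available Region automatically, instead of pinning it
# to a single-Region base model ID.
response = bedrock.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative profile ID
    messages=[{"role": "user", "content": [{"text": "Summarize this support interaction: ..."}]}],
)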

Benefits: How Amazon Bedrock added value

Amazon Bedrock has enabled MaestroQA to innovate faster and gain a competitive advantage by offering their customers powerful generative AI features for analyzing customer interaction transcripts. With Amazon Bedrock, MaestroQA can now provide their customers with the ability to run open-ended queries across millions of transcripts, unlocking valuable insights that were previously inaccessible.

The broad choice of FMs available through Amazon Bedrock allows MaestroQA to cater to their customers’ diverse needs and preferences. Customers can select the model that best aligns with their specific use case, finding the right balance between performance and price.

The scalability and cross-Region inference capabilities of Amazon Bedrock enable MaestroQA to handle high volumes of customer interaction transcripts while maintaining low latency, regardless of their customers’ geographical locations.

MaestroQA takes advantage of the robust security features and ethical AI practices of Amazon Bedrock to bolster customer confidence. These measures help ensure that client data remains secure during processing and isn’t used for model training by third-party providers. Additionally, the availability of Amazon Bedrock in Europe, coupled with its geographic control capabilities, allows MaestroQA to seamlessly extend AI services to European customers. This expansion is achieved without introducing additional complexity, maintaining operational efficiency while adhering to regional data regulations.

The adoption of Amazon Bedrock proved to be a game changer for MaestroQA’s compact development team. Its serverless architecture allowed the team to rapidly prototype and refine their application without the burden of managing complex hardware infrastructure. This shift enabled MaestroQA to channel their efforts into optimizing application performance rather than grappling with resource allocation. Moreover, Amazon Bedrock offers seamless compatibility with their existing AWS environment, allowing for a smooth integration process and further streamlining their development workflow. MaestroQA was able to use their existing authentication process with AWS Identity and Access Management (IAM) to securely authenticate their application to invoke large language models (LLMs) within Amazon Bedrock. They were also able to use the familiar AWS SDK to quickly and effortlessly integrate Amazon Bedrock into their application.

Overall, by using Amazon Bedrock, MaestroQA is able to provide their customers with a powerful and flexible solution for extracting valuable insights from their customer interaction data, driving continuous improvement in their products and support processes.

Success metrics

The early results have been remarkable.

A lending company uses MaestroQA to detect compliance risks on 100% of their conversations. Before, agents would raise internal escalations if a consumer complained about the loan or expressed being in a vulnerable state. However, this process was manual and error prone, and the lending company would miss many of these risks. Now, they are able to detect compliance risks with almost 100% accuracy.

A medical device company that is required to report device issues to the FDA no longer relies solely on agents to internally report customer-reported issues; instead, it uses this service to analyze all of its conversations and make sure every complaint is flagged.

An education company has been able to replace their manual survey scores with an automated customer sentiment score that increased their sample size from 15% to 100% of conversations.

The best is yet to come.

Conclusion

Using AWS, MaestroQA was able to innovate faster and gain a competitive advantage. Companies from different industries such as financial services, healthcare and life sciences, and EdTech all share the common desire to provide better customer services for their clients. MaestroQA was able to enable them to do that by quickly pivoting to offer powerful generative AI features that solved tangible business problems and enhanced overall compliance.

Check out MaestroQA’s feature AskAI and their LLM-powered AI Classifiers if you’re interested in better understanding your customer conversations and survey scores. For more about Amazon Bedrock, see Get started with Amazon Bedrock and learn about features such as cross-Region inference to help scale your generative AI features globally.


About the Authors

Carole Suarez is a Senior Solutions Architect at AWS, where she helps guide startups through their cloud journey. Carole specializes in data engineering and holds an array of AWS certifications on a variety of topics including analytics, AI, and security. She is passionate about learning languages and is fluent in English, French, and Tagalog.

Ben Gruher is a Generative AI Solutions Architect at AWS, focusing on startup customers. Ben graduated from Seattle University where he obtained bachelor’s and master’s degrees in Computer Science and Data Science.

Harrison Hunter is the CTO and co-founder of MaestroQA where he leads the engineer and product teams. Prior to MaestroQA, Harrison studied computer science and AI at MIT.

Read More

Optimize hosting DeepSeek-R1 distilled models with Hugging Face TGI on Amazon SageMaker AI

Optimize hosting DeepSeek-R1 distilled models with Hugging Face TGI on Amazon SageMaker AI

DeepSeek-R1, developed by AI startup DeepSeek AI, is an advanced large language model (LLM) distinguished by its innovative, multi-stage training process. Instead of relying solely on traditional pre-training and fine-tuning, DeepSeek-R1 integrates reinforcement learning to achieve more refined outputs. The model employs a chain-of-thought (CoT) approach that systematically breaks down complex queries into clear, logical steps. Additionally, it uses NVIDIA’s parallel thread execution (PTX) constructs to boost training efficiency, and a combined framework of supervised fine-tuning (SFT) and group relative policy optimization (GRPO) makes sure its results are both transparent and interpretable.

In this post, we demonstrate how to optimize hosting DeepSeek-R1 distilled models with Hugging Face Text Generation Inference (TGI) on Amazon SageMaker AI.

Model Variants

The current DeepSeek model collection consists of the following models:

  • DeepSeek-V3 – An LLM that uses a Mixture-of-Experts (MoE) architecture. MoE models like DeepSeek-V3 and Mixtral replace the standard feed-forward neural network in transformers with a set of parallel sub-networks called experts. These experts are selectively activated for each input, allowing the model to efficiently scale to a much larger size without a corresponding increase in computational cost. For example, DeepSeek-V3 is a 671-billion-parameter model, but only 37 billion parameters (approximately 5%) are activated during the output of each token. DeepSeek-V3-Base is the base model from which the R1 variants are derived.
  • DeepSeek-R1-Zero – A fine-tuned variant of DeepSeek-V3-Base that uses reinforcement learning to guide CoT reasoning capabilities, without any prior SFT. According to the DeepSeek-R1 paper, DeepSeek-R1-Zero excelled at reasoning behaviors but encountered challenges with readability and language mixing.
  • DeepSeek-R1 – Another fine-tuned variant of DeepSeek-V3-Base, built similarly to DeepSeek-R1-Zero, but with a multi-step training pipeline. DeepSeek-R1 starts with a small amount of cold-start data prior to the GRPO process. It also incorporates SFT data through rejection sampling, combined with supervised data generated from DeepSeek-V3, to retrain DeepSeek-V3-Base. After this, the retrained model goes through another round of RL, resulting in the DeepSeek-R1 model checkpoint.
  • DeepSeek-R1-Distill – Distilled variants of Alibaba’s Qwen and Meta’s Llama models, based on Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. The distilled model variants are a result of fine-tuning Qwen or Llama models through knowledge distillation, where DeepSeek-R1 acts as the teacher and Qwen or Llama as the student. These models retain their existing architecture while gaining additional reasoning capabilities through a distillation process. They are exclusively fine-tuned using SFT and don’t incorporate any RL techniques.

The following figure illustrates the performance of DeepSeek-R1 compared to other state-of-the-art models on standard benchmark tests, such as MATH-500, MMLU, and more.

Hugging Face Text Generation Inference (TGI)

Hugging Face Text Generation Inference (TGI) is a high-performance, production-ready inference framework optimized for deploying and serving large language models (LLMs) efficiently. It is designed to handle the demanding computational and latency requirements of state-of-the-art transformer models, including Llama, Falcon, Mistral, Mixtral, and GPT variants. For a full list of models supported by TGI, refer to supported models.

Amazon SageMaker AI provides a managed way to deploy TGI-optimized models, offering deep integration with Hugging Face’s inference stack for scalable and cost-efficient LLM deployment. To learn more about Hugging Face TGI support on Amazon SageMaker AI, refer to this announcement post and the documentation on deploying models to Amazon SageMaker AI.

Key Optimizations in Hugging Face TGI

Hugging Face TGI is built to address the challenges associated with deploying large-scale text generation models, such as inference latency, token throughput, and memory constraints.

The benefits of using TGI include:

  • Tensor parallelism – Splits large models across multiple GPUs for efficient memory utilization and computation
  • Continuous batching – Dynamically batches multiple inference requests to maximize token throughput and reduce latency
  • Quantization – Lowers memory usage and computational cost by converting model weights to INT8 or FP16
  • Speculative decoding – Uses a smaller draft model to speed up token prediction while maintaining accuracy
  • Key-value cache optimization – Reduces redundant computations for faster response times in long-form text generation
  • Token streaming – Streams tokens in real time for low-latency applications like chatbots and virtual assistants

Runtime TGI Arguments

TGI containers support runtime configurations that provide greater control over LLM deployments. These configurations allow you to adjust settings such as quantization, model parallel size (tensor parallel size), maximum tokens, data type (dtype), and more using container environment variables.

Notable runtime parameters influencing your model deployment include:

  1. HF_MODEL_ID: This parameter specifies the identifier of the model to load, which can be a model ID from the Hugging Face Hub (e.g., meta-llama/Llama-3.2-11B-Vision-Instruct) or an Amazon Simple Storage Service (Amazon S3) URI containing the model files.
  2. HF_TOKEN: This parameter provides the access token required to download gated models from the Hugging Face Hub, such as Llama or Mistral.
  3. SM_NUM_GPUS: This parameter specifies the number of GPUs to use for model inference, allowing the model to be sharded across multiple GPUs for improved performance.
  4. MAX_CONCURRENT_REQUESTS: This parameter controls the maximum number of concurrent requests that the server can handle, effectively managing load and ensuring optimal performance.
  5. DTYPE: This parameter sets the data type for the model weights during loading, with options like float16 or bfloat16, influencing the model’s memory consumption and computational performance.

There are additional optional runtime parameters that are already pre-optimized in TGI containers to maximize performance on host hardware. However, you can modify them to exercise greater control over your LLM inference performance:

  1. MAX_TOTAL_TOKENS: This parameter sets the upper limit on the combined number of input and output tokens a deployment can handle per request, effectively defining the “memory budget” for client interactions.
  2. MAX_INPUT_TOKENS: This parameter specifies the maximum number of tokens allowed in the input prompt of each request, controlling the length of user inputs to manage memory usage and ensure efficient processing.
  3. MAX_BATCH_PREFILL_TOKENS: This parameter caps the total number of tokens processed during the prefill stage across all batched requests, a phase that is both memory-intensive and compute-bound, thereby optimizing resource utilization and preventing out-of-memory errors.

For a complete list of runtime configurations, please refer to text-generation-launcher arguments.
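
As a concrete illustration of how these arguments are passed as container environment variables on SageMaker, the following sketch configures a HuggingFaceModel with a selection of the parameters above. The specific values and the role placeholder are illustrative, not tuned recommendations.

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

tgi_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "SM_NUM_GPUS": "1",                  # shard the model across this many GPUs
    "DTYPE": "bfloat16",                 # weight data type at load time
    "MAX_CONCURRENT_REQUESTS": "128",    # server-side concurrency limit
    "MAX_INPUT_TOKENS": "3072",          # longest accepted prompt
    "MAX_TOTAL_TOKENS": "4096",          # prompt + generated tokens per request
    "MAX_BATCH_PREFILL_TOKENS": "4096",  # cap on tokens processed in the prefill phase
}

tgi_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.0.1"),
    env=tgi_env,
    role="<your-sagemaker-execution-role-arn>",  # placeholder
)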

DeepSeek Deployment Patterns with TGI on Amazon SageMaker AI

Amazon SageMaker AI offers a simple and streamlined approach to deploy DeepSeek-R1 models with just a few lines of code. Additionally, SageMaker endpoints support automatic load balancing and autoscaling, enabling your LLM deployment to scale dynamically based on incoming requests. During non-peak hours, the endpoint can scale down to zero, optimizing resource usage and cost efficiency.
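
One common way to configure that scaling behavior (a sketch, not code from this post) is through the Application Auto Scaling API, registering the endpoint’s production variant as a scalable target with a target-tracking policy. The endpoint name and thresholds below are placeholders, and scaling all the way down to zero may require additional endpoint configuration not shown here.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/deepseek-r1-llama-8b-endpoint/variant/AllTraffic"  # placeholder endpoint name

# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale in and out based on invocations per instance
autoscaling.put_scaling_policy(
    PolicyName="deepseek-tgi-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)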

The table below summarizes all DeepSeek-R1 models available on the Hugging Face Hub, as uploaded by the original model provider, DeepSeek.

Model | # Total Params | # Activated Params | Context Length | Download
DeepSeek-R1-Zero | 671B | 37B | 128K | 🤗 deepseek-ai/DeepSeek-R1-Zero
DeepSeek-R1 | 671B | 37B | 128K | 🤗 deepseek-ai/DeepSeek-R1

DeepSeek AI also offers distilled versions of its DeepSeek-R1 model to offer more efficient alternatives for various applications.

Model | Base Model | Download
DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 🤗 deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 🤗 deepseek-ai/DeepSeek-R1-Distill-Llama-70B

There are two ways to deploy LLMs, such as DeepSeek-R1 and its distilled variants, on Amazon SageMaker:

Option 1: Direct Deployment from Hugging Face Hub

The easiest way to host DeepSeek-R1 in your AWS account is by deploying it (along with its distilled variants) using TGI containers. These containers simplify deployment through straightforward runtime environment specifications. The architecture diagram below shows a direct download from the Hugging Face Hub, ensuring seamless integration with Amazon SageMaker.

The following code shows how to deploy the DeepSeek-R1-Distill-Llama-8B model to a SageMaker endpoint, directly from the Hugging Face Hub.

import sagemaker
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)

role = sagemaker.get_execution_role()
session = sagemaker.Session()

# select the latest 3+ version container 
deploy_image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="3.0.1",
)

deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        ...
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model",  # optional
)

pretrained_tgi_predictor = deepseek_tgi_model.deploy(
    endpoint_name="deepseek-r1-llama-8b-endpoint",  # optional
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    wait=False,  # set to True to wait for the endpoint to reach InService
)

To deploy other distilled models, simply update the HF_MODEL_ID to any of the DeepSeek distilled model variants, such as deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, or deepseek-ai/DeepSeek-R1-Distill-Llama-70B.
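
Once the endpoint is InService, it can be invoked using TGI’s JSON schema. The following sketch calls the endpoint through the SageMaker runtime client; the endpoint name matches the optional name used above, and the prompt and generation parameters are illustrative.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "What is 7 * 6? Think step by step.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.6, "do_sample": True},
}

response = runtime.invoke_endpoint(
    EndpointName="deepseek-r1-llama-8b-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)

result = json.loads(response["Body"].read())
print(result)  # TGI returns the completion under "generated_text"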

Option 2: Deployment from a Private S3 Bucket

To deploy models privately within your AWS account, upload the DeepSeek-R1 model weights to an S3 bucket and set HF_MODEL_ID to the corresponding S3 bucket prefix. TGI will then retrieve and deploy the model weights from S3, eliminating the need for internet downloads during each deployment. This approach reduces model loading latency by keeping the weights closer to your SageMaker endpoints and enables your organization’s security teams to perform vulnerability scans before deployment. SageMaker endpoints also support auto scaling, allowing DeepSeek-R1 to scale horizontally based on incoming request volume while seamlessly integrating with elastic load balancing.

Deploying a DeepSeek-R1 distilled model variant from S3 follows the same process as option 1, with one key difference: HF_MODEL_ID points to the S3 bucket prefix instead of the Hugging Face Hub. Before deployment, you must first download the model weights from the Hugging Face Hub and upload them to your S3 bucket.
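
A minimal sketch of that preparation step, assuming the huggingface_hub and sagemaker Python packages and a placeholder bucket and prefix:

from huggingface_hub import snapshot_download
from sagemaker.s3 import S3Uploader

# Download the model weights locally (gated models would also require an HF token)
local_dir = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    local_dir="./DeepSeek-R1-Distill-Llama-8B",
)

# Upload the snapshot to your private S3 bucket; this prefix becomes HF_MODEL_ID
s3_model_uri = S3Uploader.upload(
    local_path=local_dir,
    desired_s3_uri="s3://my-model-bucket/path/to/model",  # placeholder bucket/prefix
)
print(s3_model_uri)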

deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "s3://my-model-bucket/path/to/model",
        ...
    },
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"],
    },
    role=role,
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model-s3",  # optional
)

Deployment Best Practices

The following are some best practices to consider when deploying DeepSeek-R1 models on SageMaker AI:

  • Deploy within a private VPC – It’s recommended to deploy your LLM endpoints inside a virtual private cloud (VPC) and within a private subnet, preferably with no egress. See the following code:
deepseek_tgi_model = HuggingFaceModel(
    image_uri=deploy_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        ...
    },
    role=role,
    vpc_config={
        "Subnets": ["subnet-xxxxxxxx", "subnet-yyyyyyyy"],
        "SecurityGroupIds": ["sg-zzzzzzzz"],
    },
    sagemaker_session=session,
    name="deepseek-r1-llama-8b-model",  # optional
)
  • Implement guardrails for safety and compliance – Always apply guardrails to validate incoming and outgoing model responses for safety, bias, and toxicity. You can use Amazon Bedrock Guardrails to enforce these protections on your SageMaker endpoint responses.
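
The following is a sketch of applying such a check to a model response with the Amazon Bedrock ApplyGuardrail API; the guardrail ID and version are placeholders, and the response handling is simplified.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

model_output = "..."  # text returned by the SageMaker endpoint

# Validate the generated text against a pre-created Amazon Bedrock guardrail
result = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="my-guardrail-id",  # placeholder
    guardrailVersion="1",                   # placeholder
    source="OUTPUT",                        # evaluate model output (use "INPUT" for prompts)
    content=[{"text": {"text": model_output}}],
)

if result["action"] == "GUARDRAIL_INTERVENED":
    # Fall back to the guardrail's masked or blocked output instead of the raw response
    model_output = result["outputs"][0]["text"]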

Inference Performance Evaluation

This section presents examples of the inference performance of DeepSeek-R1 distilled variants on Amazon SageMaker AI. Evaluating LLM performance across key metrics—end-to-end latency, throughput, and resource efficiency—is crucial for ensuring responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly enhances user experience, system reliability, and deployment feasibility at scale.

All DeepSeek-R1 Qwen (1.5B, 7B, 14B, 32B) and Llama (8B, 70B) variants are evaluated against four key performance metrics:

  1. End-to-End Latency
  2. Throughput (Tokens per Second)
  3. Time to First Token
  4. Inter-Token Latency

Please note that the main purpose of this performance evaluation is to give you an indication of the relative performance of the distilled R1 models on different hardware. We didn’t try to optimize performance for each model, hardware, and use case combination. These results should not be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets and input/output sequence lengths.

If you are interested in running this evaluation job inside your own account, refer to our code on GitHub.
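
For a rough sense of how such measurements can be taken, the following sketch times repeated invocations of a deployed endpoint and derives average end-to-end latency and output-token throughput. It is a drastically simplified stand-in for the evaluation code referenced above, with a placeholder endpoint name and an approximate output-token count.

import json
import time
import boto3

runtime = boto3.client("sagemaker-runtime")
endpoint_name = "deepseek-r1-llama-8b-endpoint"  # placeholder

payload = json.dumps({
    "inputs": "Explain the difference between supervised and reinforcement learning.",
    "parameters": {"max_new_tokens": 256},
})

latencies, tokens_per_sec = [], []
for _ in range(10):  # the tests described below used 100 iterations at concurrency 1
    start = time.perf_counter()
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=payload
    )
    response["Body"].read()  # drain the response body
    elapsed = time.perf_counter() - start
    latencies.append(elapsed)
    tokens_per_sec.append(256 / elapsed)  # rough estimate assuming ~256 output tokens

print(f"avg end-to-end latency: {sum(latencies) / len(latencies):.2f}s")
print(f"avg throughput: {sum(tokens_per_sec) / len(tokens_per_sec):.1f} tokens/s")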

Scenarios

We tested the following scenarios:

  • Tokens – Two input token lengths were used to evaluate the performance of DeepSeek-R1 distilled variants hosted on SageMaker endpoints. Each test was executed 100 times, with concurrency set to 1, and the average values across key performance metrics were recorded. All models were run with dtype=bfloat16.
    • Short-length test – 512 input tokens, 256 output tokens.
    • Medium-length test – 3,072 input tokens, 256 output tokens.
  • Hardware – Several instance families were tested, including p4d (NVIDIA A100), g5 (NVIDIA A10G), g6 (NVIDIA L4), and g6e (NVIDIA L40S), each equipped with 1, 4, or 8 GPUs per instance. For additional details regarding pricing and instance specifications, refer to Amazon SageMaker AI pricing. In the following table, a green cell indicates that a model was tested on a specific instance type, and a red cell indicates that the model wasn’t tested, either because the instance type or size was excessive for the model or because it couldn’t run due to insufficient GPU memory.

Box Plots

In the following sections, we use box plots to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers, using a box for the middle 50% of the data with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness.
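
A minimal matplotlib sketch of this kind of plot, using made-up latency samples rather than the benchmark data:

import matplotlib.pyplot as plt

# Illustrative latency samples (seconds) for two hypothetical instance types
samples = {
    "ml.g5.xlarge": [2.1, 2.3, 2.2, 2.8, 2.4, 2.2, 3.9],
    "ml.g6.xlarge": [2.9, 3.1, 3.0, 3.4, 3.2, 3.0, 4.6],
}

plt.boxplot(list(samples.values()), labels=list(samples.keys()))
plt.ylabel("End-to-end latency (s)")
plt.title("Example box plot (illustrative data)")
plt.show()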

DeepSeek-R1-Distill-Qwen-1.5B

A single GPU instance is sufficient to host one or multiple concurrent DeepSeek-R1-Distill-Qwen-1.5B serving workers on TGI. In this test, a single worker was deployed, and performance was evaluated across the four outlined metrics. Results show that the ml.g5.xlarge outperforms the ml.g6.xlarge across all metrics.

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge. The ml.g6e.xlarge instance performed the best, followed by ml.g5.2xlarge, ml.g5.xlarge, and the ml.g6 instances.

DeepSeek-R1-Distill-Llama-8B

Similar to the 7B variant, DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.xlarge, ml.g5.2xlarge, ml.g6.xlarge, ml.g6.2xlarge, and ml.g6e.xlarge, with ml.g6e.xlarge demonstrating the highest performance among all instances.

DeepSeek-R1-Distill-Qwen-14B

DeepSeek-R1-Distill-Qwen-14B was tested on ml.g5.12xlarge, ml.g6.12xlarge, ml.g6e.2xlarge, and ml.g6e.xlarge. The ml.g5.12xlarge exhibited the highest performance, followed by ml.g6.12xlarge. Although at a lower performance profile, DeepSeek-R1-14B can also be deployed on the single GPU g6e instances due to their larger memory footprint.

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Qwen-32B requires more than 48GB of memory, making ml.g5.12xlarge, ml.g6.12xlarge, and ml.g6e.12xlarge suitable for performance comparison. In this test, ml.g6e.12xlarge delivered the highest performance, followed by ml.g5.12xlarge, with ml.g6.12xlarge ranking third.

DeepSeek-R1-Distill-Llama-70B

DeepSeek-R1-Distill-Llama-70B was tested on ml.g5.48xlarge, ml.g6.48xlarge, ml.g6e.12xlarge, ml.g6e.48xlarge, and ml.p4d.24xlarge. The best performance was observed on ml.p4d.24xlarge, followed by ml.g6e.48xlarge, ml.g6e.12xlarge, ml.g5.48xlarge, and finally ml.g6.48xlarge.

Clean Up

To avoid incurring costs after completing your evaluation, make sure you delete the endpoints you created earlier.

import boto3

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=<region>)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
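
If you also created the endpoint configuration and model objects through the SDK, you may want to remove those as well; the variable names below are placeholders for the names used in your deployment.

# Optionally remove the associated endpoint configuration and model
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sagemaker_client.delete_model(ModelName=model_name)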

Conclusion

In this blog, you learned about the current versions of DeepSeek models and how to use the Hugging Face TGI containers to simplify the deployment of DeepSeek-R1 distilled models (or any other LLM) on Amazon SageMaker AI to just a few lines of code. You also learned how to deploy models directly from the Hugging Face Hub for quick experimentation and from a private S3 bucket to provide enhanced security and model deployment performance.

Finally, you saw an extensive evaluation of all DeepSeek-R1 distilled models across four key inference performance metrics using 13 different NVIDIA accelerator instance types. This analysis provides valuable insights to help you select the optimal instance type for your DeepSeek-R1 deployment. All code used to analyze the DeepSeek-R1 distilled model variants is available on GitHub.


About the Authors

Pranav Murthy is a Worldwide Technical Lead and Sr. GenAI Data Scientist at AWS. He helps customers build, train, deploy, evaluate, and monitor Machine Learning (ML), Deep Learning (DL), and Generative AI (GenAI) workloads on Amazon SageMaker. Pranav specializes in multimodal architectures, with deep expertise in computer vision (CV) and natural language processing (NLP). Previously, he worked in the semiconductor industry, developing AI/ML models to optimize semiconductor processes using state-of-the-art techniques. In his free time, he enjoys playing chess, training models to play chess, and traveling. You can find Pranav on LinkedIn.

Simon Pagezy is a Cloud Partnership Manager at Hugging Face, dedicated to making cutting-edge machine learning accessible through open source and open science. With a background in AI/ML consulting at AWS, he helps organizations leverage the Hugging Face ecosystem on their platform of choice.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.

Read More

Drop It Like It’s Mod: Breathing New Life Into Classic Games With AI in NVIDIA RTX Remix

Drop It Like It’s Mod: Breathing New Life Into Classic Games With AI in NVIDIA RTX Remix

PC game modding is massive, with over 5 billion mods downloaded annually. Mods push graphics forward with each GPU generation, extend a game’s lifespan with new content and attract new players.

NVIDIA RTX Remix is a modding platform for RTX AI PCs that lets modders capture game assets, automatically enhance materials with generative AI tools and create stunning RTX remasters with full ray tracing. Today, RTX Remix exited beta and fully launched with new NVIDIA GeForce RTX 50 Series neural rendering technology and many community-requested upgrades.

Since its initial beta release, more than 30,000 modders have experimented with RTX Remix, bringing ray-traced mods of hundreds of classic titles to over 1 million gamers.

RTX Remix supports a host of AI tools, including NVIDIA DLSS 4, RTX Neural Radiance Cache and the community-published AI model PBRFusion 3.

Modders can build 4K physically based rendering (PBR) assets by hand or use generative AI to accelerate their workflows. And with a few additional clicks, RTX Remix mods support DLSS 4 with Multi Frame Generation. DLSS’ new transformer model and the first neural shader, Neural Radiance Cache, provide enhanced neural rendering performance, meaning classic games look and play better than ever.

Generative AI Texture Tools

RTX Remix’s built-in generative AI texture tools analyze low-resolution textures from classic games, generate physically accurate materials — including normal and roughness maps — and upscale the resolution by up to 4x. Many RTX Remix mods have been created incorporating generative AI.

Earlier this month, RTX Remix modder NightRaven published PBRFusion 3 — a new AI model that upscales textures and generates high-quality normal, roughness and height maps for physically-based materials.

PBRFusion 3 consists of two custom-trained models: a PBR model and a diffusion-based upscaler. PBRFusion 3 can also use the RTX Remix application programming interface to connect with ComfyUI in an integrated flow. NightRaven has packaged all the relevant pieces to make it easy to get started.

The PBRFusion 3 page features a plug-and-play package that includes the relevant ComfyUI graphs and nodes. Once installed, remastering is easy: select a number of textures in RTX Remix’s Viewport and hit process in ComfyUI. This integrated flow enables extensive remasters of popular games to be completed by small hobbyist mod teams.

RTX Remix and REST API

RTX Remix Toolkit capabilities are accessible via REST API, allowing modders to livelink RTX Remix to digital content creation tools such as Blender, modding tools such as Hammer and generative AI apps such as ComfyUI.

For example, through REST API integration, modders can seamlessly export all game textures captured in RTX Remix to ComfyUI and enhance them in one big batch before automatically bringing them back into the game. ComfyUI is RTX-accelerated and includes thousands of generative AI models to try, helping reduce the time to remaster a game scene and providing many ways to process textures.

Modders have many super resolution and PBR models to choose from, including ones that feature metallic and height maps — unlocking 8x or more resolution increases. Additionally, ComfyUI enables modders to use text prompts to generate new details in textures, or make grand stylistic departures by changing an entire scene’s look with a single text prompt.

‘Half-Life 2 RTX’ Demo

Half-Life 2 owners can download a free Half-Life 2 RTX demo from Steam, built with RTX Remix, starting March 18. The demo showcases Orbifold Studios’ work in Ravenholm and Nova Prospekt ahead of the full game’s release at a later date.

Half-Life 2 RTX showcases the expansive capabilities of RTX Remix and NVIDIA’s neural rendering technologies. DLSS 4 with Multi Frame Generation multiplies frame rates by up to 10x at 4K. Neural Radiance Cache further accelerates ray-traced lighting. RTX Skin enhances Father Grigori, headcrabs and zombies with one of the first implementations of subsurface scattering in ray-traced gaming. RTX Volumetrics add realistic smoke effects and fog. And everything interplays and interacts with the fully ray-traced lighting.

What’s Next in AI Starts Here

From the keynote by NVIDIA founder and CEO Jensen Huang on Tuesday, March 18, to over 1,000 inspiring sessions, 300+ exhibits, technical hands-on training and tons of unique networking events — NVIDIA’s own GTC is set to put a spotlight on AI and all its benefits.

Experts from across the AI ecosystem will share insights on deploying AI locally, optimizing models and harnessing cutting-edge hardware and software to enhance AI workloads — highlighting key advancements in RTX AI PCs and workstations. RTX AI Garage will be there to share highlights of the latest advancements coming to the RTX AI platform.

Follow NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Read More

Gaming Goodness: NVIDIA Reveals Latest Neural Rendering and AI Advancements Supercharging Game Development at GDC 2025

Gaming Goodness: NVIDIA Reveals Latest Neural Rendering and AI Advancements Supercharging Game Development at GDC 2025

AI is leveling up the world’s most beloved games, as the latest advancements in neural rendering, NVIDIA RTX and digital human technologies equip game developers to take innovative leaps in their work.

At this year’s GDC conference, running March 17-21 in San Francisco, NVIDIA is revealing new AI tools and technologies to supercharge the next era of graphics in games.

Key announcements include new neural rendering advancements with Unreal Engine 5 and Microsoft DirectX; NVIDIA DLSS 4 now available in over 100 games and apps, making it the most rapidly adopted NVIDIA game technology of all time; and a Half-Life 2 RTX demo coming Tuesday, March 18.

Plus, the open-source NVIDIA RTX Remix modding platform has now been released, and NVIDIA ACE technology enhancements are bringing to life next-generation digital humans and AI agents for games.

Neural Shaders Enable Photorealistic, Living Worlds With AI

The next era of computer graphics will be based on NVIDIA RTX Neural Shaders, which allow the training and deployment of tiny neural networks from within shaders to generate textures, materials, lighting, volumes and more. This results in dramatic improvements in game performance, image quality and interactivity, delivering new levels of immersion for players.

At the CES trade show earlier this year, NVIDIA introduced RTX Kit, a comprehensive suite of neural rendering technologies for building AI-enhanced, ray-traced games with massive geometric complexity and photorealistic characters.

Now, at GDC, NVIDIA is expanding its powerful lineup of neural rendering technologies, including with Microsoft DirectX support and plug-ins for Unreal Engine 5.

NVIDIA is partnering with Microsoft to bring neural shading support to the DirectX 12 Agility software development kit preview in April, providing game developers with access to RTX Tensor Cores to accelerate the performance of applications powered by RTX Neural Shaders.

Plus, Unreal Engine developers will be able to get started with RTX Kit features such as RTX Mega Geometry and RTX Hair through the experimental NVIDIA RTX branch of Unreal Engine 5. These enable the rendering of assets with dramatic detail and fidelity, bringing cinematic-quality visuals to real-time experiences.

Now available, NVIDIA’s “Zorah” technology demo has been updated with new incredibly detailed scenes filled with millions of triangles, complex hair systems and cinematic lighting in real time — all by tapping into the latest technologies powering neural rendering, including:

  • ReSTIR Path Tracing
  • ReSTIR Direct Illumination
  • RTX Mega Geometry
  • RTX Hair

And the first neural shader, Neural Radiance Cache, is now available in RTX Remix.

Over 100 DLSS 4 Games and Apps Out Now

DLSS 4 debuted with the release of GeForce RTX 50 Series GPUs. Over 100 games and apps now feature support for DLSS 4. This milestone has been reached two years quicker than with DLSS 3, making DLSS 4 the most rapidly adopted NVIDIA game technology of all time.

DLSS 4 introduced Multi Frame Generation, which uses AI to generate up to three additional frames per traditionally rendered frame, working with the complete suite of DLSS technologies to multiply frame rates by up to 8x over traditional brute-force rendering.

This massive performance improvement on GeForce RTX 50 Series graphics cards and laptops enables gamers to max out visuals at the highest resolutions and play at incredible frame rates.

In addition, Lost Soul Aside, Mecha BREAK, Phantom Blade Zero, Stellar Blade, Tides of Annihilation and Wild Assault will launch with DLSS 4, giving GeForce RTX gamers the definitive PC experience in each title. Learn more.

Developers can get started with DLSS 4 through the DLSS 4 Unreal Engine plug-in.

‘Half-Life 2 RTX’ Demo Launch, RTX Remix Official Release

Half-Life 2 RTX is a community-made remaster of the iconic first-person shooter Half-Life 2. 

A playable Half-Life 2 RTX demo will be available on Tuesday, March 18, for free download from Steam for Half-Life 2 owners. The demo showcases Orbifold Studios’ work in the eerily sensational maps of Ravenholm and Nova Prospekt, with significantly improved assets and textures, full ray tracing, DLSS 4 with Multi Frame Generation and RTX neural rendering technologies.

Half-Life 2 RTX was made possible by NVIDIA RTX Remix, an open-source platform officially released today for modders to create stunning RTX remasters of classic games.

Use the platform now to join the 30,000+ modders who’ve experimented with enhancing hundreds of classic titles since its beta release last year, enabling over 1 million gamers to experience astonishing ray-traced mods.

NVIDIA ACE Technologies Enhance Game Characters With AI

The NVIDIA ACE suite of RTX-accelerated digital human technologies brings game characters to life with generative AI.

NVIDIA ACE autonomous game characters add autonomous teammates, nonplayer characters (NPCs) and self-learning enemies to games, creating new narrative possibilities and enhancing player immersion.

ACE autonomous game characters are debuting in two titles this month:

In inZOI, “Smart Zoi” NPCs will respond more realistically and intelligently to their environment based on their personalities. The game launches with NVIDIA ACE-based characters on Friday, March 28.

And in NARAKA: BLADEPOINT MOBILE PC VERSION, on-device NVIDIA ACE-powered teammates will help players battle enemies, hunt for loot and fight for victory starting Thursday, March 27.

Developers can start building with ACE today.

Join NVIDIA at GDC.

See notice regarding software product information.

Read More

Relive the Magic as GeForce NOW Brings More Blizzard Gaming to the Cloud

Relive the Magic as GeForce NOW Brings More Blizzard Gaming to the Cloud

Bundle up — GeForce NOW is bringing a flurry of Blizzard titles to its ever-expanding library.

Prepare to weather epic gameplay in the cloud, tackling the genres of real-time strategy (RTS), multiplayer online battle arena (MOBA) and more. Classic Blizzard titles join GeForce NOW, including Heroes of the Storm, Warcraft Rumble and three titles from the Warcraft: Remastered series.

They’re all part of 11 games joining the cloud this week, atop the latest update for hit game Zenless Zone Zero from miHoYo.

Blizzard Heats Things Up

Heroes of the Storm on GeForce NOW
Heroes (and save data) never die in the cloud.

Heroes of the Storm, Blizzard’s unique take on the MOBA genre, offers fast-paced team battles across diverse battlegrounds. The game features a roster of iconic Blizzard franchise characters, each with customizable talents and abilities. Heroes of the Storm emphasizes team-based gameplay with shared experiences and objectives, making it more accessible to newcomers while providing depth for experienced players.

Warcraft Rumble on GeForce NOW
The cloud is rumbling.

In Warcraft Rumble, a mobile action-strategy game set in the Warcraft universe, players collect and deploy miniature versions of the series’ beloved characters. The game offers a blend of tower defense and RTS elements as players battle across various modes, including a single-player campaign, player vs. player matches and cooperative dungeons.

Warcraft Remastered on GeForce NOW
Old-school cool, new-school graphics.

The Warcraft Remastered collection gives the classic RTS titles a modern twist with updated visuals and quality-of-life improvements. Warcraft: Remastered and Warcraft II: Remastered offer enhanced graphics while maintaining the original gameplay, allowing players to toggle between classic and updated visuals. Warcraft III: Reforged includes new graphics options and multiplayer features. These remasters provide nostalgia for longtime fans and an ideal opportunity for new players to experience the iconic strategy games that shaped the genre.

New Games, No Wait

Zenless Zone Zero update 1.6 Among the Forgotten Ruins on GeForce NOW
New agents, new adventures.

The popular Zenless Zone Zero gets its 1.6 update, “Among the Forgotten Ruins,” now available for members to stream without waiting around for updates or downloads. This latest update brings three new playable agents: Soldier 0-Anby, Pulchra and Trigger. Players can explore two new areas, Port Elpis and Reverb Arena, as well as try out the “Hollow Zero-Lost Void” mode. The update also introduces a revamped Decibel system for more strategic gameplay.

Look for the following games available to stream in the cloud this week:

  • Citizen Sleeper 2: Starward Vector (Xbox, available on PC Game Pass)
  • City Transport Simulator: Tram (Steam)
  • Dave the Diver (Steam)
  • Heroes of the Storm (Battle.net)
  • Microtopia (Steam)
  • Orcs Must Die! Deathtrap (Xbox, available on PC Game Pass)
  • Potion Craft: Alchemist Simulator (Steam)
  • Warcraft I Remastered (Battle.net)
  • Warcraft II Remastered (Battle.net)
  • Warcraft III: Reforged (Battle.net)
  • Warcraft Rumble (Battle.net)

What are you planning to play this weekend? Let us know on X or in the comments below.

 

Read More