Enterprises Onboard AI Teammates Faster With NVIDIA NeMo Tools to Scale Employee Productivity

An AI agent is only as accurate, relevant and timely as the data that powers it.

Now generally available, NVIDIA NeMo microservices are helping enterprise IT quickly build AI teammates that tap into data flywheels to scale employee productivity. The microservices provide an end-to-end developer platform for creating state-of-the-art agentic AI systems and continually optimizing them with data flywheels informed by inference and business data, as well as user preferences.

With a data flywheel, enterprise IT can onboard AI agents as digital teammates. These agents can tap into user interactions and data generated during AI inference to continuously improve model performance — turning usage into insight and insight into action.

Building Powerful Data Flywheels for Agentic AI

Without a constant stream of high-quality inputs — from databases, user interactions or real-world signals — an agent’s understanding can weaken, making responses less reliable and agents less productive.

Maintaining and improving the models that power AI agents in production requires three types of data: inference data to gather insights and adapt to evolving data patterns, up-to-date business data to provide intelligence, and user feedback data to indicate whether the model and application are performing as expected. NeMo microservices help developers tap into these three data types.

NeMo microservices speed AI agent development with end-to-end tools for curating, customizing, evaluating and guardrailing the models that drive their agents.

NVIDIA NeMo microservices — including NeMo Customizer, NeMo Evaluator and NeMo Guardrails — can be used alongside NeMo Retriever and NeMo Curator to ease enterprises’ experiences building, optimizing and scaling AI agents through custom enterprise data flywheels. For example:

  • NeMo Customizer accelerates large language model fine-tuning, delivering up to 1.8x higher training throughput. This high-performance, scalable microservice uses popular post-training techniques including supervised fine-tuning and low-rank adaptation.
  • NeMo Evaluator simplifies the evaluation of AI models and workflows on custom and industry benchmarks with just five application programming interface (API) calls.
  • NeMo Guardrails improves compliance protection by up to 1.4x with only half a second of additional latency, helping organizations implement robust safety and security measures that align with organizational policies and guidelines.

With NeMo microservices, developers can build data flywheels that boost AI agent accuracy and efficiency. Deployed through the NVIDIA AI Enterprise software platform, NeMo microservices are easy to operate and can run on any accelerated computing infrastructure, on premises or in the cloud, with enterprise-grade security, stability and support.

The microservices have become generally available at a time when enterprises are building large-scale multi-agent systems, where hundreds of specialized agents — with distinct goals and workflows — collaborate to tackle complex tasks as digital teammates, working alongside employees to assist, augment and accelerate work across functions.

This enterprise-wide impact positions AI agents as a trillion-dollar opportunity — with applications spanning automated fraud detection, shopping assistants, predictive machine maintenance and document review — and underscores the critical role data flywheels play in transforming business data into actionable insights.

Data flywheels built with NVIDIA NeMo microservices constantly curate data, retrain models and evaluate their performance, all with minimal human intervention and maximum autonomy.

Industry Pioneers Boost AI Agent Accuracy With NeMo Microservices

NVIDIA partners and industry pioneers are using NeMo microservices to build responsive AI agent platforms so that digital teammates can help get more done.

Working with Arize and Quantiphi, AT&T has built an advanced AI-powered agent using NVIDIA NeMo, designed to process a knowledge base of nearly 10,000 documents, refreshed weekly. The scalable, high-performance AI agent is fine-tuned for three key business priorities: speed, cost efficiency and accuracy — all increasingly critical as adoption scales.

AT&T boosted AI agent accuracy by up to 40% using NeMo Customizer and Evaluator by fine-tuning a Mistral 7B model to help deliver personalized services, prevent fraud and optimize network performance.

BlackRock is working with NeMo microservices for agentic AI capabilities in its Aladdin tech platform, which unifies the investment management process through a common data language.

Teaming with Galileo, Cisco’s Outshift team is using NVIDIA NeMo microservices to power a coding assistant that delivers 40% fewer tool selection errors and achieves up to 10x faster response times.

Nasdaq is accelerating its Nasdaq Gen AI Platform with NeMo Retriever microservices and NVIDIA NIM microservices. NeMo Retriever enhanced the platform’s search capabilities, leading to up to 30% improved accuracy and response times, in addition to cost savings.

Broad Model and Partner Ecosystem Support for NeMo Microservices

NeMo microservices support a broad range of popular open models, including Llama, the Microsoft Phi family of small language models, Google Gemma, Mistral and Llama Nemotron Ultra, currently the top open model on scientific reasoning, coding and complex math benchmarks.

Meta has tapped NVIDIA NeMo microservices through new connectors for Meta Llamastack. Users can access the same capabilities — including Customizer, Evaluator and Guardrails — via APIs, enabling them to run the full suite of agent-building workflows within their environment.

“With Llamastack integration, agent builders can implement data flywheels powered by NeMo microservices,” said Raghotham Murthy, software engineer, GenAI, at Meta. “This allows them to continuously optimize models to improve accuracy, boost efficiency and reduce total cost of ownership.”

Leading AI software providers such as Cloudera, Datadog, Dataiku, DataRobot, DataStax, SuperAnnotate, Weights & Biases and more have integrated NeMo microservices into their platforms. Developers can use NeMo microservices in popular AI frameworks including CrewAI, Haystack by deepset, LangChain, LlamaIndex and Llamastack.

Enterprises can build data flywheels with NeMo Retriever microservices using NVIDIA AI Data Platform offerings from NVIDIA-Certified Storage partners including DDN, Dell Technologies, Hewlett Packard Enterprise, Hitachi Vantara, IBM, NetApp, Nutanix, Pure Storage, VAST Data and WEKA.

Leading enterprise platforms including Amdocs, Cadence, Cohesity, SAP, ServiceNow and Synopsys are using NeMo Retriever microservices in their AI agent solutions.

Enterprises can run AI agents on NVIDIA-accelerated infrastructure, networking and software from leading system providers including Cisco, Dell, Hewlett Packard Enterprise and Lenovo.

Consulting giants including Accenture, Deloitte and EY are building AI agent platforms for enterprises using NeMo microservices.

Developers can download NeMo microservices from the NVIDIA NGC catalog. The microservices can be deployed as part of NVIDIA AI Enterprise with extended-life software branches for API stability, proactive security remediation and enterprise-grade support.

Read More

Project G-Assist Plug-In Builder Lets Anyone Customize AI on GeForce RTX AI PCs

AI is rapidly reshaping what’s possible on a PC — whether for real-time image generation or voice-controlled workflows. As AI capabilities grow, so does their complexity. Tapping into the power of AI can entail navigating a maze of system settings, software and hardware configurations.

Enabling users to explore how on-device AI can simplify and enhance the PC experience, Project G-Assist — an AI assistant that helps tune, control and optimize GeForce RTX systems — is now available as an experimental feature in the NVIDIA app. Developers can try out AI-powered voice and text commands for tasks like monitoring performance, adjusting settings and interacting with supporting peripherals. Users can even summon other AIs powered by GeForce RTX AI PCs.

And it doesn’t stop there. For those looking to expand Project G-Assist capabilities in creative ways, the AI supports custom plug-ins. With the new ChatGPT-based G-Assist Plug-In Builder, developers and enthusiasts can create and customize G-Assist’s functionality, adding new commands, connecting external tools and building AI workflows tailored to specific needs. With the plug-in builder, users can generate properly formatted code with AI, then integrate the code into G-Assist — enabling quick, AI-assisted functionality that responds to text and voice commands.

Teaching PCs New Tricks: Plug-Ins and APIs Explained

Plug-ins are lightweight add-ons that give software new capabilities. G-Assist plug-ins can control music, connect with large language models and much more.

Under the hood, these plug-ins tap into application programming interfaces (APIs), which allow different software and services to talk to each other. Developers can define functions in simple JSON formats, write logic in Python and quickly integrate new tools or features into G-Assist.
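As a rough illustration of that pattern, the sketch below pairs a JSON-style function definition with a small Python handler. The field names, file names and directory layout here are assumptions; the authoritative plug-in schema lives in NVIDIA's G-Assist GitHub samples.

    import json

    # Hypothetical function definition that G-Assist could expose to its language model
    manifest = {
        "name": "check_streamer_status",
        "description": "Check whether a given Twitch streamer is currently live",
        "parameters": {"streamer": {"type": "string"}},
    }

    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    def check_streamer_status(streamer: str) -> str:
        # Real plug-in logic would call the Twitch API here; this stub just echoes the input.
        return f"Status lookup for {streamer} is not implemented in this sketch."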

With the G-Assist Plug-In Builder, users can:

  • Use a responsive small language model running locally on GeForce RTX GPUs for fast, private inference.
  • Extend G-Assist’s capabilities with custom functionality tailored to specific workflows, games and tools.
  • Interact with G-Assist directly from the NVIDIA overlay, without tabbing out of an application or workflow.
  • Invoke AI-powered GPU and system controls from applications using C++ and Python bindings.
  • Integrate with agentic frameworks using tools like Langflow, letting G-Assist function as a component in larger AI pipelines and multi-agent systems.

Built for Builders: Using Free APIs to Expand AI PC Capabilities 

NVIDIA’s GitHub repository provides everything needed to get started on developing with G-Assist — including sample plug-ins, step-by-step instructions and documentation for building custom functionalities.

Developers can define functions in JSON and drop config files into a designated directory, where G-Assist can automatically load and interpret them. Users can even submit plug-ins for review and potential inclusion in the NVIDIA GitHub repository to make new capabilities available for others.

Hundreds of free, developer-friendly APIs are available today to extend G-Assist capabilities — from automating workflows to optimizing PC setups to boosting online shopping. For ideas, find searchable indices of free APIs for use across entertainment, productivity, smart home, hardware and more on publicapis.dev, free-apis.github.io, apilist.fun and APILayer.

Available sample plug-ins include Spotify, which enables hands-free music and volume control, and Google Gemini, which allows G-Assist to invoke a much larger cloud-based AI for more complex conversations, brainstorming and web searches using a free Google AI Studio API key.

In the clip below, G-Assist asks Gemini for advice on which Legend to pick in the hit game Apex Legends when solo queueing, as well as whether it’s wise to jump into Nightmare mode for level 25 in Diablo IV:

And in the following clip, a developer uses the new plug-in builder to create a Twitch plug-in for G-Assist that checks if a streamer is live. After generating the necessary JSON manifest and Python files, the developer simply drops them into the G-Assist directory to enable voice commands like, “Hey, Twitch, is [streamer] live?”

In addition, users can customize G-Assist to control select peripherals and software applications with simple commands, such as to benchmark or adjust fan speeds, or to change lighting on supported Logitech G, Corsair, MSI and Nanoleaf devices.

Other examples include a Stock Checker plug-in that lets users quickly look up real-time stock prices and performance data, or a Weather plug-in that allows users to ask G-Assist for current weather conditions in any city.

Details on how to build, share and load plug-ins are available on the NVIDIA GitHub repository.

Start Building Today

With the G-Assist Plug-In Builder and open API support, anyone can extend G-Assist to fit their exact needs. Explore the GitHub repository and submit features for review to help shape the next wave of AI-powered PC experiences.

Plug in to NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X.

See notice regarding software product information.

Read More

PyTorch 2.7 Release

We are excited to announce the release of PyTorch® 2.7 (release notes)! This release features:

  • support for the NVIDIA Blackwell GPU architecture and pre-built wheels for CUDA 12.8 across Linux x86 and arm64 architectures.
  • torch.compile support for Torch Function Modes, which enables users to override any torch.* operation to implement custom user-defined behavior.
  • Mega Cache, which allows users to have end-to-end portable caching for torch.
  • new features for FlexAttention – LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference.

This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta
  • Torch.Compile support for Torch Function Modes
  • Mega Cache

Prototype
  • NVIDIA Blackwell Architecture Support
  • PyTorch Native Context Parallel
  • Enhancing Intel GPU Acceleration
  • FlexAttention LLM first token processing on X86 CPUs
  • FlexAttention LLM throughput mode optimization on X86 CPUs
  • Foreach Map
  • Flex Attention for Inference
  • Prologue Fusion Support in Inductor

To see a full list of public feature submissions, click here.

BETA FEATURES

[Beta] Torch.Compile support for Torch Function Modes

This feature enables users to override any torch.* operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to rewrite indexing ops.

See the tutorial for more information.
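For illustration, here is a minimal sketch of the pattern (the toy rewrite of torch.add is made up; only torch.overrides.TorchFunctionMode and torch.compile are assumed):

    import torch
    from torch.overrides import TorchFunctionMode

    class ReplaceAddWithSub(TorchFunctionMode):
        # Toy override: route torch.add calls to torch.sub while the mode is active
        def __torch_function__(self, func, types, args=(), kwargs=None):
            kwargs = kwargs or {}
            if func is torch.add:
                return torch.sub(*args, **kwargs)
            return func(*args, **kwargs)

    @torch.compile
    def f(x, y):
        return torch.add(x, y)

    with ReplaceAddWithSub():
        print(f(torch.ones(4), torch.ones(4)))  # zeros: the add was rewritten to a sub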

[Beta] Mega Cache

Mega Cache allows users to have end-to-end portable caching for torch. The intended use case is that, after compiling and executing a model, the user calls torch.compiler.save_cache_artifacts(), which returns the compiler artifacts in a portable form. Later, potentially on a different machine, the user can call torch.compiler.load_cache_artifacts() with these artifacts to pre-populate the torch.compile caches and jump-start their cache.

See the tutorial for more information.
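A minimal sketch of that workflow (the toy model is illustrative, and the exact return shape of save_cache_artifacts is an assumption based on current releases):

    import torch

    model = torch.compile(torch.nn.Linear(16, 16))
    model(torch.randn(4, 16))  # compile and run once so the caches are populated

    # Serialize the compiler artifacts into a portable form
    artifacts = torch.compiler.save_cache_artifacts()
    assert artifacts is not None
    artifact_bytes, cache_info = artifacts

    # Later, potentially on a different machine, pre-populate the torch.compile caches
    torch.compiler.load_cache_artifacts(artifact_bytes)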

PROTOTYPE FEATURES

[Prototype] NVIDIA Blackwell Architecture Support

PyTorch 2.7 introduces support for NVIDIA’s new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see CUDA Toolkit Release.

  • Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms.
  • PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility.
  • To utilize these new features, install PyTorch with CUDA 12.8 using: pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128

More context can also be found here.

[Prototype] PyTorch Native Context Parallel

The PyTorch Context Parallel API allows users to create a Python context so that every torch.nn.functional.scaled_dot_product_attention() call within it will run with context parallelism. Currently, PyTorch Context Parallel supports three attention backends: Flash attention, Efficient attention, and cuDNN attention.

As an example, this is used within TorchTitan as the Context Parallel solution for LLM training.

See tutorial here.
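A rough sketch of the API shape, launched with torchrun (the experimental module path, keyword names and shapes follow the tutorial but should be treated as assumptions):

    # torchrun --nproc-per-node=<world_size> context_parallel_example.py
    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.experimental import context_parallel

    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))

    # (batch, heads, seq_len, head_dim); the sequence dimension (2) is sharded across ranks
    q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.bfloat16) for _ in range(3))

    with context_parallel(mesh, buffers=[q, k, v], buffer_seq_dims=[2, 2, 2]):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    dist.destroy_process_group()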

[Prototype] Enhancing Intel GPU Acceleration

This latest release introduces enhanced performance optimizations for Intel GPU architectures. These improvements accelerate workloads across various Intel GPUs through the following key enhancements:

  • Enable torch.compile on Windows 11 for Intel GPUs, delivering the same performance advantages over eager mode as on Linux.
  • Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide a full graph mode quantization pipeline with enhanced computational efficiency.
  • Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs.
  • Enable AOTInductor and torch.export on Linux to simplify deployment workflows.
  • Implement more Aten operators to enhance the continuity of operator execution on Intel GPU and increase the performance on Intel GPU in eager mode.
  • Enable profiler on both Windows and Linux to facilitate model performance analysis.
  • Expand the Intel GPUs support to Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, and Intel® Arc™ B-Series graphics on both Windows and Linux.

For more information regarding Intel GPU support, please refer to Getting Started Guide.

See also the tutorials here and here.

[Prototype] FlexAttention LLM first token processing on X86 CPUs

FlexAttention X86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference — via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific scaled_dot_product_attention operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.
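For illustration, a minimal sketch of the unified FlexAttention API under torch.compile (the shapes and the causal score_mod are made up; on x86 CPUs execution goes through the TorchInductor C++ backend):

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    def causal(score, b, h, q_idx, kv_idx):
        # Mask out future positions by sending their scores to -inf
        return torch.where(q_idx >= kv_idx, score, -float("inf"))

    # (batch, heads, seq_len, head_dim)
    q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))

    compiled_flex = torch.compile(flex_attention)
    out = compiled_flex(q, k, v, score_mod=causal)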

[Prototype] FlexAttention LLM throughput mode optimization

The performance of FlexAttention on x86 CPUs for LLM inference throughput scenarios has been further improved by adopting the new C++ micro-GEMM template ability. This addresses the performance bottlenecks for large batch size scenarios present in PyTorch 2.6. With this enhancement, users can transparently benefit from better performance and a smoother experience when using FlexAttention APIs and torch.compile for LLM throughput serving on x86 CPUs.

[Prototype] Foreach Map

This feature uses torch.compile to allow users to apply any pointwise or user-defined function (e.g. torch.add) to lists of tensors, akin to the existing torch._foreach_* ops. The main advantage over the existing torch._foreach_* ops is that any mix of scalars or lists of tensors can be supplied as arguments, and even user-defined Python functions can be lifted to apply to lists of tensors. torch.compile will automatically generate a horizontally fused kernel for optimal performance.

See tutorial here.

[Prototype] Flex Attention for Inference

In release 2.5.0, FlexAttention (torch.nn.attention.flex_attention) was introduced for ML researchers who’d like to customize their attention kernels without writing kernel code. This update introduces a decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support.

[Prototype] Prologue Fusion Support in Inductor

Prologue fusion optimizes matrix multiplication (matmul) operations by fusing operations that come before the matmul into the matmul kernel itself, improving performance by reducing global memory bandwidth.

Read More

Supercharge your LLM performance with Amazon SageMaker Large Model Inference container v15

Today, we’re excited to announce the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version now supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R1, and many more. Amazon SageMaker AI continues to evolve its generative AI inference capabilities to meet the growing demands in performance and model support for foundation models (FMs).

This release introduces significant performance improvements, expanded model compatibility with multimodality (that is, the ability to understand and analyze text-to-text, image-to-text, and text-to-image data), and provides built-in integration with vLLM to help you seamlessly deploy and serve large language models (LLMs) with the highest performance at scale.

What’s new?

LMI v15 brings several enhancements that improve throughput, latency, and usability:

  1. An async mode that directly integrates with vLLM’s AsyncLLMEngine for improved request handling. This mode creates a more efficient background loop that continuously processes incoming requests, enabling it to handle multiple concurrent requests and stream outputs with higher throughput than the previous Rolling-Batch implementation in v14.
  2. Support for the vLLM V1 engine, which delivers up to 111% higher throughput compared to the previous V0 engine for smaller models at high concurrency. This performance improvement comes from reduced CPU overhead, optimized execution paths, and more efficient resource utilization in the V1 architecture. LMI v15 supports both V1 and V0 engines, with V1 being the default. If you have a need to use V0, you can use the V0 engine by specifying VLLM_USE_V1=0. vLLM V1’s engine also comes with a core re-architecture of the serving engine with simplified scheduling, zero-overhead prefix caching, clean tensor-parallel inference, efficient input preparation, and advanced optimizations with torch.compile and Flash Attention 3. For more information, see the vLLM Blog.
  3. Expanded API schema support with three flexible options to allow seamless integration with applications built on popular API patterns:
    1. Message format compatible with the OpenAI Chat Completions API.
    2. OpenAI Completions format.
    3. Text Generation Inference (TGI) schema to support backward compatibility with older models.
  4. Multimodal support, with enhanced capabilities for vision-language models, including optimizations such as multimodal prefix caching.
  5. Built-in support for function calling and tool calling, enabling sophisticated agent-based workflows.

Enhanced model support

LMI v15 supports an expanding roster of state-of-the-art models, including the latest releases from leading model providers. The container offers ready-to-deploy compatibility for models including, but not limited to:

  • Llama 4 – Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E-Instruct
  • Gemma 3 – Google’s lightweight and efficient models, known for their strong performance despite smaller size
  • Qwen 2.5 – Alibaba’s advanced models including QwQ 2.5 and Qwen2-VL with multimodal capabilities
  • Mistral AI models – High-performance models from Mistral AI that offer efficient scaling and specialized capabilities
  • DeepSeek-R1/V3 – State of the art reasoning models

Each model family can be deployed using the LMI v15 container by specifying the appropriate model ID, for example, meta-llama/Llama-4-Scout-17B-16E, and configuration parameters as environment variables, without requiring custom code or optimization work.

Benchmarks

Our benchmarks demonstrate the performance advantages of LMI v15’s V1 engine compared to previous versions:

# | Model | Batch size | Instance type | LMI v14 throughput, tokens/s (V0 engine) | LMI v15 throughput, tokens/s (V1 engine) | Improvement
1 | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128 | p4d.24xlarge | 1768 | 2198 | 24%
2 | meta-llama/Llama-3.1-8B-Instruct | 64 | ml.g6e.2xlarge | 1548 | 2128 | 37%
3 | mistralai/Mistral-7B-Instruct-v0.3 | 64 | ml.g6e.2xlarge | 942 | 1988 | 111%

DeepSeek-R1 Llama 70B for various levels of concurrency

Llama 3.1 8B Instruct for various levels of concurrency

Mistral 7B for various levels of concurrency

The async engine in LMI v15 shows strength in high-concurrency scenarios, where multiple simultaneous requests benefit from the optimized request handling. These benchmarks highlight that the V1 engine in async mode delivers between 24% and 111% higher throughput compared to LMI v14 using rolling batch for the models tested in high-concurrency scenarios at batch sizes of 64 and 128. We suggest keeping in mind the following considerations for optimal performance:

  • Higher batch sizes increase concurrency but come with a natural tradeoff in terms of latency
  • Batch sizes of 4 and 8 provide the best latency for most use cases
  • Batch sizes up to 64 and 128 achieve maximum throughput with acceptable latency trade-offs

API formats

LMI v15 supports three API schemas: OpenAI Chat Completions, OpenAI Completions, and TGI.

  • Chat Completions – Message format is compatible with OpenAI Chat Completions API. Use this schema for tool calling, reasoning, and multimodal use cases. Here is a sample of the invocation with the Messages API:
    body = {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }

  • OpenAI Completions format – The Completions API endpoint is no longer receiving updates:
    body = {
     "prompt": "Name popular places to visit in London?",
     "temperature": 0.9,
     "max_tokens": 256,
     "stream": True,
    } 

  • TGI – Supports backward compatibility with older models:
    body = {
        "inputs": "Name popular places to visit in London?",
        "parameters": {
            "max_new_tokens": 256,
            "temperature": 0.9,
        },
        "stream": True,
    }

Getting started with LMI v15

Getting started with LMI v15 is seamless, and you can deploy with LMI v15 in only a few lines of code. The container is available through Amazon Elastic Container Registry (Amazon ECR), and deployments can be managed through SageMaker AI endpoints. To deploy models, you need to specify the Hugging Face model ID, instance type, and configuration options as environment variables.

For optimal performance, we recommend the following instances:

  • Llama 4 Scout: ml.p5.48xlarge
  • DeepSeek R1/V3: ml.p5e.48xlarge
  • Qwen 2.5 VL-32B: ml.g5.12xlarge
  • Qwen QwQ 32B: ml.g5.12xlarge
  • Mistral Large: ml.g6e.48xlarge
  • Gemma3-27B: ml.g5.12xlarge
  • Llama 3.3-70B: ml.p4d.24xlarge

To deploy with LMI v15, follow these steps:

  1. Clone the notebook to your Amazon SageMaker Studio notebook or to Visual Studio Code (VS Code). You can then run the notebook to do the initial setup and deploy the model from the Hugging Face repository to the SageMaker AI endpoint. We walk through the key blocks here.
  2. LMI v15 maintains the same configuration pattern as previous versions, using environment variables in the form OPTION_<CONFIG_NAME>. This consistent approach makes it straightforward for users familiar with earlier LMI versions to migrate to v15.
    vllm_config = {
        "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
        "HF_TOKEN": "entertoken",
        "OPTION_MAX_MODEL_LEN": "250000",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_MODEL_LOADING_TIMEOUT": "1500",
        "SERVING_FAIL_FAST": "true",
        "OPTION_ROLLING_BATCH": "disable",
        "OPTION_ASYNC_MODE": "true",
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service"
    }

    • HF_MODEL_ID sets the model ID from Hugging Face. You can also download the model from Amazon Simple Storage Service (Amazon S3).
    • HF_TOKEN sets the token to download the model. This is required for gated models like Llama 4.
    • OPTION_MAX_MODEL_LEN sets the maximum model context length.
    • OPTION_MAX_ROLLING_BATCH_SIZE sets the batch size for the model.
    • OPTION_MODEL_LOADING_TIMEOUT sets the timeout value for SageMaker to load the model and run health checks.
    • SERVING_FAIL_FAST=true is recommended because it allows SageMaker to gracefully restart the container when an unrecoverable engine error occurs.
    • OPTION_ROLLING_BATCH=disable turns off the rolling batch implementation of LMI, which was the default offering in LMI v14. We recommend using async mode instead, because it is the latest implementation and provides better performance.
    • OPTION_ASYNC_MODE=true enables async mode.
    • OPTION_ENTRYPOINT provides the entry point for vLLM’s async integrations.
  3. Set the latest container (in this example we used 0.33.0-lmi15.0.0-cu128), AWS Region (us-east-1), and create a model artifact with all the configurations. To review the latest available container version, see Available Deep Learning Containers Images.
  4. Deploy the model to the endpoint using model.deploy().
    import sagemaker
    from sagemaker import Model

    # Execution role for SageMaker (assumed to be set up earlier in the notebook)
    role = sagemaker.get_execution_role()

    CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'
    REGION = 'us-east-1'

    # Construct container URI
    container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'

    # Select instance type
    instance_type = "ml.p5.48xlarge"

    # Create the model object with the LMI v15 container and the vLLM configuration
    model = Model(image_uri=container_uri,
                  role=role,
                  env=vllm_config)
    endpoint_name = sagemaker.utils.name_from_base("Llama-4")

    print(endpoint_name)
    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        container_startup_health_check_timeout=1800
    )

  5. Invoke the model. SageMaker inference provides two APIs to invoke the model: InvokeEndpoint and InvokeEndpointWithResponseStream. You can choose either option based on your needs; a sketch for reading the streamed response follows the code.
    import boto3
    import json

    # Create SageMaker Runtime client
    smr_client = boto3.client('sagemaker-runtime')

    ## Add your endpoint here
    endpoint_name = ''

    # Invoke with messages format
    body = {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
        "max_tokens": 256,
        "stream": True,
    }

    # Invoke with endpoint streaming
    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )
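The streaming call returns an event stream rather than a single payload. A minimal sketch for printing streamed chunks as they arrive is shown below (the exact payload framing depends on the API schema you chose, so treat the parsing as illustrative):

    # Read the event stream returned by InvokeEndpointWithResponseStream
    for event in resp["Body"]:
        chunk = event.get("PayloadPart", {}).get("Bytes")
        if chunk:
            print(chunk.decode("utf-8"), end="")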

To run multi-modal inference with Llama-4 Scout, see the notebook for the full code sample to run inference requests with images.

Conclusion

Amazon SageMaker LMI container v15 represents a significant step forward in large model inference capabilities. With the new vLLM V1 engine, async operating mode, expanded model support, and optimized performance, you can deploy cutting-edge LLMs with greater performance and flexibility. The container’s configurable options give you the flexibility to fine-tune deployments for your specific needs, whether optimizing for latency, throughput, or cost.

We encourage you to explore this release for deploying your generative AI models.

Check out the provided example notebooks to start deploying models with LMI v15.


About the authors

Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Siddharth Venkatesan is a Software Engineer in AWS Deep Learning. He currently focuses on building solutions for large model inference. Prior to AWS, he worked in the Amazon Grocery org building new payment features for customers worldwide. Outside of work, he enjoys skiing, the outdoors, and watching sports.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the SageMaker machine learning and generative AI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in Generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.

Read More

Accuracy evaluation framework for Amazon Q Business – Part 2

In the first post of this series, we introduced a comprehensive evaluation framework for Amazon Q Business, a fully managed Retrieval Augmented Generation (RAG) solution that uses your company’s proprietary data without the complexity of managing large language models (LLMs). The first post focused on selecting appropriate use cases, preparing data, and implementing metrics to support a human-in-the-loop evaluation process.

In this post, we dive into the solution architecture necessary to implement this evaluation framework for your Amazon Q Business application. We explore two distinct evaluation solutions:

  • Comprehensive evaluation workflow – This ready-to-deploy solution uses AWS CloudFormation stacks to set up an Amazon Q Business application, complete with user access, a custom UI for review and evaluation, and the supporting evaluation infrastructure
  • Lightweight AWS Lambda based evaluation – Designed for users with an existing Amazon Q Business application, this streamlined solution employs an AWS Lambda function to efficiently assess the application’s accuracy

By the end of this post, you will have a clear understanding of how to implement an evaluation framework that aligns with your specific needs with a detailed walkthrough, so your Amazon Q Business application delivers accurate and reliable results.

Challenges in evaluating Amazon Q Business

Evaluating the performance of Amazon Q Business, which uses a RAG model, presents several challenges due to its integration of retrieval and generation components. It’s crucial to identify which aspects of the solution need evaluation. For Amazon Q Business, both the retrieval accuracy and the quality of the answer output are important factors to assess. In this section, we discuss key metrics that need to be included for a RAG generative AI solution.

Context recall

Context recall measures the extent to which all relevant content is retrieved. High recall provides comprehensive information gathering but might introduce extraneous data.

For example, a user might ask the question “What can you tell me about the geography of the United States?” They could get the following responses:

  • Expected: The United States is the third-largest country in the world by land area, covering approximately 9.8 million square kilometers. It has a diverse range of geographical features.
  • High context recall: The United States spans approximately 9.8 million square kilometers, making it the third-largest nation globally by land area. The country’s geography is incredibly diverse, featuring the Rocky Mountains stretching from New Mexico to Alaska, the Appalachian Mountains along the eastern states, the expansive Great Plains in the central region, and arid deserts like the Mojave in the southwest.
  • Low context recall: The United States features significant geographical landmarks. Additionally, the country is home to unique ecosystems like the Everglades in Florida, a vast network of wetlands.

The following diagram illustrates the context recall workflow.

Context precision

Context precision assesses the relevance and conciseness of retrieved information. High precision indicates that the retrieved information closely matches the query intent, reducing irrelevant data.

For example, “Why is Silicon Valley great for tech startups?” might give the following answers:

  • Ground truth answer: Silicon Valley is famous for fostering innovation and entrepreneurship in the technology sector.
  • High precision context: Many groundbreaking startups originate from Silicon Valley, benefiting from a culture that encourages innovation and risk-taking.
  • Low precision context: Silicon Valley experiences a Mediterranean climate, with mild, wet winters and warm, dry summers, contributing to its appeal as a place to live and work.

The following diagram illustrates the context precision workflow.

Answer relevancy

Answer relevancy evaluates whether responses fully address the query without unnecessary details. Relevant answers enhance user satisfaction and trust in the system.

For example, a user might ask the question “What are the key features of Amazon Q Business Service, and how can it benefit enterprise customers?” They could get the following answers:

  • High relevance answer: Amazon Q Business Service is a RAG Generative AI solution designed for enterprise use. Key features include a fully managed Generative AI solutions, integration with enterprise data sources, robust security protocols, and customizable virtual assistants. It benefits enterprise customers by enabling efficient information retrieval, automating customer support tasks, enhancing employee productivity through quick access to data, and providing insights through analytics on user interactions.
  • Low relevance answer: Amazon Q Business Service is part of Amazon’s suite of cloud services. Amazon also offers online shopping and streaming services.

The following diagram illustrates the answer relevancy workflow.

Truthfulness

Truthfulness verifies factual accuracy by comparing responses to verified sources. Truthfulness is crucial to maintain the system’s credibility and reliability.

For example, a user might ask “What is the capital of Canada?” They could get the following responses:

  • Context: Canada’s capital city is Ottawa, located in the province of Ontario. Ottawa is known for its historic Parliament Hill, the center of government, and the scenic Rideau Canal, a UNESCO World Heritage site
  • High truthfulness answer: The capital of Canada is Ottawa
  • Low truthfulness answer: The capital of Canada is Toronto

The following diagram illustrates the truthfulness workflow.

Evaluation methods

Deciding on who should conduct the evaluation can significantly impact results. Options include:

  • Human-in-the-Loop (HITL) – Human evaluators manually assess the accuracy and relevance of responses, offering nuanced insights that automated systems might miss. However, it is a slow process and difficult to scale.
  • LLM-aided evaluation – Automated methods, such as the Ragas framework, use language models to streamline the evaluation process. However, these might not fully capture the complexities of domain-specific knowledge. A minimal Ragas sketch follows this list.
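For illustration, here is a rough sketch of LLM-aided scoring with the Ragas library (column and metric names follow the classic Ragas API and vary across versions; Ragas calls a judge LLM behind the scenes, and faithfulness is Ragas's name for what this post calls truthfulness):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

    # One evaluation record: question, retrieved context, generated answer, and ground truth
    eval_data = Dataset.from_dict({
        "question": ["What is the capital of Canada?"],
        "contexts": [["Canada's capital city is Ottawa, located in the province of Ontario."]],
        "answer": ["The capital of Canada is Ottawa."],
        "ground_truth": ["The capital of Canada is Ottawa."],
    })

    # Each metric is scored between 0 and 1 by a judge LLM (credentials must be configured)
    scores = evaluate(
        eval_data,
        metrics=[context_recall, context_precision, answer_relevancy, faithfulness],
    )
    print(scores)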

Each of these preparatory and evaluative steps contributes to a structured approach to evaluating the accuracy and effectiveness of Amazon Q Business in supporting enterprise needs.

Solution overview

In this post, we explore two different solutions to provide you the details of an evaluation framework, so you can use it and adapt it for your own use case.

Solution 1: End-to-end evaluation solution

For a quick start evaluation framework, this solution uses a hybrid approach with Ragas (automated scoring) and HITL evaluation for robust accuracy and reliability. The architecture includes the following components:

  • User access and UI – Authenticated users interact with a frontend UI to upload datasets, review Ragas output, and provide human feedback
  • Evaluation solution infrastructure – Core components include:
    • Ragas scoring – Automated metrics provide an initial layer of evaluation
    • HITL review – Human evaluators refine Ragas scores through the UI, providing nuanced accuracy and reliability

By integrating a metric-based approach with human validation, this architecture makes sure Amazon Q Business delivers accurate, relevant, and trustworthy responses for enterprise users. This solution further enhances the evaluation process by incorporating HITL reviews, enabling human feedback to refine automated scores for higher precision.

A quick video demo of this solution is shown below:

Solution architecture

The solution architecture is designed with the following core functionalities to support an evaluation framework for Amazon Q Business:

  1. User access and UI – Users authenticate through Amazon Cognito, and upon successful login, interact with a Streamlit-based custom UI. This frontend allows users to upload CSV datasets to Amazon Simple Storage Service (Amazon S3), review Ragas evaluation outputs, and provide human feedback for refinement. The application exchanges the Amazon Cognito token for an AWS IAM Identity Center token, granting scoped access to Amazon Q Business.
  2. UI infrastructure – The UI is hosted behind an Application Load Balancer, supported by Amazon Elastic Compute Cloud (Amazon EC2) instances running in an Auto Scaling group for high availability and scalability.
  3. Upload dataset and trigger evaluation – Users upload a CSV file containing queries and ground truth answers to Amazon S3, which triggers an evaluation process. A Lambda function reads the CSV, stores its content in a DynamoDB table, and initiates further processing through a DynamoDB stream.
  4. Consuming DynamoDB stream – A separate Lambda function processes new entries from the DynamoDB stream and publishes messages to an SQS queue, which serves as a trigger for the evaluation Lambda function.
  5. Ragas scoring – The evaluation Lambda function consumes SQS messages, sending queries (prompts) to Amazon Q Business for generating answers. It then evaluates the prompt, ground truth, and generated answer using the Ragas evaluation framework. Ragas computes automated evaluation metrics such as context recall, context precision, answer relevancy, and truthfulness. The results are stored in DynamoDB and visualized in the UI.
  6. HITL review – Authenticated users can review and refine Ragas scores directly through the UI, providing nuanced and accurate evaluations by incorporating human insights into the process.

This architecture uses AWS services to deliver a scalable, secure, and efficient evaluation solution for Amazon Q Business, combining automated and human-driven evaluations.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same AWS Region.

Deploy the CloudFormation stack

Complete the following steps to deploy the CloudFormation stack:

  1. Clone the repository or download the files to your local computer.
  2. Unzip the downloaded file (if you used this option).
  3. Using your local computer command line, use the ‘cd’ command and change directory into ./sample-code-for-evaluating-amazon-q-business-applications-using-ragas-main/end-to-end-solution
  4. Make sure the ./deploy.sh script can run by executing the command chmod 755 ./deploy.sh.
  5. Execute the CloudFormation deployment script provided as follows:
    ./deploy.sh -s [CNF_STACK_NAME] -r [AWS_REGION]

You can follow the deployment progress on the AWS CloudFormation console. It takes approximately 15 minutes to complete the deployment, after which you will see a page similar to the following screenshot.

Add users to Amazon Q Business

You need to provision users for the pre-created Amazon Q Business application. Refer to Setting up for Amazon Q Business for instructions to add users.

Upload the evaluation dataset through the UI

In this section, you review and upload the following CSV file containing an evaluation dataset through the deployed custom UI.

This CSV file contains two columns: prompt and ground_truth. There are four prompts and their associated ground truth in this dataset:

  • What are the index types of Amazon Q Business and the features of each?
  • I want to use Q Apps, which subscription tier is required to use Q Apps?
  • What is the file size limit for Amazon Q Business via file upload?
  • What data encryption does Amazon Q Business support?

To upload the evaluation dataset, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose the evals stack that you already launched.
  3. On the Outputs tab, take note of the user name and password to log in to the UI application, and choose the UI URL.

The custom UI will redirect you to the Amazon Cognito login page for authentication.

The UI application authenticates the user with Amazon Cognito, and initiates the token exchange workflow to implement a secure Chatsync API call with Amazon Q Business.

  4. Use the credentials you noted earlier to log in.

For more information about the token exchange flow between IAM Identity Center and the identity provider (IdP), refer to Building a Custom UI for Amazon Q Business.

  5. After you log in to the custom UI used for Amazon Q evaluation, choose Upload Dataset, then upload the dataset CSV file.

After the file is uploaded, the evaluation framework will send the prompt to Amazon Q Business to generate the answer, and then send the prompt, ground truth, and answer to Ragas to evaluate. During this process, you can also review the uploaded dataset (including the four questions and associated ground truth) on the Amazon Q Business console, as shown in the following screenshot.

After about 7 minutes, the workflow will finish, and you should see the evaluation result for the first question.

Perform HITL evaluation

After the Lambda function has completed its execution, the Ragas scoring will be shown in the custom UI. Now you can review the metric scores generated using Ragas (an LLM-aided evaluation method) and provide human feedback as an evaluator for further calibration. This human-in-the-loop calibration can further improve the evaluation accuracy, because the HITL process is particularly valuable in fields where human judgment, expertise, or ethical considerations are crucial.

Let’s review the first question: “What are the index types of Amazon Q Business and the features of each?” You can read the question, Amazon Q Business generated answers, ground truth, and context.

Next, review the evaluation metrics scored by using Ragas. As discussed earlier, there are four metrics:

  • Answer relevancy – Measures relevancy of answers. Higher scores indicate better alignment with the user input, and lower scores are given if the response is incomplete or includes redundant information.
  • Truthfulness – Verifies factual accuracy by comparing responses to verified sources. Higher scores indicate a better consistency with verified sources.
  • Context precision – Assesses the relevance and conciseness of retrieved information. Higher scores indicate that the retrieved information closely matches the query intent, reducing irrelevant data.
  • Context recall – Measures how many of the relevant documents (or pieces of information) were successfully retrieved. It focuses on not missing important results. Higher recall means fewer relevant documents were left out.

For this question, all metrics showed Amazon Q Business achieved a high-quality response. It’s worthwhile to compare your own evaluation with these scores generated by Ragas.

Next, let’s review a question that returned with a low answer relevancy score. For example: “I want to use Q Apps, which subscription tier is required to use Q Apps?”

Analyzing both question and answer, we can consider the answer relevant and aligned with the user question, but the answer relevancy score from Ragas doesn’t reflect this human analysis, showing a lower score than expected. It’s important to calibrate the Ragas evaluation judgment with a human in the loop. You should read the question and answer carefully, and make the necessary changes to the metric score to reflect the HITL analysis. Finally, the results will be updated in DynamoDB.

Lastly, save the metric score in the CSV file, and you can download and review the final metric scores.

Solution 2: Lambda based evaluation

If you’re already using Amazon Q Business, AmazonQEvaluationLambda allows for quick integration of evaluation methods into your application without setting up a custom UI application. It offers the following key features:

  • Evaluates responses from Amazon Q Business using Ragas against a predefined test set of questions and ground truth data
  • Outputs evaluation metrics that can be visualized directly in Amazon CloudWatch
  • Like the end-to-end solution, it provides results based on the input dataset and the responses from the Amazon Q Business application, using Ragas to evaluate four key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness).

This solution provides you sample code to evaluate the Amazon Q Business application response. To use this solution, you need to have or create a working Amazon Q Business application integrated with IAM Identity Center or Amazon Cognito as an IdP. This Lambda function works in the same way as the Lambda function in the end-to-end evaluation solution, using Ragas against a test set of questions and ground truth. This lightweight solution doesn’t have a custom UI, but it can provide result metrics (context recall, context precision, answer relevancy, truthfulness) for visualization in CloudWatch. For deployment instructions, refer to the following GitHub repo.

Using evaluation results to improve Amazon Q Business application accuracy

This section outlines strategies to enhance key evaluation metrics—context recall, context precision, answer relevance, and truthfulness—for a RAG solution in the context of Amazon Q Business.

Context recall

Let’s examine the following problems and troubleshooting tips:

  1. Aggressive query filtering – Overly strict search filters or metadata constraints might exclude relevant records. You should review the metadata filters or boosting settings applied in Amazon Q Business to make sure they don’t unnecessarily restrict results.
  2. Data source ingestion errors – Documents from certain data sources aren’t successfully ingested into Amazon Q Business. To address this, check the document sync history report in Amazon Q Business to confirm successful ingestion and resolve ingestion errors.

Context precision

Consider the following potential issues:

  • Over-retrieval of documents – Large top-K values might retrieve semi-related or off-topic passages, which the LLM might incorporate unnecessarily. To address this, refine metadata filters or apply boosting to improve passage relevance and reduce noise in the retrieved context.
  • Poor query specificity – Broad or poorly formed user queries can yield loosely related results. You should make sure user queries are clear and specific. Train users or implement query refinement mechanisms to optimize query quality.

Answer relevance

Consider the following troubleshooting methods:

  • Partial coverage – Retrieved context addresses parts of the question but fails to cover all aspects, especially in multi-part queries. To address this, decompose complex queries into sub-questions. Instruct the LLM or a dedicated module to retrieve and answer each sub-question before composing the final response. For example:
    • Break down the query into sub-questions.
    • Retrieve relevant passages for each sub-question.
    • Compose a final answer addressing each part.
  • Context/answer mismatch – The LLM might misinterpret retrieved passages, omit relevant information, or merge content incorrectly due to hallucination. You can use prompt engineering to guide the LLM more effectively. For example, for the original query “What are the top 3 reasons for X?” you can use the rewritten prompt “List the top 3 reasons for X clearly labeled as #1, #2, and #3, based strictly on the retrieved context.”

Truthfulness

Consider the following:

  • Stale or inaccurate data sources – Outdated or conflicting information in the knowledge corpus might lead to incorrect answers. To address this, compare the retrieved context with verified sources to confirm accuracy. Collaborate with SMEs to validate the data.
  • LLM hallucination – The model might fabricate or embellish details, even with accurate retrieved context. Although Amazon Q Business is a RAG generative AI solution, and should significantly reduce the hallucination, it’s not possible to eliminate hallucination totally. You can measure the frequency of low context precision answers to identify patterns and quantify the impact of hallucinations to gain an aggregated view with the evaluation solution.

By systematically examining and addressing the root causes of low evaluation metrics, you can optimize your Amazon Q Business application. From document retrieval and ranking to prompt engineering and validation, these strategies will help enhance the effectiveness of your RAG solution.

Clean up

Don’t forget to go back to the CloudFormation console and delete the CloudFormation stack to delete the underlying infrastructure that you set up, to avoid additional costs on your AWS account.

Conclusion

In this post, we outlined two evaluation solutions for Amazon Q Business: a comprehensive evaluation workflow and a lightweight Lambda based evaluation. These approaches combine automated evaluation approaches such as Ragas with human-in-the-loop validation, providing reliable and accurate assessments.

By using our guidance on how to improve evaluation metrics, you can continuously optimize your Amazon Q Business application to meet enterprise needs with Amazon Q Business. Whether you’re using the end-to-end solution or the lightweight approach, these frameworks provide a scalable and efficient path to improve accuracy and relevance.

To learn more about Amazon Q Business and how to evaluate Amazon Q Business results, explore these hands-on workshops:


About the authors

Rui Cardoso is a partner solutions architect at Amazon Web Services (AWS). He is focusing on AI/ML and IoT. He works with AWS Partners and supports them in developing solutions in AWS. When not working, he enjoys cycling, hiking and learning new things.

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She is specialized in Generative AI, Applied Data Science and IoT architecture. Currently she is part of the Amazon Bedrock team, and a Gold member/mentor in Machine Learning Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome generative AI solutions. She is particularly passionate about leveraging Large Language Models for advanced data analytics and exploring practical applications that address real-world challenges.

Amit Gupta is a Senior Q Business Solutions Architect at AWS. He is passionate about enabling customers with well-architected generative AI solutions at scale.

Neil Desai is a technology executive with over 20 years of experience in artificial intelligence (AI), data science, software engineering, and enterprise architecture. At AWS, he leads a team of Worldwide AI services specialist solutions architects who help customers build innovative Generative AI-powered solutions, share best practices with customers, and drive product roadmap. He is passionate about using technology to solve real-world problems and is a strategic thinker with a proven track record of success.

Ricardo Aldao is a Senior Partner Solutions Architect at AWS. He is a passionate AI/ML enthusiast who focuses on supporting partners in building generative AI solutions on AWS.

Read More

Use Amazon Bedrock Intelligent Prompt Routing for cost and latency benefits

In December, we announced the preview availability for Amazon Bedrock Intelligent Prompt Routing, which provides a single serverless endpoint to efficiently route requests between different foundation models within the same model family. To do this, Amazon Bedrock Intelligent Prompt Routing dynamically predicts the response quality of each model for a request and routes the request to the model it determines is most appropriate based on cost and response quality, as shown in the following figure.

Today, we’re happy to announce the general availability of Amazon Bedrock Intelligent Prompt Routing. Over the past several months, we have made several improvements to intelligent prompt routing based on customer feedback and extensive internal testing. Our goal is to enable you to set up automated, optimal routing between large language models (LLMs). Amazon Bedrock Intelligent Prompt Routing builds a deep understanding of model behaviors within each model family, incorporating state-of-the-art methods for training routers across different sets of models, tasks and prompts.

In this blog post, we detail highlights from our internal testing, explain how to get started, and point out some caveats and best practices. We encourage you to incorporate Amazon Bedrock Intelligent Prompt Routing into your new and existing generative AI applications. Let’s dive in!

Highlights and improvements

Today, you can either use Amazon Bedrock Intelligent Prompt Routing with the default prompt routers provided by Amazon Bedrock, or configure your own prompt routers to adjust the performance trade-off between the two candidate LLMs. Default prompt routers are pre-configured routing systems, provided by Amazon Bedrock for each model family, that aim to match the performance of the more capable of the two models while lowering costs by sending easier prompts to the less expensive model. They come with predefined settings and are designed to work out of the box with specific foundation models, providing a straightforward, ready-to-use solution without any routing configuration. Customers who tested Amazon Bedrock Intelligent Prompt Routing in preview (thank you!) could choose models from the Anthropic and Meta families. Today, you can choose more models from within the Amazon Nova, Anthropic, and Meta families, including:

  • Anthropic’s Claude family: Claude 3 Haiku, Claude 3.5 Sonnet v1, Claude 3.5 Haiku and Claude 3.5 Sonnet v2
  • Meta’s Llama family: Llama 3.1 8B and 70B, Llama 3.2 11B and 90B, and Llama 3.3 70B
  • Amazon Nova family: Nova Pro and Nova Lite

You can also configure your own prompt routers to define your own routing configurations tailored to specific needs and preferences. These are more suitable when you require more control over how to route your requests and which models to use. In GA, you can configure your own router by selecting any two models from the same model family and then configuring the response quality difference of your router.

Adding components before invoking the selected LLM with the original prompt can add overhead. We reduced the overhead of these added components by over 20%, to approximately 85 ms (P90). Because the router preferentially invokes the less expensive model while maintaining the same baseline accuracy on the task, you can expect an overall latency and cost benefit compared to always using the larger, more expensive model, despite the additional overhead. This is discussed further in the benchmark results section that follows.

We conducted several internal tests with proprietary and public data to evaluate Amazon Bedrock Intelligent Prompt Routing metrics. First, we used average response quality gain under cost constraints (ARQGC), a normalized (0–1) performance metric for measuring routing system quality for various cost constraints, referenced against a reward model, where 0.5 represents random routing and 1 represents optimal oracle routing performance. We also captured the cost savings with intelligent prompt routing relative to using the largest model in the family, and estimated latency benefit based on average recorded time to first token (TTFT) to showcase the advantages and report them in the following table.

Model family | Average ARQGC (router overall performance) | Cost savings (%) | Latency benefit (%)
Nova         | 0.75                                       | 35%              | 9.98%
Anthropic    | 0.86                                       | 56%              | 6.15%
Meta         | 0.78                                       | 16%              | 9.38%

Cost savings and latency benefit are measured when configuring the router to match the performance of the strong model in the family.

How to read this table?

It’s important to pause and understand these metrics. First, the results shown in the preceding table are only meant for comparison against random routing within the family (that is, improvement in ARQGC over 0.5), not across families. Second, the results are relevant only within a family of models and differ from the model benchmarks you might be familiar with that are used to compare models. Third, because real costs and prices change frequently and depend on input and output token counts, comparing real costs is challenging. To solve this problem, we define the cost savings metric as the maximum cost saved, relative to the cost of the strongest LLM, for a router to achieve a certain level of response quality. Specifically, in the example shown in the table, there is an average 35% cost savings using the Nova family router compared to using Nova Pro for all prompts without the router.
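One way to make this definition concrete (our own framing, not a formula from the Amazon Bedrock documentation) is:

cost_savings(q) = max over routing configurations achieving response quality >= q of ( 1 - C_router / C_strongest )

where C_router is the total inference cost when routing and C_strongest is the cost of sending every prompt to the strongest model in the family. Under this reading, the 35% figure for the Nova router corresponds to a routed cost of roughly 65% of the all-Nova-Pro cost at the matched quality level.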

You can expect to see varying levels of benefit based on your use case. For example, in an internal test with hundreds of prompts, we achieved 60% cost savings using Amazon Bedrock Intelligent Prompt Routing with the Anthropic family, with response quality matching that of Claude 3.5 Sonnet v2.

What is response quality difference?

The response quality difference measures the disparity between the responses of the fallback model and the other models. A smaller value indicates that the responses are similar, while a higher value indicates a significant difference between the fallback model and the other models. Your choice of fallback model matters. When you configure a response quality difference of 10% with Anthropic’s Claude 3 Sonnet as the fallback model, the router dynamically selects an LLM to achieve overall performance within a 10% drop in response quality relative to Claude 3 Sonnet. Conversely, if you use a less expensive model such as Claude 3 Haiku as the fallback model, the router dynamically selects an LLM to achieve overall performance with more than a 10% increase over Claude 3 Haiku.

In the following figure, you can see that the response quality difference is set at 10% with Haiku as the fallback model. If customers want to explore optimal configurations beyond the default settings described previously, they can experiment with different response quality difference thresholds, analyze the router’s response quality, cost, and latency on their development dataset, and select the configuration that best fits their application’s requirements.

When configuring your own prompt router, you can set the threshold for response quality difference as shown in the following image of the Configure prompt router page, under Response quality difference (%) in the Amazon Bedrock console. To do this by using APIs, see How to use intelligent prompt routing.

Benchmark results

When using different model pairings, the ability of the smaller model to serve a larger share of input prompts can yield significant latency and cost benefits, depending on the model choice and the use case. For example, when comparing routing with Claude 3 Haiku versus Claude 3.5 Haiku, each paired with Claude 3.5 Sonnet, we observed the following with one of our internal datasets:

Case 1: Routing between Claude 3 Haiku and Claude 3.5 Sonnet V2: Cost savings of 48% while maintaining the same response quality as Claude 3.5 Sonnet v2

Case 2: Routing between Claude 3.5 Haiku and Claude 3.5 Sonnet V2: Cost savings of 56% while maintaining the same response quality as Claude 3.5 Sonnet v2

As cases 1 and 2 show, when the capabilities of less expensive models improve relative to more expensive models in the same family (for example, from Claude 3 Haiku to Claude 3.5 Haiku), more complex tasks can be reliably solved by them, resulting in a higher percentage of prompts routed to the less expensive model while still maintaining the same overall accuracy on the task.

We encourage you to test the effectiveness of Amazon Bedrock Intelligent Prompt Routing on your specialized task and domain because results can vary. For example, when we tested Amazon Bedrock Intelligent Prompt Routing with open-source and internal Retrieval Augmented Generation (RAG) datasets, we saw an average 63.6% cost savings, because a higher percentage (87%) of prompts were routed to Claude 3.5 Haiku while still maintaining the baseline accuracy of the larger, more expensive model (Claude 3.5 Sonnet v2 in the following figure) alone, averaged across RAG datasets.

Getting started

You can get started using the AWS Management Console for Amazon Bedrock. As mentioned earlier, you can create your own router or use a default router:

Use the console to configure a router:

  1. In the Amazon Bedrock console, choose Prompt Routers in the navigation pane, and then choose Configure prompt router.
  2. You can then use a previously configured router or a default router in the console-based playground. For example, in the following figure, we attached a 10-K document from Amazon.com and asked a specific question about the cost of sales.
  3. Choose the router metrics icon (next to the refresh icon) to see which model the request was routed to. Because this is a nuanced question, Amazon Bedrock Intelligent Prompt Routing correctly routes to Claude 3.5 Sonnet V2 in this case, as shown in the following figure.

You can also use the AWS Command Line Interface (AWS CLI) or API to configure and use a prompt router.

To use the AWS CLI or API to configure a router:

AWS CLI:

aws bedrock create-prompt-router \
    --prompt-router-name my-prompt-router \
    --models '[{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}, {"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelB>"}]' \
    --fallback-model '{"modelArn": "arn:aws:bedrock:<region>::foundation-model/<modelA>"}' \
    --routing-criteria '{"responseQualityDifference": 0.5}'

Boto3 SDK:

import boto3

# Amazon Bedrock control-plane client (prompt routers are created here;
# inference calls go through the "bedrock-runtime" client).
client = boto3.client("bedrock", region_name="<region>")

response = client.create_prompt_router(
    promptRouterName='my-prompt-router',
    # The two candidate models to route between (pairwise routing).
    models=[
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
        },
        {
            'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelB>'
        },
    ],
    description='Routes between <modelA> and <modelB>',
    # Acceptable response quality difference relative to the fallback model.
    routingCriteria={
        'responseQualityDifference': 0.5
    },
    # The model to fall back to.
    fallbackModel={
        'modelArn': 'arn:aws:bedrock:<region>::foundation-model/<modelA>'
    },
    tags=[
        {
            'key': 'project',
            'value': 'intelligent-prompt-routing'
        },
    ]
)
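After the router is created, you can invoke it much like a foundation model by passing the prompt router ARN as the modelId in the Amazon Bedrock runtime Converse API. The following is a minimal sketch; the ARN is a placeholder (create_prompt_router returns the actual ARN), and the exact shape of the trace information showing the routed model is our assumption rather than a documented contract.

import boto3

# Runtime client for inference calls.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    # Placeholder ARN of the prompt router created above (or of a default router).
    modelId="arn:aws:bedrock:us-east-1:<account-id>:prompt-router/<router-id>",
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize the key risks discussed in the attached 10-K excerpt."}],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])

# Assumption: the response trace indicates which candidate model served the request.
print(response.get("trace", {}))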

Caveats and best practices

When using intelligent prompt routing in Amazon Bedrock, note that:

  • Amazon Bedrock Intelligent Prompt Routing is optimized for English prompts for typical chat assistant use cases. For use with other languages or customized use cases, conduct your own tests before implementing prompt routing in production applications or reach out to your AWS account team for help designing and conducting these tests.
  • You can select only two models to be part of the router (pairwise routing), with one of these two models being the fallback model. These two models have to be in the same AWS Region.
  • When starting with Amazon Bedrock Intelligent Prompt Routing, we recommend that you experiment using the default routers provided by Amazon Bedrock before trying to configure custom routers. After you’ve experimented with default routers, you can configure your own routers as needed for your use cases, evaluate the response quality in the playground, and use them in production applications if they meet your requirements.
  • Amazon Bedrock Intelligent Prompt Routing can’t adjust routing decisions or responses based on application-specific performance data currently and might not always provide the most optimal routing for unique or specialized, domain-specific use cases. Contact your AWS account team for customization help on specific use cases.

Conclusion

In this post, we explored Amazon Bedrock Intelligent Prompt Routing, highlighting its ability to help optimize both response quality and cost by dynamically routing requests between different foundation models. Benchmark results demonstrate significant cost savings and latency benefits while maintaining high-quality responses across model families. Whether you implement the pre-configured default routers or create custom configurations, Amazon Bedrock Intelligent Prompt Routing offers a powerful way to balance performance and efficiency in generative AI applications. As you implement this feature in your workflows, we recommend testing its effectiveness for your specific use cases to take full advantage of the flexibility it provides. To get started, see Understanding intelligent prompt routing in Amazon Bedrock.


About the authors

Shreyas Subramanian is a Principal Data Scientist who helps customers solve their business challenges with generative AI and deep learning using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Balasubramaniam Srinivasan is a Senior Applied Scientist at AWS, working on post-training methods for generative AI models. He enjoys enriching ML models with domain-specific knowledge and inductive biases to delight customers. Outside of work, he enjoys playing and watching tennis and football (soccer).

Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interest includes generative models and sequential data modeling.

Haibo Ding is a senior applied scientist at Amazon Machine Learning Solutions Lab. He is broadly interested in Deep Learning and Natural Language Processing. His research focuses on developing new explainable machine learning models, with the goal of making them more efficient and trustworthy for real-world problems. He obtained his Ph.D. from University of Utah and worked as a senior research scientist at Bosch Research North America before joining Amazon. Apart from work, he enjoys hiking, running, and spending time with his family.

Read More

How Infosys improved accessibility for Event Knowledge using Amazon Nova Pro, Amazon Bedrock and Amazon Elemental Media Services

How Infosys improved accessibility for Event Knowledge using Amazon Nova Pro, Amazon Bedrock and Amazon Elemental Media Services

This post is co-written with Saibal Samaddar, Tanushree Halder, and Lokesh Joshi from Infosys Consulting.

Critical insights and expertise are concentrated among thought leaders and experts across the globe. Language barriers often hinder the distribution and comprehension of this knowledge during crucial encounters. Workshops, conferences, and training sessions serve as platforms for collaboration and knowledge sharing, but they deliver the most value when attendees can understand the information being conveyed in real time and in their preferred language.

Infosys, a leading global IT services and consulting organization, used its digital expertise to tackle this challenge by pioneering Infosys Event AI, an innovative AI-based event assistant. Infosys Event AI is designed to make knowledge universally accessible, making sure that valuable insights are not lost and can be efficiently used by individuals and organizations across diverse industries, both during an event and after it has concluded. Without such a system, effective knowledge sharing and utilization are hindered, limiting the overall impact of events and workshops. By transforming ephemeral event content into a persistent and searchable knowledge asset, Infosys Event AI seeks to enhance knowledge utilization and impact.

Some of the challenges in capturing and accessing event knowledge include:

  • Knowledge from events and workshops is often lost due to inadequate capture methods, with traditional note-taking being incomplete and subjective.
  • Reviewing lengthy recordings to find specific information is time-consuming and inefficient, creating barriers to knowledge retention and sharing.
  • People who miss events face significant obstacles accessing the knowledge shared, impacting sectors like education, media, and public sector where information recall is crucial.

To address these challenges, Infosys partnered with Amazon Web Services (AWS) to develop the Infosys Event AI to unlock the insights generated during events. In this post, we explain how Infosys built the Infosys Event AI solution using several AWS services including:

Solution Architecture

In this section, we present an overview of Event AI, highlighting its key features and workflow. Event AI delivers these core functionalities, as illustrated in the architecture diagram that follows:

  1. Seamless live stream acquisition from on-premises sources
  2. Real-time transcription processing for speech-to-text conversion
  3. Post-event processing and knowledge base indexing for structured information retrieval
  4. Automated generation of session summaries and key insights to enhance accessibility
  5. AI-powered chat-based assistant for interactive Q&A and efficient knowledge retrieval from the event session

Solution walkthrough

Next, we break down each functionality in detail. The services used in the solution are granted least-privilege permissions through AWS Identity and Access Management (IAM) policies for security purposes.

Seamless live stream acquisition

The solution begins with an IP-enabled camera capturing the live event feed, as shown in the following section of the architecture diagram. This stream is securely and reliably transported to the cloud using the Secure Reliable Transport (SRT) protocol through MediaConnect. The ingested stream is then received and processed by MediaLive, which encodes the video in real time and generates the necessary outputs.

The workflow follows these steps:

  1. Use an IP-enabled camera or ground encoder to convert non-IP streams into IP streams and transmit them through SRT protocol to MediaConnect for live event ingestion.
  2. MediaConnect securely transmits the stream to MediaLive for processing.

Real-time transcription processing

To facilitate real-time accessibility, the system uses MediaLive to isolate audio from the live video stream. This audio-only stream is then forwarded to a real-time transcriber module. The real-time transcriber module, hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance, uses the Amazon Transcribe streaming API to generate transcriptions with minimal latency. These real-time transcriptions are subsequently delivered to an on-premises web client through secure WebSocket connections. The following screenshot shows a brief demo based on a fictitious scenario to illustrate Event AI’s real-time streaming capability.

The workflow for this part of the solution follows these steps:

  1. MediaLive extracts the audio from the live stream and creates an audio-only stream, which it then sends to the real-time transcriber module running on an EC2 instance. MediaLive also extracts the audio-only output and stores it in an Amazon Simple Storage Service (Amazon S3) bucket, facilitating a subsequent postprocessing workflow.
  2. The real-time transcriber module receives the audio-only stream and employs the Amazon Transcribe streaming API to produce real-time transcriptions with low latency.
  3. The real-time transcriber module uses a secure WebSocket to transmit the transcribed text.
  4. The on-premises web client receives the transcribed text through a secure WebSocket connection through Amazon CloudFront and displays it on the web client’s UI.

The following diagram shows the live-stream acquisition and real-time transcription.
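To make step 2 concrete, the following is a minimal sketch using the open-source amazon-transcribe streaming SDK for Python. The PCM file name stands in for the MediaLive audio-only output, and forwarding text over the WebSocket is reduced to a print statement; both are illustrative assumptions, not details from the Infosys deployment.

import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class PrintHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        # Forward only finalized segments (a real deployment would push these over the WebSocket).
        for result in transcript_event.transcript.results:
            if not result.is_partial:
                print(result.alternatives[0].transcript)


async def transcribe_live_audio():
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def send_audio():
        # Stand-in for the audio-only stream produced by MediaLive: raw 16-bit mono PCM at 16 kHz.
        with open("session_audio.pcm", "rb") as audio:
            while chunk := audio.read(1024 * 8):
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())


if __name__ == "__main__":
    asyncio.run(transcribe_live_audio())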

Post-event processing and knowledge base indexing

After the event concludes, recorded media and transcriptions are securely stored in Amazon S3 for further analysis. A serverless, event-driven workflow using Amazon EventBridge and AWS Lambda automates the post-event processing. Amazon Transcribe processes the recorded content to generate the final transcripts, which are then indexed and stored in an Amazon Bedrock knowledge base for seamless retrieval. Additionally, Amazon Nova Pro enables multilingual translation of the transcripts, providing global accessibility when needed. With its quality and speed, Amazon Nova Pro is ideally suited for this global use case.

The workflow for this part of the process follows these steps:

  1. After the event concludes, MediaLive sends a channel stopped notification to EventBridge
  2. A Lambda function, subscribed to the channel stopped event, triggers post-event transcription using Amazon Transcribe
  3. The transcribed content is processed and stored in an S3 bucket
  4. (Optional) Amazon Nova Pro translates transcripts into multiple languages for broader accessibility using Amazon Bedrock
  5. Amazon Transcribe generates a transcription complete event and sends it to EventBridge
  6. A Lambda function, subscribed to the transcription complete event, triggers the synchronization process with Amazon Bedrock Knowledge Bases
  7. The knowledge is then indexed and stored in Amazon Bedrock knowledge base for efficient retrieval

These steps are shown in the following diagram.
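As a minimal sketch of steps 2 and 3, the Lambda function subscribed to the channel stopped event could look like the following. The bucket names, file name, and media format are illustrative placeholders, not details from the Infosys deployment.

import time

import boto3

transcribe = boto3.client("transcribe")

# Illustrative placeholders for the recording and transcript buckets.
RECORDINGS_BUCKET = "<recordings-bucket>"
TRANSCRIPTS_BUCKET = "<transcripts-bucket>"


def handler(event, context):
    """Triggered by the EventBridge rule for the MediaLive channel stopped event."""
    job_name = f"event-session-{int(time.time())}"
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{RECORDINGS_BUCKET}/session_audio.mp4"},
        MediaFormat="mp4",
        LanguageCode="en-US",
        OutputBucketName=TRANSCRIPTS_BUCKET,
    )
    return {"transcriptionJob": job_name}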

Automated generation of session summaries and key insights

To enhance user experience, the solution uses Amazon Bedrock to analyze the transcriptions and generate concise session summaries and key insights. These insights help users quickly understand the essence of the event without going through lengthy transcripts. The following screenshot shows Infosys Event AI’s summarization capability.

The workflow for this part of the solution follows these steps:

  1. Users authenticate to the web client portal using Amazon Cognito. Once authenticated, the user selects an option in the portal UI to view the summaries and key insights.
  2. The user request is delegated to the AI assistant module, where it fetches the complete transcript from the S3 bucket.
  3. The transcript is processed by Amazon Nova Pro on Amazon Bedrock, guided by Amazon Bedrock Guardrails. In line with responsible AI policies, this produces summaries and key insights that are safeguarded for the user, as illustrated in the sketch that follows.
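A minimal sketch of step 3, assuming the transcript has already been fetched from Amazon S3 and a guardrail has already been created in Amazon Bedrock Guardrails; the file name and guardrail identifier are placeholders.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Transcript previously fetched from the S3 bucket by the AI assistant module.
with open("session_transcript.txt") as f:
    transcript = f.read()

response = bedrock_runtime.converse(
    # Amazon Nova Pro (in some Regions an inference profile ID such as us.amazon.nova-pro-v1:0 is used instead).
    modelId="amazon.nova-pro-v1:0",
    messages=[
        {
            "role": "user",
            "content": [{"text": f"Summarize this session and list the key insights:\n\n{transcript}"}],
        }
    ],
    # Placeholder guardrail created in Amazon Bedrock Guardrails.
    guardrailConfig={
        "guardrailIdentifier": "<guardrail-id>",
        "guardrailVersion": "1",
    },
)

summary = response["output"]["message"]["content"][0]["text"]
print(summary)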

AI-powered chat-based assistant

A key feature of this architecture is an AI-powered chat assistant, which is used to interactively query the event knowledge base. The chat assistant is powered by Amazon Bedrock and retrieves information from the Amazon OpenSearch Serverless index, enabling seamless access to session insights.

The workflow for this part of the solution follows these steps:

  1. Authenticated users engage with the chat assistant using natural language to request specific event messaging details from the client web portal.
  2. The user prompt is directed to the AI assistant module for processing.
  3. The AI assistant module queries Amazon Bedrock Knowledge Bases for relevant answers.
  4. The retrieved content is processed by Amazon Nova Pro, guided by Amazon Bedrock Guardrails, to generate grounded responses and safeguard key insights. The integration of Amazon Bedrock Guardrails promotes professional, respectful interactions by blocking undesirable and harmful content during user interactions, in line with responsible AI policies.

The following demo demonstrates Event AI’s Q&A capability.

The steps for automated generation of insights and AI-chat assistant are shown in the following diagram.
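For steps 3 and 4, a minimal sketch using the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API could look like the following; the knowledge base ID, Region, model ARN, and question are placeholders, and guardrails are omitted for brevity.

import boto3

agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What were the key announcements in the responsible AI keynote?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            # Placeholder ID of the knowledge base synchronized after the event.
            "knowledgeBaseId": "<knowledge-base-id>",
            # Foundation model (or inference profile) used for generation; placeholder ARN.
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
        },
    },
)

print(response["output"]["text"])

# Each response can also carry citations pointing back to the indexed transcript chunks.
for citation in response.get("citations", []):
    print(citation)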

Results and Impact

Infosys Event AI was launched in February 2025 during a responsible AI conference in Bangalore, India, hosted by Infosys in partnership with the British High Commission.

  • Infosys Event AI was used by more than 800 conference attendees
  • It was used by around 230 people every minute during the event
  • The intelligent chat assistant was queried an average of 57 times every minute during the event
  • A total of more than 9,000 event session summaries were generated during the event

By using the solution, Infosys was able to realize the following key benefits for their internal users and for their customers:

  • Enhanced knowledge retention – During events, Infosys Event AI was accessible from both mobile and laptop devices, providing an immersive participation experience for both online and in-person attendees.
  • Improved accessibility – Session knowledge became quickly accessible after the event through transcripts, summaries, and the intelligent chat assistant. The event information is readily available for attendees and for those who couldn’t attend. Furthermore, Infosys Event AI aggregates the session information from previous events, creating a knowledge archival system for information retrieval.
  • Increased engagement – The interactive chat assistant provides deeper engagement during the event sessions, which means users can ask specific questions and receive immediate, contextually relevant answers.
  • Time efficiency – Quick access to summaries and chat responses saves time compared to reviewing full session recordings or manual notes when seeking specific information.

Impacting Multiple Industries

Infosys is positioned to accelerate the adoption of Infosys Event AI across diverse industries:

  • AI-powered meeting management for enterprises – Businesses can use the system for generating meeting minutes, creating training documentation from workshops, and facilitating knowledge sharing within teams. Summaries provide quick recaps of meetings for executives, and transcripts offer detailed records for compliance and reference.
  • Improved transparency and accessibility in the public sector – Parliamentary debates, public hearings, and government briefings are made accessible to the general public through transcripts and summaries, improving transparency and citizen engagement. The platform enables searchable archives of parliamentary proceedings for researchers, policymakers, and the public, creating accessible records for historical reference.
  • Accelerated learnings and knowledge retention in the education sector – Students effectively review lectures, seminars, and workshops through transcripts and summaries, reinforcing learning, and improving knowledge retention. The chat assistant allows for interactive learning and clarification of doubts, acting as a virtual teaching assistant. This is particularly useful in online and hybrid learning environments.
  • Improved media reporting and efficiency in the media industry – Journalists can use Infosys Event AI to rapidly transcribe press conferences, speeches, and interviews, accelerating news cycles and improving reporting accuracy. Summaries provide quick overviews of events, enabling faster news dissemination. The chat assistant facilitates quick fact-checking (with source citation) and information retrieval from event recordings.
  • Improved accessibility and inclusivity across the industry – Real-time transcription provides accessibility for hearing-challenged individuals. Multilingual translation of event transcripts allows participation by attendees for whom the event sessions aren’t in their native language. This promotes inclusivity and a wider participation during events for the purposes of knowledge sharing.

Conclusion

In this post, we explored how Infosys developed Infosys Event AI to unlock the insights generated from events and conferences. Through its suite of features—including real-time transcription, intelligent summaries, and an interactive chat assistant—Infosys Event AI makes event knowledge accessible and provides an immersive engagement solution for the attendees, during and after the event.

Infosys is planning to offer the Infosys Event AI solution to its internal teams and global customers in two versions: as a multi-tenanted, software-as-a-service (SaaS) solution and as a single-deployment solution. Infosys is also adding capabilities to include an event catalogue, knowledge lake, and event archival system to make the event information accessible beyond the scope of the current event. By using AWS managed services, Infosys has made Event AI a readily available, interactive, immersive and valuable resource for students, journalists, policymakers, enterprises, and the public sector. As organizations and institutions increasingly rely on events for knowledge dissemination, collaboration, and public engagement, Event AI is well positioned to unlock the full potential of the events.

Stay updated with new Amazon AI features and releases to advance your AI journey on AWS.


About the Authors

Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS cloud. He is a Cloud Architect with 24+ years of experience designing and developing enterprise, large-scale and distributed software systems. He specializes in generative AI and machine learning with a focus on the data and feature engineering domain. He is an aspiring marathon runner and his hobbies include hiking, bike riding and spending time with his wife and two boys.

Maheshwaran G is a Specialist Solutions Architect working with Media and Entertainment, supporting media companies in India to accelerate growth in innovative ways by leveraging the power of cloud technologies. He is passionate about innovation and currently holds 8 USPTO and 8 IPO granted patents in diversified domains.

Saibal Samaddar is a senior principal consultant at Infosys Consulting and heads the AI Transformation Consulting (AIX) practice in India. He has over eighteen years of business consulting experience, including 11 years in PwC and KPMG, helping organizations drive strategic transformation by harnessing Digital and AI technologies. Known to be a visionary who can navigate complex transformations and make things happen, he has played a pivotal role in winning multiple new accounts for Infosys Consulting (IC).

Tanushree Halder is a principal consultant with Infosys Consulting and is the Lead – CX and Gen AI capability for AI Transformation Consulting (AIX). She has 11 years of experience working with clients in their transformational journeys. She has travelled to over 10 countries to provide her advisory services in AI with clients in BFSI, retail and logistics, hospitality, healthcare and shared services.

Lokesh Joshi is a consultant at Infosys Consulting. He has worked with multiple clients to strategize and integrate AI based solutions for workflow enhancements. He has over 4 years of experience in AI/ML, GenAI development, full Stack development, and cloud services. He specializes in Machine Learning and Data Science with a focus on Deep Learning and NLP. A fitness enthusiast, his hobbies include programming, hiking, and traveling.

Read More

Making Brain Waves: AI Startup Speeds Disease Research With Lab in the Loop

Making Brain Waves: AI Startup Speeds Disease Research With Lab in the Loop

About 15% of the world’s population — over a billion people — are affected by neurological disorders, from commonly known diseases like Alzheimer’s and Parkinson’s to hundreds of lesser-known, rare conditions.

BrainStorm Therapeutics, a San Diego-based startup, is accelerating the development of cures for these conditions using AI-powered computational drug discovery paired with lab experiments using organoids: tiny, 3D bundles of brain cells created from patient-derived stem cells. This hybrid, iterative method, where clinical data and AI models inform one another to accelerate drug development, is known as lab in the loop.

“The brain is the last frontier in modern biology,” said BrainStorm’s founder and CEO Robert Fremeau, who was previously a scientific director in neuroscience at Amgen and a faculty member at Duke University and the University of California, San Francisco. “By combining our organoid disease models with the power of generative AI, we now have the ability to start to unravel the underlying complex biology of disease networks.”

The company aims to lower the failure rate of drug candidates for brain diseases during clinical trials — currently over 93% — and identify therapeutics that can be applied to multiple diseases. Achieving these goals would make it faster and more economically viable to develop treatments for rare and common conditions.

“This alarmingly high clinical trial failure rate is mainly due to the inability of traditional preclinical models with rodents or 2D cells to predict human efficacy,” said Jun Yin, cofounder and chief technology officer at BrainStorm. “By integrating human-derived brain organoids with AI-driven analysis, we’re building a platform that better reflects the complexity of human neurobiology and improves the likelihood of clinical success.”

Fremeau and Yin believe that BrainStorm’s platform has the potential to accelerate development timelines, reduce research and development costs, and significantly increase the probability of bringing effective therapies to patients.

BrainStorm Therapeutics’ AI models, which run on NVIDIA GPUs in the cloud, were developed using the NVIDIA BioNeMo Framework, a set of programming tools, libraries and models for computational drug discovery. The company is a member of NVIDIA Inception, a global network of cutting-edge startups.

Clinical Trial in a Dish

BrainStorm Therapeutics uses AI models to develop gene maps of brain diseases, which they can use to identify promising targets for potential drugs and clinical biomarkers. Organoids allow them to screen thousands of drug molecules per day directly on human brain cells, enabling them to test the effectiveness of potential therapies before starting clinical trials.

“Brains have brain waves that can be picked up in a scan like an EEG, or electroencephalogram, which measures the electrical activity of neurons,” said Maya Gosztyla, the company’s cofounder and chief operating officer. “Our organoids also have spontaneous brain waves, allowing us to model the complex activity that you would see in the human brain in this much smaller system. We treat it like a clinical trial in a dish for studying brain diseases.”

BrainStorm Therapeutics is currently using patient-derived organoids for its work on drug discovery for Parkinson’s disease, a condition tied to the loss of neurons that produce dopamine, a neurotransmitter that helps with physical movement and cognition.

“In Parkinson’s disease, multiple genetic variants contribute to dysfunction across different cellular pathways, but they converge on a common outcome — the loss of dopamine neurons,” Fremeau said. “By using AI models to map and analyze the biological effects of these variants, we can discover disease-modifying treatments that have the potential to slow, halt or even reverse the progression of Parkinson’s.”

The BrainStorm team used single-cell sequencing data from brain organoids to fine-tune foundation models available through the BioNeMo Framework, including the Geneformer model for gene expression analysis. The organoids were derived from patients with mutations in the GBA1 gene, the most common genetic risk factor for Parkinson’s disease.

BrainStorm is also collaborating with the NVIDIA BioNeMo team to help optimize open-source access to the Geneformer model.

Accelerating Drug Discovery Research

With its proprietary platform, BrainStorm can mirror human brain biology and simulate how different treatments might work in a patient’s brain.

“This can be done thousands of times, much quicker and much cheaper than can be done in a wet lab — so we can narrow down therapeutic options very quickly,” Gosztyla said. “Then we can go in with organoids and test the subset of drugs the AI model thinks will be effective. Only after it gets through those steps will we actually test these drugs in humans.”

View of an organoid using Fluorescence Imaging Plate Reader, or FLIPR — a technique used to study the effect of compounds on cells during drug screening.

This technology led to the discovery that Donepezil, a drug prescribed for Alzheimer’s disease, could also be effective in treating Rett syndrome, a rare genetic neurodevelopmental disorder. Within nine months, the BrainStorm team was able to go from organoid screening to applying for a phase 2 clinical trial of the drug in Rett patients. This application was recently cleared by the U.S. Food and Drug Administration.

BrainStorm also plans to develop multimodal AI models that integrate data from cell sequencing, cell imaging, EEG scans and more.

“You need high-quality, multimodal input data to design the right drugs,” said Yin. “AI models trained on this data will help us understand disease better, find more effective drug candidates and, eventually, find prognostic biomarkers for specific patients that enable the delivery of precision medicine.”

The company’s next project is an initiative with the CURE5 Foundation to conduct the most comprehensive repurposed drug screen to date for CDKL5 Deficiency Disorder, another rare genetic neurodevelopmental disorder.

“Rare disease research is transforming from a high-risk niche to a dynamic frontier,” said Fremeau. “The integration of BrainStorm’s AI-powered organoid technology with NVIDIA accelerated computing resources and the NVIDIA BioNeMo platform is dramatically accelerating the pace of innovation while reducing the cost — so what once required a decade and billions of dollars can now be investigated with significantly leaner resources in a matter of months.”

Get started with NVIDIA BioNeMo for AI-accelerated drug discovery.

Read More

Chill Factor: NVIDIA Blackwell Platform Boosts Water Efficiency by Over 300x

Chill Factor: NVIDIA Blackwell Platform Boosts Water Efficiency by Over 300x

Traditionally, data centers have relied on air cooling — where mechanical chillers circulate chilled air to absorb heat from servers, helping them maintain optimal conditions. But as AI models increase in size, and the use of AI reasoning models rises, maintaining those optimal conditions is not only getting harder and more expensive — but more energy-intensive.

While data centers once operated at 20 kW per rack, today’s hyperscale facilities can support over 135 kW per rack, making it an order of magnitude harder to dissipate the heat generated by high-density racks. To keep AI servers running at peak performance, a new approach is needed for efficiency and scalability.

One key solution is liquid cooling — by reducing dependence on chillers and enabling more efficient heat rejection, liquid cooling is driving the next generation of high-performance, energy-efficient AI infrastructure.

The NVIDIA GB200 NVL72 and the NVIDIA GB300 NVL72 are rack-scale, liquid-cooled systems designed to handle the demanding tasks of trillion-parameter large language model inference. Their architecture is also specifically optimized for test-time scaling accuracy and performance, making it an ideal choice for running AI reasoning models while efficiently managing energy costs and heat.

Liquid-cooled NVIDIA Blackwell compute tray.

Driving Unprecedented Water Efficiency and Cost Savings in AI Data Centers

Historically, cooling alone has accounted for up to 40% of a data center’s electricity consumption, making it one of the most significant areas where efficiency improvements can drive down both operational expenses and energy demands.

Liquid cooling helps mitigate costs and energy use by capturing heat directly at the source. Instead of relying on air as an intermediary, direct-to-chip liquid cooling transfers heat in a technology cooling system loop. That heat is then cycled through a coolant distribution unit via liquid-to-liquid heat exchanger, and ultimately transferred to a facility cooling loop. Because of the higher efficiency of this heat transfer, data centers and AI factories can operate effectively with warmer water temperatures — reducing or eliminating the need for mechanical chillers in a wide range of climates.

The NVIDIA GB200 NVL72 rack-scale, liquid-cooled system, built on the NVIDIA Blackwell platform, offers exceptional performance while balancing energy costs and heat. It packs unprecedented compute density into each server rack, delivering 40x higher revenue potential, 30x higher throughput, 25x more energy efficiency and 300x more water efficiency than traditional air-cooled architectures. Newer NVIDIA GB300 NVL72 systems built on the Blackwell Ultra platform boast a 50x higher revenue potential and 35x higher throughput with 30x more energy efficiency.

Data centers spend an estimated $1.9 million to $2.8 million per megawatt (MW) per year, of which nearly $500,000 is spent annually on cooling-related energy and water costs. By deploying the liquid-cooled GB200 NVL72 system, hyperscale data centers and AI factories can achieve up to 25x cost savings, leading to over $4 million in annual savings for a 50 MW hyperscale data center.

For data center and AI factory operators, this means lower operational costs, enhanced energy efficiency metrics and a future-proof infrastructure that scales AI workloads efficiently — without the unsustainable water footprint of legacy cooling methods.

Moving Heat Outside the Data Center

As compute density rises and AI workloads drive unprecedented thermal loads, data centers and AI factories must rethink how they remove heat from their infrastructure. The traditional methods of heat rejection that supported predictable CPU-based scaling are no longer sufficient on their own. Today, there are multiple options for moving heat outside the facility, but four major categories dominate current and emerging deployments.

Key Cooling Methods in a Changing Landscape

  • Mechanical Chillers: Mechanical chillers use a vapor compression cycle to cool water, which is then circulated through the data center to absorb heat. These systems are typically air-cooled or water-cooled, with the latter often paired with cooling towers to reject heat. While chillers are reliable and effective across diverse climates, they are also highly energy-intensive. In AI-scale facilities where power consumption and sustainability are top priorities, reliance on chillers can significantly impact both operational costs and carbon footprint.
  • Evaporative Cooling: Evaporative cooling uses the evaporation of water to absorb and remove heat. This can be achieved through direct or indirect systems, or hybrid designs. These systems are much more energy-efficient than chillers but come with high water consumption. In large facilities, they can consume millions of gallons of water per megawatt annually. Their performance is also climate-dependent, making them less effective in humid or water-restricted regions.
  • Dry Coolers: Dry coolers remove heat by transferring it from a closed liquid loop to the ambient air using large finned coils, much like an automotive radiator. These systems don’t rely on water and are ideal for facilities aiming to reduce water usage or operate in dry climates. However, their effectiveness depends heavily on the temperature of the surrounding air. In warmer environments, they may struggle to keep up with high-density cooling demands unless paired with liquid-cooled IT systems that can tolerate higher operating temperatures.
  • Pumped Refrigerant Systems: Pumped refrigerant systems use liquid refrigerants to move heat from the data center to outdoor heat exchangers. Unlike chillers, these systems don’t rely on large compressors inside the facility and they operate without the use of water. This method offers a thermodynamically efficient, compact and scalable solution that works especially well for edge deployments and water-constrained environments. Proper refrigerant handling and monitoring are required, but the benefits in power and water savings are significant.

Each of these methods offers different advantages depending on factors like climate, rack density, facility design and sustainability goals. As liquid cooling becomes more common and servers are designed to operate with warmer water, the door opens to more efficient and environmentally friendly cooling strategies — reducing both energy and water use while enabling higher compute performance.

Optimizing Data Centers for AI Infrastructure

As AI workloads grow exponentially, operators are reimagining data center design with infrastructure built specifically for high-performance AI and energy efficiency. Whether they’re transforming their entire setup into dedicated AI factories or upgrading modular components, optimizing inference performance is crucial for managing costs and operational efficiency.

To get the best performance, high compute capacity GPUs aren’t enough — they need to be able to communicate with each other at lightning speed.

NVIDIA NVLink boosts communication, enabling GPUs to operate as a massive, tightly integrated processing unit for maximum performance with a full-rack power density of 120 kW. This tight, high-speed communication is crucial for today’s AI tasks, where every second saved on transferring data can mean more tokens per second and more efficient AI models.

Traditional air cooling struggles at these power levels. To keep up, data center air would need to be either cooled to below-freezing temperatures or flow at near-gale speeds to carry the heat away, making it increasingly impractical to cool dense racks with air alone.

At nearly 1,000x the density of air, liquid cooling excels at carrying heat away thanks to its superior heat capacity and thermal conductivity. By efficiently transferring heat away from high-performance GPUs, liquid cooling reduces reliance on energy-intensive and noisy cooling fans, allowing more power to be allocated to computation rather than cooling overhead.

Liquid Cooling in Action

Innovators across the industry are leveraging liquid cooling to slash energy costs, improve density and drive AI efficiency.

Cloud service providers are also adopting cutting-edge cooling and power innovations. Next-generation AWS data centers, featuring jointly developed liquid cooling solutions, increase compute power by 12% while reducing energy consumption by up to 46% — all while maintaining water efficiency.

Cooling the AI Infrastructure of the Future

As AI continues to push the limits of computational scale, innovations in cooling will be essential to meeting the thermal management challenges of the post-Moore’s law era.

NVIDIA is leading this transformation through initiatives like the COOLERCHIPS program, a U.S. Department of Energy-backed effort to develop modular data centers with next-generation cooling systems that are projected to reduce costs by at least 5% and improve efficiency by 20% over traditional air-cooled designs.

Looking ahead, data centers must evolve not only to support AI’s growing demands but do so sustainably — maximizing energy and water efficiency while minimizing environmental impact. By embracing high-density architectures and advanced liquid cooling, the industry is paving the way for a more efficient AI-powered future.

Learn more about breakthrough solutions for data center energy and water efficiency presented at NVIDIA GTC 2025 and discover how accelerated computing is driving a more efficient future with NVIDIA Blackwell.

Read More

Keeping AI on the Planet: NVIDIA Technologies Make Every Day About Earth Day

Keeping AI on the Planet: NVIDIA Technologies Make Every Day About Earth Day

Whether at sea, land or in the sky — even outer space — NVIDIA technology is helping research scientists and developers alike explore and understand oceans, wildlife, the climate and far out existential risks like asteroids.

These increasingly intelligent developments are helping to analyze environmental pollutants, damage to habitats and natural disaster risks at an accelerated pace. This, in turn, enables partnerships with local governments to take climate mitigation steps like pollution prevention and proactive planting.

Sailing the Seas of AI

Amphitrite, based in France, uses satellite data with AI to simulate and predict ocean currents and weather. Its AI models, driven by the NVIDIA AI and Earth-2 platforms, offer insights for positioning vessels to best harness the power of ocean currents. This helps determine when it’s best to travel, as well as the optimal course, reducing travel times, fuel consumption and carbon emissions. Amphitrite is a member of the NVIDIA Inception program for cutting-edge startups.

Watching Over Wildlife With AI

Munich, Germany-based OroraTech monitors animal poaching and wildfires with NVIDIA CUDA and Jetson. The NVIDIA Inception program member uses the EarthRanger platform to offer a wildfire detection and monitoring service that uses satellite imagery and AI to safeguard the environment and prevent poaching.

Keeping AI on the Weather

Weather agencies and climate scientists worldwide are using NVIDIA CorrDiff, a generative AI weather model enabling kilometer-scale forecasts of wind, temperature and precipitation type and amount. CorrDiff is part of the NVIDIA Earth-2 platform for simulating weather and climate conditions. It’s available as an easy-to-deploy NVIDIA NIM microservice.

In another climate effort, NVIDIA Research announced a new generative AI model, called StormCast, for reliable weather prediction at a scale larger than storms.

The model, outlined in a paper, can help with disaster and mitigation planning, saving lives.

Avoiding Mass Extinction Events

Researchers reported in Nature how a new method was able to spot 10-meter asteroids within the main asteroid belt between Mars and Jupiter. Such space rocks can range from bus-sized to several Costco stores in width and can deliver destruction to cities. The method used views of these asteroids captured by NASA’s James Webb Space Telescope (JWST) in previous research and was enabled by NVIDIA accelerated computing.

Boosting Energy Efficiency With Liquid-Cooled Blackwell

NVIDIA GB200 NVL72 rack-scale, liquid-cooled systems, built on the Blackwell platform, offer exceptional performance while balancing energy costs and heat. They deliver 40x higher revenue potential, 30x higher throughput, 25x more energy efficiency and 300x more water efficiency than air-cooled architectures. NVIDIA GB300 NVL72 systems, built on the Blackwell Ultra platform, offer 50x higher revenue potential and 35x higher throughput with 30x more energy efficiency.

Learn more about NVIDIA Earth-2 and NVIDIA Blackwell.

Read More