Cadence Taps NVIDIA Blackwell to Accelerate AI-Driven Engineering Design and Scientific Simulation

A new supercomputer offered by Cadence, a leading provider of technology for electronic design automation, is poised to support a suite of engineering design and life sciences applications accelerated by NVIDIA Blackwell systems and NVIDIA CUDA-X software libraries.

Available to deploy in the cloud and on premises, the Millennium M2000 Supercomputer features NVIDIA HGX B200 systems and NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Combined with optimized software, the supercomputer delivers up to 80x higher performance for electronic design automation, system design and life sciences workloads compared to its predecessor, a CPU-based system.

With this boost in computational capability, engineers can run massive simulations to drive breakthroughs in the design and development of autonomous machines, drug molecules, semiconductors, data centers and more.

Anirudh Devgan, president and CEO of Cadence, discussed the collaboration with NVIDIA founder and CEO Jensen Huang onstage at CadenceLIVE, taking place today in Santa Clara, California.

“This is years in the making,” Devgan said during the conversation with Huang. “It’s a combination of advancement on the hardware and system side by NVIDIA — and then, of course, we have to rewrite our software to take advantage of that.”

The pair discussed how NVIDIA and Cadence are working together on AI factories, digital twins and agentic AI.

“The work that we’re doing together recognizes that there’s a whole new type of factory that’s necessary. We call them AI factories,” Huang said. “AI is going to infuse into every single aspect of everything we do. Every company will be run better because of AI, or they’ll build better products because of AI.”

Huang also announced that NVIDIA plans to purchase 10 Millennium Supercomputer systems based on the NVIDIA GB200 NVL72 platform to accelerate the company’s chip design workflows.

“This is a big deal for us,” he said. “We started building our data center to get ready for it.”

Enabling Intelligent Design Across Industries 

The Millennium Supercomputer harnesses accelerated software from NVIDIA and Cadence for applications including circuit simulation, computational fluid dynamics, data center design and molecular design.

Image: Cadence Millennium M2000 Supercomputer

With the supercomputer’s optimized hardware and AI software, engineers and researchers can build more complex, detailed simulations that are capable of delivering more accurate insights to enable faster silicon, systems and drug development.

Through this collaboration, Cadence and NVIDIA are solving key design challenges with diverse applications across industries — for example, simulating thermal dynamics for chip design, fluid dynamics for aerospace applications and molecular dynamics for pharmaceutical research.

NVIDIA engineering teams used Cadence Palladium emulation platforms and Protium prototyping platforms to support design verification and chip bring-up workflows for the development of NVIDIA Blackwell.

Cadence used NVIDIA Grace Blackwell-accelerated systems to calculate the fluid dynamics at work when an aircraft takes off and lands. Using NVIDIA GB200 Grace Blackwell Superchips and the Cadence Fidelity CFD Platform, Cadence ran highly complex simulations in under 24 hours that would take several days to complete on a CPU cluster with hundreds of thousands of cores.

Cadence also used NVIDIA Omniverse application programming interfaces to visualize these intricate fluid dynamics.

Image: Computational fluid dynamics simulation on the wing and engine of an airplane. NVIDIA Blackwell accelerates computer-aided engineering software by orders of magnitude, enabling complex simulations of fluid dynamics for the aerospace industry.

The company has integrated NVIDIA BioNeMo NIM microservices into Orion, Cadence’s molecular design platform — and NVIDIA Llama Nemotron reasoning models into the Cadence JedAI Platform.

Cadence has also adopted the NVIDIA Omniverse Blueprint for AI factory digital twins. Connected to the Cadence Reality Digital Twin Platform, the blueprint enables engineering teams to test and optimize power, cooling and networking in an AI factory with physically based simulations — long before construction starts in the real world. With these capabilities, engineers can make faster configuration decisions and future-proof the next generation of AI factories.

Learn more about the collaboration between NVIDIA and Cadence and watch this NVIDIA GTC session on advancing physics-based simulation technology for AI factory design.

Images courtesy of Cadence.

NVIDIA’s Rama Akkiraju on How AI Platform Architects Help Bridge Business Vision and Technical Execution

Enterprises across industries are exploring AI to rethink problem-solving and redefine business processes. But making these ventures successful requires the right infrastructure, such as AI factories, which allow businesses to convert data into tokens and outcomes.

Rama Akkiraju, vice president of IT for AI and machine learning at NVIDIA, joined the AI Podcast to discuss how enterprises can build the right foundations for AI success.

Drawing on over two decades of experience in the field, Akkiraju provided her perspective on AI’s evolution, from perception AI to generative AI to agentic AI, which allows systems to reason, plan and act autonomously, as well as physical AI, which enables autonomous machines to act in the real world.

What’s striking, Akkiraju pointed out, is the acceleration in the technology’s evolution: the shift from perception to generative AI took about 30 years, but the leap to agentic AI happened in just two. She also emphasized that AI is transforming software development by becoming an integral layer in application architecture — not just a tool.

“Treat AI like a new layer in the development stack, which is fundamentally reshaping the way we write software,” she said.

Akkiraju also spoke about the critical role of AI platform architects in designing and building AI infrastructure based on specific business needs. Enterprise implementations require complex stacks including data ingestion pipelines, vector databases, security controls and evaluation frameworks — and platform architects serve as the bridge between strategic business vision and technical execution.

Looking ahead, Akkiraju identified three trends shaping the future of AI infrastructure: the integration of specialized AI architecture into native enterprise systems, the emergence of domain-specific models and hardware optimized for particular use cases, and increasingly autonomous agentic systems requiring sophisticated memory and context management.

Time Stamps

1:27 – How Akkiraju’s team builds enterprise AI platforms, chatbots and copilots.

4:49 – The accelerated evolution from perception AI to generative AI to agentic AI.

11:22 – The comprehensive stack required for implementing AI in enterprise settings.

29:53 – Three major trends shaping the future of AI infrastructure.

You Might Also Like… 

NVIDIA’s Jacob Liberman on Bringing Agentic AI to Enterprises 

Jacob Liberman, director of product management at NVIDIA, explains how agentic AI bridges the gap between powerful AI models and practical enterprise applications, enabling intelligent multi-agent systems that reason, act and execute complex tasks with autonomy.

Isomorphic Labs Rethinks Drug Discovery With AI 

Isomorphic Labs’ leadership team discusses their AI-first approach to drug discovery, viewing biology as an information processing system and building generalizable AI models capable of learning from the entire universe of protein and chemical interactions.

AI Agents Take Digital Experiences to the Next Level in Gaming and Beyond

AI agents with advanced perception and cognition capabilities are making digital experiences more dynamic and personalized across industries. Inworld AI’s Chris Covert discusses how intelligent digital humans are reshaping interactive experiences, from gaming to healthcare.

Microsoft Fusion Summit explores how AI can accelerate fusion research

Image: Sir Steven Cowley, professor and director of the Princeton Plasma Physics Laboratory and former head of the UK Atomic Energy Authority, giving a presentation.

The pursuit of nuclear fusion as a limitless, clean energy source has long been one of humanity’s most ambitious scientific goals. Research labs and companies worldwide are working to replicate the fusion process that occurs at the sun’s core, where isotopes of hydrogen combine to form helium, releasing vast amounts of energy. While scalable fusion energy is still years away, researchers are now exploring how AI can help accelerate fusion research and bring this energy to the grid sooner. 

In March 2025, Microsoft Research held its inaugural Fusion Summit, a landmark event that brought together distinguished speakers and panelists from within and outside Microsoft Research to explore this question. 

Ashley Llorens, Corporate Vice President and Managing Director of Microsoft Research Accelerator, opened the Summit by outlining his vision for a self-reinforcing system that uses AI to drive sustainability. Steven Cowley, laboratory director of the U.S. Department of Energy’s Princeton Plasma Physics Laboratory, professor at Princeton University, and former head of the UK Atomic Energy Authority, followed with a keynote explaining the intricate science and engineering behind fusion reactors. His message was clear: advancing fusion will require international collaboration and the combined power of AI and high-performance computing to model potential fusion reactor designs.

Applying AI to fusion research

North America’s largest fusion facility, DIII-D, operated by General Atomics and owned by the US Department of Energy (DOE), provides a unique platform for developing and testing AI applications for fusion research, thanks to its pioneering data and digital twin platform.

Richard Buttery from DIII-D and Dave Humphreys from General Atomics demonstrated how the US DIII-D National Fusion Program is already applying AI to advance reactor design and operations, highlighting promising directions for future development. They provided examples of applying AI to active plasma control to avoid disruptive instabilities, using AI-controlled trajectories to avoid tearing modes, and implementing feedback control with machine learning-derived density limits for safer high-density operations.

One persistent challenge in reactor design involves building the interior “first wall,” which must withstand extreme heat and particle bombardment. Zulfi Alam, corporate vice president of Microsoft Quantum, discussed the potential of using quantum computing in fusion, particularly for addressing material challenges like hydrogen diffusion in reactors.

He noted that silicon nitride shows promise as a barrier to hydrogen and vapor and explained the challenge of binding it to the reaction chamber. He emphasized the potential of quantum computing to improve material prediction and synthesis, enabling more efficient processes. He shared that his team is also investigating advanced silicon nitride materials to protect this critical component from neutron and alpha particle damage—an innovation that could make fusion commercially viable.

Exploring AI’s broader impact on fusion engineering

Lightning talks from Microsoft Research labs addressed the central question of AI’s potential to accelerate fusion research and engineering. Speakers covered a wide range of applications—from using gaming AI for plasma control and robotics for remote maintenance to physics-informed AI for simulating materials and plasma behavior. Closing the session, Archie Manoharan, Microsoft’s director of nuclear engineering for Cloud Operations and Infrastructure, emphasized the need for a comprehensive energy strategy, one that incorporates renewables, efficiency improvements, storage solutions, and carbon-free sources like fusion.

The Summit culminated in a thought-provoking panel discussion moderated by Ade Famoti, featuring Archie Manoharan, Richard Buttery, Steven Cowley, and Chris Bishop, Microsoft Technical Fellow and director of Microsoft Research AI for Science. Their wide-ranging conversation explored the key challenges and opportunities shaping the field of fusion. 

The panel highlighted several themes: the role of new regulatory frameworks that balance innovation with safety and public trust; the importance of materials discovery in developing durable fusion reactor walls; and the game-changing role AI could play in plasma optimization and surrogate modelling of fusion’s underlying physics.

They also examined the importance of global research collaboration, citing projects like the International Thermonuclear Experimental Reactor (ITER), the world’s largest experimental fusion device under construction in southern France, as testbeds for shared progress. One persistent challenge, however, is data scarcity. This prompted a discussion of using physics-informed neural networks as a potential approach to supplement limited experimental data. 

Global collaboration and next steps

Using Microsoft 365 Copilot, Azure OpenAI Service, Visual Studio, and GitHub, Microsoft is collaborating with ITER to help advance the technologies and infrastructure needed to achieve fusion ignition—the critical point where a self-sustaining fusion reaction begins. Microsoft Research is now working with ITER to identify where AI can be applied to model future experiments and optimize the reactor’s design and operations. 

Now Microsoft Research has signed a Memorandum of Understanding with the Princeton Plasma Physics Laboratory (PPPL) to foster collaboration through knowledge exchange, workshops, and joint research projects. This effort aims to address key challenges in fusion, materials, plasma control, digital twins, and experiment optimization. Together, Microsoft Research and PPPL will work to drive innovation and advances in these critical areas.

Fusion is a scientific challenge unlike any other and could be key to sustainable energy in the future. We’re excited about the role AI can play in helping make that vision a reality. To learn more, visit the Fusion Summit event page, or connect with us by email at FusionResearch@microsoft.com.

How Deutsche Bahn redefines forecasting using Chronos models – Now available on Amazon Bedrock Marketplace

This post is co-written with Kilian Zimmerer and Daniel Ringler from Deutsche Bahn.

Every day, Deutsche Bahn (DB) moves over 6.6 million passengers across Germany, requiring precise time series forecasting for a wide range of purposes. However, building accurate forecasting models traditionally required significant expertise and weeks of development time.

Today, we’re excited to explore how the time series foundation model Chronos-Bolt, recently launched on Amazon Bedrock Marketplace and available through Amazon SageMaker JumpStart, is revolutionizing time series forecasting by enabling accurate predictions with minimal effort. Whereas traditional forecasting methods typically rely on statistical modeling, Chronos treats time series data as a language to be modeled and uses a pre-trained FM to generate forecasts — similar to how large language models (LLMs) generate text. Chronos helps you achieve accurate predictions faster, significantly reducing development time compared to traditional methods.

In this post, we share how Deutsche Bahn is redefining forecasting using Chronos models, and provide an example use case to demonstrate how you can get started using Chronos.

Chronos: Learning the language of time series

The Chronos model family represents a breakthrough in time series forecasting by using language model architectures. Unlike traditional time series forecasting models that require training on specific datasets, Chronos can be used for forecasting immediately. The original Chronos model quickly became the #1 most downloaded model on Hugging Face in 2024, demonstrating the strong demand for FMs in time series forecasting.

Building on this success, we recently launched Chronos-Bolt, which delivers higher zero-shot accuracy compared to the original Chronos models. It offers the following improvements:

  • Up to 250 times faster inference
  • 20 times better memory efficiency
  • CPU deployment support, making hosting costs up to 10 times less expensive

Now, you can use Amazon Bedrock Marketplace to deploy Chronos-Bolt. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that enables developers to discover, test, and use over 100 popular, emerging, and specialized FMs alongside the current selection of industry-leading models in Amazon Bedrock.

The challenge

Deutsche Bahn, Germany’s national railway company, serves over 1.8 billion passengers annually in long-distance and regional rail passenger transport, making it one of the world’s largest railway operators. For more than a decade, Deutsche Bahn has been innovating together with AWS. AWS is the primary cloud provider for Deutsche Bahn and a strategic partner of DB Systel, a wholly owned subsidiary of DB AG that drives digitalization across all group companies.

Previously, Deutsche Bahn’s forecasting processes were highly heterogeneous across teams, requiring significant effort for each new use case. Different data sources required using multiple specialized forecasting methods, resulting in cost- and time-intensive manual effort. Company-wide, Deutsche Bahn identified dozens of different and independently operated forecasting processes. Smaller teams found it hard to justify developing customized forecasting solutions for their specific needs.

For example, the data analysis platform for passenger train stations of DB InfraGO AG integrates and analyzes diverse data sources, from weather data and SAP Plant Maintenance information to video analytics. Given the diverse data sources, a forecast method that was designed for one data source was usually not transferable to the other data sources.

To democratize forecasting capabilities across the organization, Deutsche Bahn needed a more efficient and scalable approach to handle various forecasting scenarios. Using Chronos, Deutsche Bahn demonstrates how cutting-edge technology can transform enterprise-scale forecasting operations.

Solution overview

A team enrolled in Deutsche Bahn’s accelerator program Skydeck, the innovation lab of DB Systel, developed a time series FM forecasting system using Chronos as the underlying model, in partnership with DB InfraGO AG. This system offers a secured internal API that can be used by Deutsche Bahn teams across the organization for efficient and simple-to-use time series forecasts, without the need to develop customized software.

The following diagram shows a simplified architecture of how Deutsche Bahn uses Chronos.

Image: Architecture diagram of the solution

In the solution workflow, a user passes time series data to Amazon API Gateway, which serves as a secure front door for API calls, handling authentication and authorization. For more information on how to limit access to an API to authorized users only, refer to Control and manage access to REST APIs in API Gateway. An AWS Lambda function then provides serverless compute for processing requests and passing them to the Chronos model for inference. The fastest way to host a Chronos model is by using Amazon Bedrock Marketplace or SageMaker JumpStart.
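
For illustration only, a minimal sketch of the Lambda piece of such a workflow might look like the following. It assumes an API Gateway proxy integration (so the request body arrives as a JSON string in event["body"]) and a hypothetical CHRONOS_ENDPOINT_ARN environment variable pointing at the deployed Chronos-Bolt endpoint; it is not Deutsche Bahn’s actual implementation.

import json
import os

import boto3

# Hypothetical environment variable holding the ARN of the deployed Chronos-Bolt endpoint
CHRONOS_ENDPOINT_ARN = os.environ["CHRONOS_ENDPOINT_ARN"]
bedrock_runtime = boto3.client("bedrock-runtime")

def handler(event, context):
    # API Gateway (proxy integration) delivers the request body as a JSON string
    request = json.loads(event["body"])

    payload = json.dumps(
        {
            "inputs": [{"target": request["target"]}],
            "parameters": {
                "prediction_length": request.get("prediction_length", 64),
                "quantile_levels": [0.1, 0.5, 0.9],
            },
        }
    )

    # Forward the time series to the Chronos-Bolt endpoint for inference
    response = bedrock_runtime.invoke_model(modelId=CHRONOS_ENDPOINT_ARN, body=payload)
    forecasts = json.loads(response["body"].read())

    return {"statusCode": 200, "body": json.dumps(forecasts)}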

Impact and future plans

Deutsche Bahn tested the service on multiple use cases, such as predicting actual costs for construction projects and forecasting monthly revenue for retail operators in passenger stations. The implementation with Chronos models revealed compelling outcomes. The following table depicts the achieved results. In the first use case, we can observe that in zero-shot scenarios (meaning that the model has never seen the data before), Chronos models can achieve accuracy superior to established statistical methods like AutoARIMA and AutoETS, even though these methods were specifically trained on the data. Additionally, in both use cases, Chronos inference time is up to 100 times faster, and when fine-tuned, Chronos models outperform traditional approaches in both scenarios. For more details on fine-tuning Chronos, refer to Forecasting with Chronos – AutoGluon.

| Use Case | Model | Error (Lower is Better) | Prediction Time (seconds) | Training Time (seconds) |
| --- | --- | --- | --- | --- |
| Deutsche Bahn test use case 1 | AutoARIMA | 0.202 | 40 | n/a |
| Deutsche Bahn test use case 1 | AutoETS | 0.2 | 9.1 | n/a |
| Deutsche Bahn test use case 1 | Chronos Bolt Small (Zero Shot) | 0.195 | 0.4 | n/a |
| Deutsche Bahn test use case 1 | Chronos Bolt Base (Zero Shot) | 0.198 | 0.6 | n/a |
| Deutsche Bahn test use case 1 | Chronos Bolt Small (Fine-Tuned) | 0.181 | 0.4 | 650 |
| Deutsche Bahn test use case 1 | Chronos Bolt Base (Fine-Tuned) | 0.186 | 0.6 | 1328 |
| Deutsche Bahn test use case 2 | AutoARIMA | 0.13 | 100 | n/a |
| Deutsche Bahn test use case 2 | AutoETS | 0.136 | 18 | n/a |
| Deutsche Bahn test use case 2 | Chronos Bolt Small (Zero Shot) | 0.197 | 0.7 | n/a |
| Deutsche Bahn test use case 2 | Chronos Bolt Base (Zero Shot) | 0.185 | 1.2 | n/a |
| Deutsche Bahn test use case 2 | Chronos Bolt Small (Fine-Tuned) | 0.134 | 0.7 | 1012 |
| Deutsche Bahn test use case 2 | Chronos Bolt Base (Fine-Tuned) | 0.127 | 1.2 | 1893 |

Error is measured in SMAPE. Fine-tuning was stopped after 10,000 steps.
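
For reference, a common definition of the symmetric mean absolute percentage error (SMAPE) used above is

SMAPE = \frac{1}{n} \sum_{t=1}^{n} \frac{2\,\lvert F_t - A_t \rvert}{\lvert A_t \rvert + \lvert F_t \rvert}

where A_t is the observed value, F_t the forecast, and n the number of forecasted time steps; some implementations multiply the result by 100 to express it as a percentage.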

Based on the successful prototype, Deutsche Bahn is developing a company-wide forecasting service accessible to all DB business units, supporting different forecasting scenarios. Importantly, this will democratize the usage of forecasting across the organization. Previously resource-constrained teams are now empowered to generate their own forecasts, and forecast preparation time can be reduced from weeks to hours.

Example use case

Let’s walk through a practical example of using Chronos-Bolt with Amazon Bedrock Marketplace. We will forecast passenger capacity utilization at German long-distance and regional train stations using publicly available data.

Prerequisites

For this, you will use the AWS SDK for Python (Boto3) to programmatically interact with Amazon Bedrock. As prerequisites, you need to have the Python libraries boto3, pandas, and matplotlib installed. In addition, configure a connection to an AWS account such that Boto3 can use Amazon Bedrock. For more information on how to set up Boto3, refer to Quickstart – Boto3. If you are using Python inside an Amazon SageMaker notebook, the necessary packages are already installed.

Forecast passenger capacity

First, load the data with the historical passenger capacity utilization. For this example, focus on train station 239:

import pandas as pd

# Load data
df = pd.read_csv(
    "https://mobilithek.info/mdp-api/files/aux/573351169210855424/benchmark_personenauslastung_bahnhoefe_training.csv"
)
df_train_station = df[df["train_station"] == 239].reset_index(drop=True)

Next, deploy an endpoint on Amazon Bedrock Marketplace containing Chronos-Bolt. This endpoint acts as a hosted service, meaning that it can receive requests containing time series data and return forecasts in response.

Amazon Bedrock will assume an AWS Identity and Access Management (IAM) role to provision the endpoint. Modify the following code to reference your role. For a tutorial on creating an execution role, refer to How to use SageMaker AI execution roles. 

import boto3
import time

def describe_endpoint(bedrock_client, endpoint_arn):
    return bedrock_client.get_marketplace_model_endpoint(endpointArn=endpoint_arn)[
        "marketplaceModelEndpoint"
    ]

def wait_for_endpoint(bedrock_client, endpoint_arn):
    endpoint = describe_endpoint(bedrock_client, endpoint_arn)
    while endpoint["endpointStatus"] in ["Creating", "Updating"]:
        print(
            f"Endpoint {endpoint_arn} status is still {endpoint['endpointStatus']}."
            "Waiting 10 seconds before continuing..."
        )
        time.sleep(10)
        endpoint = describe_endpoint(bedrock_client, endpoint_arn)
    print(f"Endpoint status: {endpoint['endpointStatus']}")

bedrock_client = boto3.client(service_name="bedrock")
region_name = bedrock_client.meta.region_name
executionRole = "arn:aws:iam::account-id:role/ExecutionRole" # Change to your role

# Deploy Endpoint
body = {
    "modelSourceIdentifier": f"arn:aws:sagemaker:{region_name}:aws:hub-content/SageMakerPublicHub/Model/autogluon-forecasting-chronos-bolt-base/2.0.0",
    "endpointConfig": {
        "sageMaker": {
            "initialInstanceCount": 1,
            "instanceType": "ml.m5.xlarge",
            "executionRole": executionRole,
        }
    },
    "endpointName": "brmp-chronos-endpoint",
    "acceptEula": True,
}
response = bedrock_client.create_marketplace_model_endpoint(**body)
endpoint_arn = response["marketplaceModelEndpoint"]["endpointArn"]

# Wait until the endpoint is created. This will take a few minutes.
wait_for_endpoint(bedrock_client, endpoint_arn)

Then, invoke the endpoint to make a forecast. Send a payload to the endpoint, which includes historical time series values and configuration parameters, such as the prediction length and quantile levels. The endpoint processes this input and returns a response containing the forecasted values based on the provided data.

import json

# Query endpoint
bedrock_runtime_client = boto3.client(service_name="bedrock-runtime")
body = json.dumps(
    {
        "inputs": [
            {"target": df_train_station["capacity"].values.tolist()},
        ],
        "parameters": {
            "prediction_length": 64,
            "quantile_levels": [0.1, 0.5, 0.9],
        }
    }
)
response = bedrock_runtime_client.invoke_model(modelId=endpoint_arn, body=body)
response_body = json.loads(response["body"].read())  

Now you can visualize the forecasts generated by Chronos-Bolt.

import matplotlib.pyplot as plt

# Plot forecast
forecast_index = range(len(df_train_station), len(df_train_station) + 64)
low = response_body["predictions"][0]["0.1"]
median = response_body["predictions"][0]["0.5"]
high = response_body["predictions"][0]["0.9"]

plt.figure(figsize=(8, 4))
plt.plot(df_train_station["capacity"], color="royalblue", label="historical data")
plt.plot(forecast_index, median, color="tomato", label="median forecast")
plt.fill_between(
    forecast_index,
    low,
    high,
    color="tomato",
    alpha=0.3,
    label="80% prediction interval",
)
plt.legend(loc='upper left')
plt.grid()
plt.show()

The following figure shows the output.

Image: Plot of the predictions

As we can see on the right-hand side of the preceding graph in red, the model is able to pick up the pattern that we can visually recognize on the left part of the plot (in blue). The Chronos model predicts a steep decline followed by two smaller spikes. It is worth highlighting that the model successfully predicted this pattern using zero-shot inference, that is, without being trained on the data. Going back to the original prediction task, we can infer that this particular train station is underutilized on weekends.

Clean up

To avoid incurring unnecessary costs, use the following code to delete the model endpoint:

from botocore.exceptions import ClientError

bedrock_client.delete_marketplace_model_endpoint(endpointArn=endpoint_arn)

# Confirm that the endpoint is deleted
time.sleep(5)
try:
    endpoint = describe_endpoint(bedrock_client, endpoint_arn=endpoint_arn)
    print(endpoint["endpointStatus"])
except ClientError as err:
    assert err.response["Error"]["Code"] == "ResourceNotFoundException"
    print(f"Confirmed that endpoint {endpoint_arn} was deleted")

Conclusion

The Chronos family of models, particularly the new Chronos-Bolt model, represents a significant advancement in making accurate time series forecasting accessible. Through the simple deployment options with Amazon Bedrock Marketplace and SageMaker JumpStart, organizations can now implement sophisticated forecasting solutions in hours rather than weeks, while achieving state-of-the-art accuracy.

Whether you’re forecasting retail demand, optimizing operations, or planning resource allocation, Chronos models provide a powerful and efficient solution that can scale with your needs.


About the authors

Kilian Zimmerer is an AI and DevOps Engineer at DB Systel GmbH in Berlin. With his expertise in state-of-the-art machine learning and deep learning, alongside DevOps infrastructure management, he drives projects, defines their technical vision, and supports their successful implementation within Deutsche Bahn.

Daniel Ringler is a software engineer specializing in machine learning at DB Systel GmbH in Berlin. In addition to his professional work, he is a volunteer organizer for PyData Berlin, contributing to the local data science and Python programming community.

Pedro Eduardo Mercado Lopez is an Applied Scientist at Amazon Web Services, where he works on time series forecasting for labor planning and capacity planning with a focus on hierarchical time series and foundation models. He received a PhD from Saarland University, Germany, doing research in spectral clustering for signed and multilayer graphs.

Simeon Brüggenjürgen is a Solutions Architect at Amazon Web Services based in Munich, Germany. With a background in Machine Learning research, Simeon supported Deutsche Bahn on this project.

John Liu has 15 years of experience as a product executive and 9 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols, fintech companies and also spent 9 years as a portfolio manager at various hedge funds.

Michael Bohlke-Schneider is an Applied Science Manager at Amazon Web Services. At AWS, Michael works on machine learning and forecasting, with a focus on foundation models for structured data and AutoML. He received his PhD from the Technical University Berlin, where he worked on protein structure prediction.

Florian Saupe is a Principal Technical Product Manager at AWS AI/ML research supporting science teams like the graph machine learning group, and ML Systems teams working on large scale distributed training, inference, and fault resilience. Before joining AWS, Florian led technical product management for automated driving at Bosch, was a strategy consultant at McKinsey & Company, and worked as a control systems and robotics scientist—a field in which he holds a PhD.

Use custom metrics to evaluate your generative AI application with Amazon Bedrock

With Amazon Bedrock Evaluations, you can evaluate foundation models (FMs) and Retrieval Augmented Generation (RAG) systems, whether they are hosted on Amazon Bedrock (including Amazon Bedrock Knowledge Bases) or elsewhere, such as multi-cloud and on-premises deployments. We recently announced the general availability of the large language model (LLM)-as-a-judge technique in model evaluation and the new RAG evaluation tool, also powered by an LLM-as-a-judge behind the scenes. These tools are already empowering organizations to systematically evaluate FMs and RAG systems with enterprise-grade tooling. And because of the bring your own inference (BYOI) responses feature, these evaluations aren’t limited to models or RAG systems hosted on Amazon Bedrock; you can evaluate any model or application as long as its responses follow the input formatting requirements for either offering.

The LLM-as-a-judge technique powering these evaluations enables automated, human-like evaluation quality at scale, using FMs to assess quality and responsible AI dimensions without manual intervention. With built-in metrics like correctness (factual accuracy), completeness (response thoroughness), faithfulness (hallucination detection), and responsible AI metrics such as harmfulness and answer refusal, you and your team can evaluate models hosted on Amazon Bedrock and knowledge bases natively, or using BYOI responses from your custom-built systems.

Amazon Bedrock Evaluations offers an extensive list of built-in metrics for both evaluation tools, but there are times when you might want to define these evaluation metrics in a different way, or make completely new metrics that are relevant to your use case. For example, you might want to define a metric that evaluates an application response’s adherence to your specific brand voice, or want to classify responses according to a custom categorical rubric. You might want to use numerical scoring or categorical scoring for various purposes. For these reasons, you need a way to use custom metrics in your evaluations.

Now with Amazon Bedrock, you can develop custom evaluation metrics for both model and RAG evaluations. This capability extends the LLM-as-a-judge framework that drives Amazon Bedrock Evaluations.

In this post, we demonstrate how to use custom metrics in Amazon Bedrock Evaluations to measure and improve the performance of your generative AI applications according to your specific business requirements and evaluation criteria.

Overview

Custom metrics in Amazon Bedrock Evaluations offer the following features:

  • Simplified getting started experience – Pre-built starter templates are available on the AWS Management Console based on our industry-tested built-in metrics, with options to create from scratch for specific evaluation criteria.
  • Flexible scoring systems – Support is available for both quantitative (numerical) and qualitative (categorical) scoring to create ordinal metrics, nominal metrics, or even use evaluation tools for classification tasks.
  • Streamlined workflow management – You can save custom metrics for reuse across multiple evaluation jobs or import previously defined metrics from JSON files.
  • Dynamic content integration – With built-in template variables (for example, {{prompt}}, {{prediction}}, and {{context}}), you can seamlessly inject dataset content and model outputs into evaluation prompts.
  • Customizable output control – You can use our recommended output schema for consistent results, with advanced options to define custom output formats for specialized use cases.

Custom metrics give you unprecedented control over how you measure AI system performance, so you can align evaluations with your specific business requirements and use cases. Whether assessing factuality, coherence, helpfulness, or domain-specific criteria, custom metrics in Amazon Bedrock enable more meaningful and actionable evaluation insights.

In the following sections, we walk through the steps to create a job with model evaluation and custom metrics using both the Amazon Bedrock console and the Python SDK and APIs.

Supported data formats

In this section, we review some important data formats.

Judge prompt uploading

To upload your previously saved custom metrics into an evaluation job, follow the JSON format in the following examples.

The following code illustrates a definition with numerical scale:

{
    "customMetricDefinition": {
        "metricName": "my_custom_metric",
        "instructions": "Your complete custom metric prompt including at least one {{input variable}}",
        "ratingScale": [
            {
                "definition": "first rating definition",
                "value": {
                    "floatValue": 3
                }
            },
            {
                "definition": "second rating definition",
                "value": {
                    "floatValue": 2
                }
            },
            {
                "definition": "third rating definition",
                "value": {
                    "floatValue": 1
                }
            }
        ]
    }
}

The following code illustrates a definition with string scale:

{
    "customMetricDefinition": {
        "metricName": "my_custom_metric",
        "instructions": "Your complete custom metric prompt including at least one {{input variable}}",
        "ratingScale": [
            {
                "definition": "first rating definition",
                "value": {
                    "stringValue": "first value"
                }
            },
            {
                "definition": "second rating definition",
                "value": {
                    "stringValue": "second value"
                }
            },
            {
                "definition": "third rating definition",
                "value": {
                    "stringValue": "third value"
                }
            }
        ]
    }
}

The following code illustrates a definition with no scale:

{
    "customMetricDefinition": {
        "metricName": "my_custom_metric",
        "instructions": "Your complete custom metric prompt including at least one {{input variable}}"
    }
}

For more information on defining a judge prompt with no scale, see the best practices section later in this post.

Model evaluation dataset format

When using LLM-as-a-judge, only one model can be evaluated per evaluation job. Consequently, you must provide a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLM-as-a-judge in model evaluation. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
     "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}
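
For illustration, a single record in this JSONL format for a BYOI job might look like the following (the prompt, responses, and model identifier are hypothetical):

{"prompt": "What is the capital of France?", "referenceResponse": "The capital of France is Paris.", "category": "Geography", "modelResponses": [{"response": "Paris is the capital of France.", "modelIdentifier": "my-model-v1"}]}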

RAG evaluation dataset format

We updated the evaluation job input dataset format to be even more flexible for RAG evaluation. Now, you can bring referenceContexts, which are expected retrieved passages, so you can compare your actual retrieved contexts to your expected retrieved contexts. You can find the new referenceContexts field in the updated JSONL schema for RAG evaluation:

{
    "conversationTurns": [{
        "prompt": {
            "content": [{
                "text": string
            }]
        },
        "referenceResponses"?: [{
            "content": [{
                "text": string
            }]
        }],
        "referenceContexts"?: [{
            "content": [{
                "text": string
            }]
        }],
        "output": {
            "text": string,
            "modelIdentifier"?: string,
            "knowledgeBaseIdentifier": string,
            "retrievedPassages": {
                "retrievalResults": [{
                    "name"?: string,
                    "content": {
                        "text": string
                    },
                    "metadata"?: {
                        [key: string]: string
                    }
                }]
            }
        }
    }]
}
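
As an illustration, a single retrieve-and-generate record following this schema might look like the following. It is shown pretty-printed here for readability, but in the JSONL dataset each record occupies one line, and all values (including the knowledge base identifier) are hypothetical.

{
    "conversationTurns": [{
        "prompt": {
            "content": [{"text": "How do I reset my router?"}]
        },
        "referenceResponses": [{
            "content": [{"text": "Hold the reset button for 10 seconds until the lights blink."}]
        }],
        "referenceContexts": [{
            "content": [{"text": "Router manual, section 4: resetting the device."}]
        }],
        "output": {
            "text": "Press and hold the reset button for about 10 seconds until the lights blink.",
            "knowledgeBaseIdentifier": "my-knowledge-base-id",
            "retrievedPassages": {
                "retrievalResults": [{
                    "content": {"text": "To reset the router, hold the reset button for 10 seconds."},
                    "metadata": {"source": "router-manual.pdf"}
                }]
            }
        }
    }]
}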

Variables for data injection into judge prompts

To make sure that your data is injected into the judge prompts in the right place, use the variables from the following tables. We have also included a guide to show you where the evaluation tool will pull data from your input file, if applicable. If you bring your own inference responses to the evaluation job, we will use that data from your input file; if you don’t bring your own inference responses, we will call the Amazon Bedrock model or knowledge base and prepare the responses for you.

The following table summarizes the variables for model evaluation.

| Plain Name | Variable | Input Dataset JSONL Key | Mandatory or Optional |
| --- | --- | --- | --- |
| Prompt | {{prompt}} | prompt | Optional |
| Response | {{prediction}} | For a BYOI job: modelResponses.response. If you don’t bring your own inference responses, the evaluation job will call the model and prepare this data for you. | Mandatory |
| Ground truth response | {{ground_truth}} | referenceResponse | Optional |

The following table summarizes the variables for RAG evaluation (retrieve only).

| Plain Name | Variable | Input Dataset JSONL Key | Mandatory or Optional |
| --- | --- | --- | --- |
| Prompt | {{prompt}} | prompt | Optional |
| Ground truth response | {{ground_truth}} | For a BYOI job: output.retrievedResults.retrievalResults. If you don’t bring your own inference responses, the evaluation job will call the Amazon Bedrock knowledge base and prepare this data for you. | Optional |
| Retrieved passage | {{context}} | For a BYOI job: output.retrievedResults.retrievalResults. If you don’t bring your own inference responses, the evaluation job will call the Amazon Bedrock knowledge base and prepare this data for you. | Mandatory |
| Ground truth retrieved passage | {{reference_contexts}} | referenceContexts | Optional |

The following table summarizes the variables for RAG evaluation (retrieve and generate).

| Plain Name | Variable | Input Dataset JSONL Key | Mandatory or Optional |
| --- | --- | --- | --- |
| Prompt | {{prompt}} | prompt | Optional |
| Response | {{prediction}} | For a BYOI job: output.text. If you don’t bring your own inference responses, the evaluation job will call the Amazon Bedrock knowledge base and prepare this data for you. | Mandatory |
| Ground truth response | {{ground_truth}} | referenceResponses | Optional |
| Retrieved passage | {{context}} | For a BYOI job: output.retrievedResults.retrievalResults. If you don’t bring your own inference responses, the evaluation job will call the Amazon Bedrock knowledge base and prepare this data for you. | Optional |
| Ground truth retrieved passage | {{reference_contexts}} | referenceContexts | Optional |

Prerequisites

To use the LLM-as-a-judge model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

Create a model evaluation job with custom metrics using Amazon Bedrock Evaluations

Complete the following steps to create a job with model evaluation and custom metrics using Amazon Bedrock Evaluations:

  1. On the Amazon Bedrock console, choose Evaluations in the navigation pane and choose the Models tab.
  2. In the Model evaluation section, on the Create dropdown menu, choose Automatic: model as a judge.
  3. For the Model evaluation details, enter an evaluation name and optional description.
  4. For Evaluator model, choose the model you want to use for automatic evaluation.
  5. For Inference source, select the source and choose the model you want to evaluate.

For this example, we chose Claude 3.5 Sonnet as the evaluator model, Bedrock models as our inference source, and Claude 3.5 Haiku as our model to evaluate.

  6. The console will display the default metrics for the evaluator model you chose. You can select other metrics as needed.
  7. In the Custom Metrics section, we create a new metric called “Comprehensiveness.” Use the template provided and modify it based on your metric. You can use the following variables to define the metric, where only {{prediction}} is mandatory:
    1. prompt
    2. prediction
    3. ground_truth

The following is the metric we defined in full:

Your role is to judge the comprehensiveness of an answer based on the question and
the prediction. Assess the quality, accuracy, and helpfulness of the language model response,
and use these to judge how comprehensive the response is. Award higher scores to responses
that are detailed and thoughtful.

Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
against all specified criteria. Assign a single overall score that best represents the
comprehensiveness, and provide a brief explanation justifying your rating, referencing
specific strengths and weaknesses observed.

When evaluating the response quality, consider the following rubrics:
- Accuracy: Factual correctness of information provided
- Completeness: Coverage of important aspects of the query
- Clarity: Clear organization and presentation of information
- Helpfulness: Practical utility of the response to the user

Evaluate the following:

Query:
{{prompt}}

Response to evaluate:
{{prediction}}

  8. Create the output schema and additional metrics. Here, we define a scale that provides maximum points (10) if the response is very comprehensive, and 1 if the response is not comprehensive at all.
  9. For Datasets, enter your input and output locations in Amazon S3.
  10. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose a role.
  11. Choose Create and wait for the job to complete.

Considerations and best practices

When using the output schema of the custom metrics, note the following:

  • If you use the built-in output schema (recommended), do not add your grading scale into the main judge prompt. The evaluation service will automatically concatenate your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes. This is so the evaluation service can parse the judge model’s results and display them on the console in graphs and calculate average values of numerical scores.
  • The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics. Because judge LLMs are inherently stochastic, there might be some responses we can’t parse and display on the console and use in your average score calculations. However, the raw judge responses are always loaded into your S3 output file, even if the evaluation service cannot parse the response score from the judge model.
  • If you don’t use the built-in output schema feature (we recommend using it), you are responsible for providing your rating scale in the body of the judge prompt instructions. In that case, the evaluation service will not add structured output instructions and will not parse the results into graphs; you will see the full plaintext judge output on the console without graphs, and the raw data will still be in your S3 bucket.

Create a model evaluation job with custom metrics using the Python SDK and APIs

To use the Python SDK to create a model evaluation job with custom metrics, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the default metrics and custom metrics evaluator, IAM role with appropriate permissions, Amazon S3 paths for input data containing your inference responses, and output location for results:
    import boto3
    import time
    from datetime import datetime
    
    # Configure model and evaluator settings
    evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    generator_model = "amazon.nova-lite-v1:0"
    custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_BUCKET_NAME>"
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
    output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
    
    # Create Bedrock client
    # NOTE: You can change the region name to the region of your choosing.
    bedrock_client = boto3.client('bedrock', region_name='us-east-1') 

  2. To define a custom metric for model evaluation, create a JSON structure with a customMetricDefinition object. Include your metric’s name, write detailed evaluation instructions incorporating template variables (such as {{prompt}} and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate model outputs consistently according to your specific criteria.
    comprehensiveness_metric ={
        "customMetricDefinition": {
            "name": "comprehensiveness",
            "instructions": """Your role is to judge the comprehensiveness of an 
    answer based on the question and the prediction. Assess the quality, accuracy, 
    and helpfulness of the language model response, and use these to judge how comprehensive
     the response is. Award higher scores to responses that are detailed and thoughtful.
    
    Carefully evaluate the comprehensiveness of the LLM response for the given query (prompt)
     against all specified criteria. Assign a single overall score that best represents the 
    comprehensiveness, and provide a brief explanation justifying your rating, referencing 
    specific strengths and weaknesses observed.
    
    When evaluating the response quality, consider the following rubrics:
    - Accuracy: Factual correctness of information provided
    - Completeness: Coverage of important aspects of the query
    - Clarity: Clear organization and presentation of information
    - Helpfulness: Practical utility of the response to the user
    
    Evaluate the following:
    
    Query:
    {{prompt}}
    
    Response to evaluate:
    {{prediction}}""",
            "ratingScale": [
                {
                    "definition": "Very comprehensive",
                    "value": {
                        "floatValue": 10
                    }
                },
                {
                    "definition": "Mildly comprehensive",
                    "value": {
                        "floatValue": 3
                    }
                },
                {
                    "definition": "Not at all comprehensive",
                    "value": {
                        "floatValue": 1
                    }
                }
            ]
        }
    }

  3. To create a model evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (such as Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your generator model, evaluator model, and proper Amazon S3 paths for input dataset and output results.
    # Create the model evaluation job
    model_eval_job_name = f"model-evaluation-custom-metrics{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    model_eval_job = bedrock_client.create_evaluation_job(
        jobName=model_eval_job_name,
        jobDescription="Evaluate model performance with custom comprehensiveness metric",
        roleArn=role_arn,
        applicationType="ModelEvaluation",
        inferenceConfig={
            "models": [{
                "bedrockModel": {
                    "modelIdentifier": generator_model
                }
            }]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {
                        "name": "ModelEvalDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Coherence",
                        "Builtin.Relevance",
                        "Builtin.FollowingInstructions",
                        "comprehensiveness"
                    ]
                }],
                "customMetricConfig": {
                    "customMetrics": [
                        comprehensiveness_metric
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [{
                            "modelIdentifier": custom_metrics_evaluator_model
                        }]
                    }
                },
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )
    
    print(f"Created model evaluation job: {model_eval_job_name}")
    print(f"Job ID: {model_eval_job['jobArn']}")

  4. After submitting the evaluation job, monitor its status with get_evaluation_job and access results at your specified Amazon S3 location when complete, including the standard and custom metric performance data.
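
As a rough sketch of that monitoring step, the following polling loop assumes the bedrock_client and model_eval_job variables from the previous steps and treats any status other than InProgress as terminal:

import time

job_arn = model_eval_job["jobArn"]
while True:
    # get_evaluation_job returns the current status of the evaluation job
    status = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)["status"]
    print(f"Evaluation job status: {status}")
    if status != "InProgress":
        break
    time.sleep(60)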

Create a RAG system evaluation with custom metrics using Amazon Bedrock Evaluations

In this example, we walk through a RAG system evaluation with a combination of built-in metrics and custom evaluation metrics on the Amazon Bedrock console. Complete the following steps:

  1. On the Amazon Bedrock console, choose Evaluations in the navigation pane.
  2. On the RAG tab, choose Create.
  3. For the RAG evaluation details, enter an evaluation name and optional description.
  4. For Evaluator model, choose the model you want to use for automatic evaluation. The evaluator model selected here will be used to calculate default metrics if selected. For this example, we chose Claude 3.5 Sonnet as the evaluator model.
  5. Include any optional tags.
  6. For Inference source, select the source. Here, you have the option to select between Bedrock Knowledge Bases and Bring your own inference responses. If you’re using Amazon Bedrock Knowledge Bases, you will need to choose a previously created knowledge base or create a new one. For BYOI responses, you can bring the prompt dataset, context, and output from a RAG system. For this example, we chose Bedrock Knowledge Base as our inference source.
  7. Specify the evaluation type, response generator model, and built-in metrics. You can choose between a combined retrieval and response evaluation or a retrieval only evaluation, with options to use default metrics, custom metrics, or both for your RAG evaluation. The response generator model is only required when using an Amazon Bedrock knowledge base as the inference source. For the BYOI configuration, you can proceed without a response generator. For this example, we selected Retrieval and response generation as our evaluation type and chose Nova Lite 1.0 as our response generator model.
  8. In the Custom Metrics section, choose your evaluator model. We selected Claude 3.5 Sonnet v1 as our evaluator model for custom metrics.
  9. Choose Add custom metrics.
  10. Create your new metric. For this example, we create a new custom metric for our RAG evaluation called information_comprehensiveness. This metric evaluates how thoroughly and completely the response addresses the query by using the retrieved information. It measures the extent to which the response extracts and incorporates relevant information from the retrieved passages to provide a comprehensive answer.
  11. You can choose between importing a JSON file, using a preconfigured template, or creating a custom metric with full configuration control. For example, you can select the preconfigured templates for the default metrics and change the scoring system or rubric. For our information_comprehensiveness metric, we select the custom option, which allows us to input our evaluator prompt directly.
  12. For Instructions, enter your prompt. For example:
    Your role is to evaluate how comprehensively the response addresses the query
    using the retrieved information. Assess whether the response provides a thorough
    treatment of the subject by effectively utilizing the available retrieved passages.

    Carefully evaluate the comprehensiveness of the RAG response for the given query
    against all specified criteria. Assign a single overall score that best represents
    the comprehensiveness, and provide a brief explanation justifying your rating,
    referencing specific strengths and weaknesses observed.

    When evaluating response comprehensiveness, consider the following rubrics:
    - Coverage: Does the response utilize the key relevant information from the
      retrieved passages?
    - Depth: Does the response provide sufficient detail on important aspects from
      the retrieved information?
    - Context utilization: How effectively does the response leverage the available
      retrieved passages?
    - Information synthesis: Does the response combine retrieved information to
      create a thorough treatment?

    Evaluate the following:

    Query: {{prompt}}

    Retrieved passages: {{context}}

    Response to evaluate: {{prediction}}

  13. Enter your output schema to define how the custom metric results will be structured, visualized, normalized (if applicable), and explained by the model. (A short sketch of a rating scale appears after this procedure.)

If you use the built-in output schema (recommended), do not add your rating scale into the main judge prompt. The evaluation service will automatically concatenate your judge prompt instructions with your defined output schema rating scale and some structured output instructions (unique to each judge model) behind the scenes so that your judge model results can be parsed. The fully concatenated judge prompts are visible in the Preview window if you are using the Amazon Bedrock console to construct your custom metrics.

  14. For Dataset and evaluation results S3 location, enter your input and output locations in Amazon S3.
  15. For Amazon Bedrock IAM role – Permissions, select Use an existing service role and choose your role.
  16. Choose Create and wait for the job to complete.
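
As a reference for step 13, the rating scale portion of the output schema pairs descriptive labels with either numerical scores or categorical labels. The following is a minimal sketch that mirrors the ratingScale format used in the SDK example later in this post; treat it as an illustration rather than the console's exact import format:

    # Sketch of a rating scale: descriptive labels mapped to numerical scores.
    # Categorical labels would use "stringValue" instead of "floatValue".
    rating_scale = [
        {"definition": "Very comprehensive", "value": {"floatValue": 3}},
        {"definition": "Moderately comprehensive", "value": {"floatValue": 2}},
        {"definition": "Minimally comprehensive", "value": {"floatValue": 1}},
        {"definition": "Not at all comprehensive", "value": {"floatValue": 0}},
    ]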

Start a RAG evaluation job with custom metrics using the Python SDK and APIs

To use the Python SDK to create a RAG evaluation job with custom metrics, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include the model identifiers for the default metrics evaluator and the custom metrics evaluator, an IAM role with appropriate permissions, your knowledge base ID, the Amazon S3 path for your input dataset, and the output location for results:
    import boto3
    import time
    from datetime import datetime
    
    # Configure knowledge base and model settings
    knowledge_base_id = "<YOUR_KB_ID>"
    evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    generator_model = "amazon.nova-lite-v1:0"
    custom_metrics_evaluator_model = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_BUCKET_NAME>"
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/evaluation_data/input.jsonl"
    output_path = f"s3://{BUCKET_NAME}/evaluation_output/"
    
    # Configure retrieval settings
    num_results = 10
    search_type = "HYBRID"
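    # NOTE: search_type is defined here for reference; the example evaluation job below
    # only passes numberOfResults in its vectorSearchConfiguration.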
    
    # Create Bedrock client
    # NOTE: You can change the region name to the region of your choosing
    bedrock_client = boto3.client('bedrock', region_name='us-east-1') 

  2. To define a custom metric for RAG evaluation, create a JSON structure with a customMetricDefinition object. Include your metric’s name, write detailed evaluation instructions incorporating template variables (such as {{prompt}}, {{context}}, and {{prediction}}), and define your ratingScale array with assessment values using either numerical scores (floatValue) or categorical labels (stringValue). This properly formatted JSON schema enables Amazon Bedrock to evaluate responses consistently according to your specific criteria.
    # Define our custom information_comprehensiveness metric
    information_comprehensiveness_metric = {
        "customMetricDefinition": {
            "name": "information_comprehensiveness",
            "instructions": """
            Your role is to evaluate how comprehensively the response addresses the 
    query using the retrieved information. 
            Assess whether the response provides a thorough treatment of the subject
    by effectively utilizing the available retrieved passages.
    
    Carefully evaluate the comprehensiveness of the RAG response for the given query
    against all specified criteria. 
    Assign a single overall score that best represents the comprehensiveness, and 
    provide a brief explanation justifying your rating, referencing specific strengths
    and weaknesses observed.
    
    When evaluating response comprehensiveness, consider the following rubrics:
    - Coverage: Does the response utilize the key relevant information from the 
    retrieved passages?
    - Depth: Does the response provide sufficient detail on important aspects from 
    the retrieved information?
    - Context utilization: How effectively does the response leverage the available 
    retrieved passages?
    - Information synthesis: Does the response combine retrieved information to 
    create a thorough treatment?
    
    Evaluate using the following:
    
    Query: {{prompt}}
    
    Retrieved passages: {{context}}
    
    Response to evaluate: {{prediction}}
    """,
            "ratingScale": [
                {
                    "definition": "Very comprehensive",
                    "value": {
                        "floatValue": 3
                    }
                },
                {
                    "definition": "Moderately comprehensive",
                    "value": {
                        "floatValue": 2
                    }
                },
                {
                    "definition": "Minimally comprehensive",
                    "value": {
                        "floatValue": 1
                    }
                },
                {
                    "definition": "Not at all comprehensive",
                    "value": {
                        "floatValue": 0
                    }
                }
            ]
        }
    }

  3. To create a RAG evaluation job with custom metrics, use the create_evaluation_job API and include your custom metric in the customMetricConfig section, specifying both built-in metrics (such as Builtin.Correctness) and your custom metric in the metricNames array. Configure the job with your knowledge base ID, generator model, evaluator model, and the Amazon S3 paths for the input dataset and output results.
    # Create the evaluation job
    retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Evaluate retrieval and generation with custom metric",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [{
                "knowledgeBaseConfig": {
                    "retrieveAndGenerateConfig": {
                        "type": "KNOWLEDGE_BASE",
                        "knowledgeBaseConfiguration": {
                            "knowledgeBaseId": knowledge_base_id,
                            "modelArn": generator_model,
                            "retrievalConfiguration": {
                                "vectorSearchConfiguration": {
                                    "numberOfResults": num_results
                                }
                            }
                        }
                    }
                }
            }]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "General",
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Helpfulness",
                        "information_comprehensiveness"
                    ]
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                },
                "customMetricConfig": {
                    "customMetrics": [
                        information_comprehensiveness_metric
                    ],
                    "evaluatorModelConfig": {
                        "bedrockEvaluatorModels": [{
                            "modelIdentifier": custom_metrics_evaluator_model
                        }]
                    }
                }
            }
        }
    )
    
    print(f"Created evaluation job: {retrieve_generate_job_name}")
    print(f"Job ID: {retrieve_generate_job['jobArn']}")

  4. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output is stored at the Amazon S3 location specified in the output_path parameter and contains detailed metrics on how your RAG system performed across the evaluation dimensions, including your custom metrics.
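
The following is a minimal sketch of such a status check; it assumes the job ARN returned by the create_evaluation_job call above, and the terminal status values shown are assumptions to verify against the current Amazon Bedrock documentation:

    # Poll the evaluation job until it reaches a terminal state (adjust the wait as needed)
    job_arn = retrieve_generate_job['jobArn']

    while True:
        job = bedrock_client.get_evaluation_job(jobIdentifier=job_arn)
        status = job['status']
        print(f"Job status: {status}")
        # Assumed terminal states; check the documentation for the full list
        if status in ("Completed", "Failed", "Stopped"):
            break
        time.sleep(60)

    # The detailed results, including the custom metric scores, are written under output_path
    print(f"Results available at: {output_path}")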

Custom metrics are only available for LLM-as-a-judge. At the time of writing, we don’t accept custom AWS Lambda functions or endpoints for code-based custom metric evaluators. Human-based model evaluation has supported custom metric definition since its launch in November 2023.

Clean up

To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.
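
For example, a minimal sketch of removing the evaluation bucket with boto3 might look like the following; it assumes the bucket from the setup above is unversioned and its contents are no longer needed:

    import boto3

    # Empty and delete the evaluation bucket (irreversible; double-check the bucket name first)
    s3 = boto3.resource("s3")
    bucket = s3.Bucket("<YOUR_BUCKET_NAME>")
    bucket.objects.all().delete()  # remove all objects in the bucket
    bucket.delete()                # delete the now-empty bucket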

Conclusion

The addition of custom metrics to Amazon Bedrock Evaluations empowers organizations to define their own evaluation criteria for generative AI systems. By extending the LLM-as-a-judge framework with custom metrics, businesses can now measure what matters for their specific use cases alongside built-in metrics. With support for both numerical and categorical scoring systems, these custom metrics enable consistent assessment aligned with organizational standards and goals.

As generative AI becomes increasingly integrated into business processes, the ability to evaluate outputs against custom-defined criteria is essential for maintaining quality and driving continuous improvement. We encourage you to explore these new capabilities through the Amazon Bedrock console and API examples provided, and discover how personalized evaluation frameworks can enhance your AI systems’ performance and business impact.


About the Authors

Shreyas Subramanian is a Principal Data Scientist who helps customers use generative AI and deep learning to solve their business challenges with AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Read More

PyTorch Foundation Launches New Website

Welcome to the new PyTorch Foundation website! We’re delighted to introduce this fresh, centralized hub for information on open source AI and the PyTorch Foundation. As the community and ecosystem around PyTorch continue to grow, the new platform is designed to serve as the cornerstone for our global community to converge around PyTorch, fostering open collaboration and innovation in AI.

Why a New Website?

To better support the dynamic and diverse PyTorch community, we recognized the need for a dedicated space to:

  • Highlight the Foundation’s Vision: Share our mission and initiatives as a neutral steward of the PyTorch ecosystem.
  • Provide Centralized Resources: Offer easy access to news, events, governance updates, and community resources.
  • Engage the Community: Create pathways for collaboration and contribution, including updates on technical topics, ecosystem tools, events, etc.
  • Celebrate Impact: Showcase the incredible innovations powered by PyTorch across research, industry, and beyond.
  • Improve Navigability: Enable you to find what you need quickly and efficiently.
  • Refresh our Look: Update our site to reflect a more modern and sleek design.

What You’ll Find on the New Website

The new site is designed to be both intuitive and comprehensive. Here’s a breakdown of the top-level navigation:

About

Learn about the organization’s mission, members, and impact, meet the leaders guiding the PyTorch Foundation’s direction, and reach out with inquiries or collaboration opportunities.

Learn

Find step-by-step guides to help you begin your journey with PyTorch, hands-on tutorials to build practical skills and understanding, and an overview of how to get started with the PyTorch Foundation.

Community

Contribute to and collaborate within the growing PyTorch ecosystem. Explore tools and libraries that extend PyTorch functionality and connect with fellow developers, researchers, and users.

Projects

The PyTorch Foundation will host a range of diverse, high-quality AI and ML projects beyond PyTorch Core, and these projects will be listed here.

Docs

Find core documentation for the PyTorch framework and explore domain-specific applications and resources. Note: We moved pytorch.org/docs to docs.pytorch.org and implemented redirects.

Blog & News

Get the latest news from the PyTorch Foundation, including technical deep dives on the blog. Stay up-to-date with information about upcoming conferences, meetups, and webinars, and subscribe to the PyTorch Foundation newsletter.

Join PyTorch Foundation

For organizations interested in becoming a member of the PyTorch Foundation, find out how to get involved and support the Foundation’s growth.

This intuitive navigation ensures you can quickly find the resources and information you need on our website.

Designed with You in Mind

Whether you’re a researcher, developer, educator, or industry professional, the website is built to meet your needs. We’ve focused on simplicity, accessibility, and discoverability to make it easier than ever to:

  • Find the latest updates about the PyTorch Foundation’s work.
  • Navigate key resources to enhance your AI and ML projects.
  • Connect with a vibrant and collaborative community.

We’re eager to hear your feedback—what works, what doesn’t, and what you’d love to see next. Contact us to share your feedback.

Check It Out Today

We invite you to explore the new site at pytorch.org and use it as your go-to resource for all things PyTorch Foundation.

Read More

Your Service Teams Just Got a New Coworker — and It’s a 15B-Parameter Super Genius Built by ServiceNow and NVIDIA

ServiceNow is accelerating enterprise AI with a new reasoning model built in partnership with NVIDIA — enabling AI agents that respond in real time, handle complex workflows and scale functions like IT, HR and customer service teams worldwide.

Unveiled today at ServiceNow’s Knowledge 2025 — where NVIDIA CEO and founder Jensen Huang joined ServiceNow chairman and CEO Bill McDermott during his keynote address — Apriel Nemotron 15B is compact, cost-efficient and tuned for action. It’s designed to drive the next step forward in enterprise large language models (LLMs).

Apriel Nemotron 15B was developed with NVIDIA NeMo, the open NVIDIA Llama Nemotron Post-Training Dataset and ServiceNow domain-specific data, and was trained on NVIDIA DGX Cloud running on Amazon Web Services (AWS).

The news follows the April release of the NVIDIA Llama Nemotron Ultra model, which harnesses the NVIDIA open dataset that ServiceNow used to build its Apriel Nemotron 15B model. Ultra is among the strongest open-source models at reasoning, including scientific reasoning, coding, advanced math and other agentic AI tasks.

Smaller Model, Bigger Impact

Apriel Nemotron 15B is engineered for reasoning — drawing inferences, weighing goals and navigating rules in real time. It’s smaller than some of the latest general-purpose LLMs that can run to more than a trillion parameters, which means it delivers faster responses and lower inference costs, while still packing enterprise-grade intelligence.

The model’s post-training took place on NVIDIA DGX Cloud hosted on AWS, tapping high-performance infrastructure to accelerate development. The result? An AI model that’s optimized not just for accuracy, but for speed, efficiency and scalability — key ingredients for powering AI agents that can support thousands of concurrent enterprise workflows.

A Closed Loop for Continuous Learning

Beyond the model itself, ServiceNow and NVIDIA are introducing a new data flywheel architecture — integrating ServiceNow’s Workflow Data Fabric with NVIDIA NeMo microservices, including NeMo Customizer and NeMo Evaluator.

This setup enables a closed-loop process that refines and improves AI performance by using workflow data to personalize responses and improve accuracy over time. Guardrails ensure customers are in control of how their data is used in a secure and compliant manner.

From Complexity to Clarity

In a keynote demo, ServiceNow is showing how these agentic models have been deployed in real enterprise scenarios, including with AstraZeneca, where AI agents will help employees resolve issues and make decisions with greater speed and precision — giving 90,000 hours back to employees.

“The Apriel Nemotron 15B model — developed by two of the most advanced enterprise AI companies — features purpose-built reasoning to power the next generation of intelligent AI agents,” said Jon Sigler, executive vice president of Platform and AI at ServiceNow. “This achieves what generic models can’t, combining real-time enterprise data, workflow context and advanced reasoning to help AI agents drive real productivity.”

“Together with ServiceNow, we’ve built an efficient, enterprise-ready model to fuel a new class of intelligent AI agents that can reason to boost team productivity,” added Kari Briski, vice president of generative AI software at NVIDIA. “By using the NVIDIA Llama Nemotron Post-Training Dataset and ServiceNow domain-specific data, Apriel Nemotron 15B delivers advanced reasoning capabilities in a smaller size, making it faster, more accurate and cost-effective to run.”

Scaling the AI Agent Era

The collaboration marks a shift in enterprise AI strategy. Enterprises are moving from static models to intelligent systems that evolve. It also marks another milestone in the partnership between ServiceNow and NVIDIA, pushing agentic AI forward across industries.

For businesses, this means faster resolution times, greater productivity and more responsive digital experiences. For technology leaders, it’s a model that fits today’s performance and cost requirements — and can scale as needs grow.

Availability

ServiceNow AI Agents, powered by Apriel Nemotron 15B, are expected to roll out following Knowledge 2025. The model will support ServiceNow’s Now LLM services and will become a key engine behind the company’s agentic AI offerings.

Learn more about the launch and how NVIDIA and ServiceNow are shaping the future of enterprise AI at Knowledge 2025. 

Read More