New NVIDIA Software for Blackwell Infrastructure Runs AI Factories at Light Speed

The industrial age was fueled by steam. The digital age was transformed by software. Now, the AI age is marked by the rise of generative AI, agentic AI and AI reasoning, which enable models to process more data, learn and reason to solve complex problems.

Just as industrial factories transform raw materials into goods, modern businesses require AI factories to quickly transform data into insights that are scalable, accurate and reliable.

Orchestrating this new infrastructure is far more complex than building steam-powered factories ever was. State-of-the-art models demand supercomputing-scale resources, and any downtime risks derailing weeks of progress and reducing GPU utilization.

To enable enterprises and developers to manage and run AI factories at light speed, NVIDIA today announced NVIDIA Mission Control at the NVIDIA GTC global AI conference. It is the only unified operations and orchestration software platform that automates the complex management of AI data centers and workloads.

NVIDIA Mission Control enhances every aspect of AI factory operations. From configuring deployments to validating infrastructure to operating developer workloads, its capabilities help enterprises get frontier models up and running faster.

It is designed to easily transition NVIDIA Blackwell-based systems from pretraining to post-training — and now test-time scaling — with speed and efficiency. The software enables enterprises to easily pivot between training and inference workloads on their Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems, dynamically reallocating cluster resources to match shifting priorities.

In addition, Mission Control includes NVIDIA Run:ai technology to streamline operations and job orchestration for development, training and inference, boosting infrastructure utilization by up to 5x.

Mission Control’s autonomous recovery capabilities, supported by rapid checkpointing and automated tiered restart features, can deliver up to 10x faster job recovery compared with traditional methods that rely on manual intervention, boosting AI training and inference efficiency to keep AI applications in operation.

Built on decades of NVIDIA supercomputing expertise, Mission Control lets enterprises simply run models by minimizing time spent managing AI infrastructure. It automates the lifecycle of AI factory infrastructure for all NVIDIA Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems from Dell Technologies, Hewlett Packard Enterprise (HPE), Lenovo and Supermicro to make advanced AI infrastructure more accessible to the world’s industries.

Enterprises can further simplify and speed deployments of NVIDIA DGX GB300 and DGX B300 systems by using Mission Control with the NVIDIA Instant AI Factory service preconfigured in Equinix AI-ready data centers across 45 markets globally.

Advanced Software Provides Enterprises Uninterrupted Infrastructure Oversight  

Mission Control automates end-to-end infrastructure management — including provisioning, monitoring and error diagnosis — to deliver uninterrupted operations. Plus, it continuously monitors every layer of the application and infrastructure stack to predict and identify sources of downtime and inefficiency — saving time, energy and costs.

Additional NVIDIA Mission Control software benefits include:

  • Simplified cluster setup and provisioning with new automation and standardized application programming interfaces to speed time to deployment with integrated inventory management and visualizations.
  • Seamless workload orchestration for simplified Slurm and Kubernetes workflows.
  • Energy-optimized power profiles to balance power requirements and tune GPU performance for various workload types with developer-selectable controls.
  • Autonomous job recovery to identify, isolate and recover from inefficiencies without manual intervention to maximize developer productivity and infrastructure resiliency.
  • Customizable dashboards that track key performance indicators with access to critical telemetry data about clusters.
  • On-demand health checks to validate hardware and cluster performance throughout the infrastructure lifecycle.
  • Building management integration for enhanced coordination with building management systems to provide more control for power and cooling events, including rapid leakage detection.

Leading System Makers Bring NVIDIA Mission Control to Grace Blackwell Servers  

Leading system makers plan to offer NVIDIA GB200 NVL72 and GB300 NVL72 systems with NVIDIA Mission Control.

Dell plans to offer NVIDIA Mission Control software as part of the Dell AI Factory with NVIDIA.

“The AI industrial revolution demands efficient infrastructure that adapts as fast as business evolves, and the Dell AI Factory with NVIDIA delivers with comprehensive compute, networking, storage and support,” said Ihab Tarazi, chief technology officer and senior vice president at Dell Technologies. “Pairing NVIDIA Mission Control software and Dell PowerEdge XE9712 and XE9680 servers helps enterprises scale models effortlessly to meet the demands of both training and inference, turning data into actionable insights faster than ever before.”

HPE will offer the NVIDIA GB200 NVL72 by HPE and GB300 NVL72 by HPE systems with NVIDIA Mission Control software.

“We are helping service providers and cutting-edge enterprises to rapidly deploy, scale, and optimize complex AI clusters capable of training trillion parameter models,” said Trish Damkroger, senior vice president and general manager, HPC & AI Infrastructure Solutions at HPE. “As part of our collaboration with NVIDIA, we will deliver NVIDIA Grace Blackwell rack-scale systems and Mission Control software with HPE’s global services and direct liquid cooling expertise to power the new AI era.”

Lenovo plans to update its Lenovo Hybrid AI Advantage with NVIDIA systems to include NVIDIA Mission Control software.

“Bringing NVIDIA Mission Control software to Lenovo Hybrid AI Advantage with NVIDIA systems empowers enterprises to navigate the demands of generative and agentic AI workloads with unmatched agility,” said Brian Connors, worldwide vice president and general manager of enterprise and SMB segment and AI, infrastructure solutions group, at Lenovo. “By automating infrastructure orchestration and enabling seamless transitions between training and inference workloads, Lenovo and NVIDIA are helping customers scale AI innovation at the speed of business.”

Supermicro plans to incorporate NVIDIA Mission Control software into its Supercluster systems.

“Supermicro is proud to team with NVIDIA on a Grace Blackwell NVL72 system that is fully supported by NVIDIA Mission Control software,” said Cenly Chen, chief growth officer at Supermicro. “Running on Supermicro’s AI SuperCluster systems with NVIDIA Grace Blackwell, NVIDIA Mission Control software provides customers with a seamless management software suite to maximize performance on both current NVIDIA GB200 NVL72 systems and future platforms such as NVIDIA GB300 NVL72.”

Base Command Manager Offers Free Kickstart for AI Cluster Management

To help enterprises with infrastructure management, NVIDIA Base Command Manager software is expected to soon be available for free for up to eight accelerators per system, for any cluster size, with the option to purchase NVIDIA Enterprise Support separately.

Availability

NVIDIA Mission Control for NVIDIA DGX GB200 and DGX B200 systems is available now. NVIDIA GB200 NVL72 systems with Mission Control are expected to soon be available from Dell, HPE, Lenovo and Supermicro.

NVIDIA Mission Control is expected to become available for the latest NVIDIA DGX GB300 and DGX B300 systems, as well as GB300 NVL72 systems from leading global providers, later this year.

See notice regarding software product information.

Read More

Where AI and Graphics Converge: NVIDIA Blackwell Universal Data Center GPU Accelerates Demanding Enterprise Workloads

The first NVIDIA Blackwell-powered data center GPU built for both enterprise AI and visual computing — the NVIDIA RTX PRO 6000 Blackwell Server Edition — is designed to accelerate the most demanding AI and graphics applications for every industry.

Compared to the previous-generation NVIDIA Ada Lovelace architecture L40S GPU, the RTX PRO 6000 Blackwell Server Edition GPU will deliver a multifold increase in performance across a wide array of enterprise workloads — up to 5x higher large language model (LLM) inference throughput for agentic AI applications, nearly 7x faster genomics sequencing, 3.3x speedups for text-to-video generation, nearly 2x faster inference for recommender systems and over 2x speedups for rendering.

It’s part of the NVIDIA RTX PRO Blackwell series of workstation and server GPUs announced today at NVIDIA GTC, the global AI conference taking place through Friday, March 21, in San Jose, California. The RTX PRO lineup includes desktop, laptop and data center GPUs that support AI and creative workloads across industries.

With the RTX PRO 6000 Blackwell Server Edition, enterprises across various sectors — including architecture, automotive, cloud services, financial services, game development, healthcare, manufacturing, media and entertainment, and retail — can achieve breakthrough performance for workloads such as multimodal generative AI, data analytics, engineering simulation and visual computing.

Content creation, semiconductor manufacturing and genomics analysis companies are already set to harness its capabilities to accelerate compute-intensive, AI-enabled workflows.

Universal GPU Delivers Powerful Capabilities for AI and Graphics 

The RTX PRO 6000 Blackwell Server Edition packages powerful RTX AI and graphics capabilities in a passively cooled form factor designed to run 24/7 in data center environments. With 96GB of ultrafast GDDR7 memory and support for Multi-Instance GPU, or MIG, each RTX PRO 6000 can be partitioned into as many as four fully isolated instances with 24GB each to run simultaneous AI and graphics workloads.

RTX PRO 6000 is the first universal GPU to enable secure AI with NVIDIA Confidential Computing, which protects AI models and sensitive data from unauthorized access with strong, hardware-based security — providing a physically isolated trusted execution environment to secure the entire workload while data is in use.

To support enterprise-scale deployments, the RTX PRO 6000 can be configured in high-density accelerated computing platforms for distributed inference workloads — or used to deliver virtual workstations with NVIDIA vGPU software to power AI development and graphics-intensive applications.

The RTX PRO 6000 GPU delivers supercharged inferencing performance across a broad range of AI models and accelerates real-time, photorealistic ray tracing of complex virtual environments. It includes the latest Blackwell hardware and software innovations like fifth-generation Tensor Cores, fourth-generation RT Cores, DLSS 4, a fully integrated media pipeline and second-generation Transformer Engine with support for FP4 precision.

Enterprises can run the NVIDIA Omniverse and NVIDIA AI Enterprise platforms at scale on RTX PRO 6000 Blackwell Server Edition GPUs to accelerate the development and deployment of agentic and physical AI applications, such as image and video generation, LLM inference, recommender systems, computer vision, digital twins and robotics simulation.

Accelerated AI Inference and Visual Computing for Any Industry

Black Forest Labs, creator of the popular FLUX image generation AI, aims to develop and optimize state-of-the-art text-to-image models using RTX PRO 6000 Server Edition GPUs.

“With the powerful multimodal inference capabilities of the RTX PRO 6000 Server Edition, our customers will be able to significantly reduce latency for image generation workflows,” said name, title at Black Forest Labs. “We anticipate that, with the server edition GPUs’ support for FP4 precision, our Flux models will run faster, enabling interactive, AI-accelerated content creation.”

Cloud graphics company OTOY will optimize its OctaneRender real-time rendering application for NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs.

“The new NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs unlock brand-new workflows that were previously out of reach for 3D content creators,” said Jules Urbach, CEO of OTOY and founder of the Render Network. “With 96 GB of VRAM, the new server-edition GPUs can run complex neural rendering models within OctaneRender’s GPU path-tracer, enabling artists to tap into incredible new features and tools that blend the precision of traditional CGI augmented with frontier generative AI technology.”

Semiconductor equipment manufacturer KLA plans to use the RTX PRO 6000 Blackwell Server Edition to accelerate inference workloads powering the wafer manufacturing process — the creation of thin discs of semiconductor materials that are core to integrated circuits.

KLA and NVIDIA have worked together since 2008 to advance KLA’s physics-based AI with optimized high-performance computing solutions. KLA’s industry-leading inspection and metrology systems capture and process images by running complex AI algorithms at lightning-fast speeds to find the most critical semiconductor defects.

“Based on early results, we expect great performance from the RTX PRO 6000 Blackwell Server Edition,” said Kris Bhaskar, senior fellow and vice president of AI initiatives at KLA. “The increased memory capacity, FP4 reduced precision and new computational capabilities of NVIDIA Blackwell are going to be particularly helpful to KLA and its customers.”

Boosting Genomics and Drug Discovery Workloads

The RTX PRO 6000 Blackwell Server Edition also demonstrates game-changing acceleration for genomic analysis and drug discovery inference workloads, enabled by a new class of dynamic programming instructions.

On a single RTX PRO 6000 Blackwell Server Edition GPU, Fastq2bam and DeepVariant — elements of the NVIDIA Parabricks pipeline for germline analysis — run up to 1.5x faster compared with using an L40S GPU, and 1.75x faster compared with using an NVIDIA H100 GPU.

For Smith-Waterman, a core algorithm used in many sequence alignment and variant calling applications, RTX PRO 6000 Blackwell Server Edition GPUs accelerate throughput up to 6.8x compared with L40S GPUs.
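
For reference, Smith-Waterman fills a dynamic-programming matrix of match, mismatch and gap scores and reports the best local alignment score. The pure-Python sketch below is a textbook illustration of the algorithm only, not NVIDIA's GPU implementation; the scoring values are arbitrary.

# Textbook Smith-Waterman local alignment score (illustration only; GPU
# implementations parallelize this dynamic program across the scoring matrix).
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))  # best local alignment score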

And for OpenFold2, an AI model that predicts protein structures for drug discovery research, RTX PRO 6000 Blackwell Server Edition GPUs boost inference performance by up to 4.8x compared with L40S GPUs.

Genomics company Oxford Nanopore Technologies is collaborating with NVIDIA to bring the latest AI and accelerated computing technologies to its sequencing systems.

“The NVIDIA Blackwell architecture will help us drive the real-time sequencing analysis of anything, by anyone, anywhere,” said Chris Seymour, vice president of advanced platform development at Oxford Nanopore Technologies. “With the RTX PRO 6000 Blackwell Server Edition, we have seen up to a 2x improvement in basecalling speed across our Dorado platform.”

Availability via Global Network of Cloud Providers and System Partners

Platforms featuring the RTX PRO 6000 Blackwell Server Edition will be available from a global ecosystem of partners starting in May.

AWS, Google Cloud, Microsoft Azure, IBM Cloud, CoreWeave, Crusoe, Lambda, Nebius and Vultr will be among the first cloud service providers and GPU cloud providers to offer instances featuring the RTX PRO 6000 Blackwell Server Edition.

Cisco, Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro are expected to deliver a wide range of servers featuring the RTX PRO 6000 Blackwell Server Edition, as are Advantech, Aetina, Aivres, ASRockRack, ASUS, Compal, Foxconn, GIGABYTE, Inventec, MSI, Pegatron, Quanta Cloud Technology (QCT), MiTAC Computing, NationGate, Wistron and Wiwynn.

To learn more about the NVIDIA RTX PRO Blackwell series and other advancements in AI, watch the GTC keynote by NVIDIA founder and CEO Jensen Huang.

Read More

AI Factories, Built Smarter: New Omniverse Blueprint Advances AI Factory Design and Simulation

AI is now mainstream and driving unprecedented demand for AI factories — purpose-built infrastructure dedicated to AI training and inference — and the production of intelligence.

Many of these AI factories will be gigawatt-scale. Bringing up a single gigawatt AI factory is an extraordinary act of engineering and logistics — requiring tens of thousands of workers across suppliers, architects, contractors and engineers to build, ship and assemble nearly 5 billion components and over 210,000 miles of fiber cable.

To help design and optimize these AI factories, NVIDIA today unveiled at GTC the NVIDIA Omniverse Blueprint for AI factory design and operations.

During his GTC keynote, NVIDIA founder and CEO Jensen Huang showcased how NVIDIA’s data center engineering team developed an application on the Omniverse Blueprint to plan, optimize and simulate a 1 gigawatt AI factory. With connections to leading simulation tools such as the Cadence Reality Digital Twin Platform and ETAP, engineering teams can test and optimize power, cooling and networking long before construction starts.

Engineering AI Factories: A Simulation-First Approach

The NVIDIA Omniverse Blueprint for AI factory design and operations uses OpenUSD libraries that enable developers to aggregate 3D data from disparate sources such as the building itself, NVIDIA accelerated computing systems and power or cooling units from providers such as Schneider Electric and Vertiv.
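
As an illustration of this aggregation pattern (not the blueprint's actual tooling), a minimal OpenUSD sketch using the open-source pxr Python bindings composes separate facility layers by reference; the file names here are hypothetical placeholders.

# Minimal OpenUSD sketch: aggregate 3D data from separate sources into one stage.
# The referenced .usd files are hypothetical stand-ins for building, compute and
# power/cooling assets authored by different vendors and tools.
from pxr import Usd, UsdGeom

stage = Usd.Stage.CreateNew("ai_factory.usda")
UsdGeom.Xform.Define(stage, "/Factory")
for name, asset in [
    ("Building", "building_shell.usd"),
    ("ComputeRacks", "dgx_superpod_layout.usd"),
    ("PowerCooling", "power_cooling_units.usd"),
]:
    prim = stage.DefinePrim(f"/Factory/{name}")
    prim.GetReferences().AddReference(asset)  # each layer stays in its source file
stage.GetRootLayer().Save()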

By unifying the design and simulation of billions of components, the blueprint helps engineers address complex challenges like:

  • Component integration and space optimization — Unifying the design and simulation of NVIDIA DGX SuperPODs, GB300 NVL72 systems and their 5 billion components.
  • Cooling system performance and efficiency — Using Cadence Reality Digital Twin Platform, accelerated by NVIDIA CUDA and Omniverse libraries, to simulate and evaluate hybrid air- and liquid-cooling solutions from Vertiv and Schneider Electric.
  • Power distribution and reliability — Designing scalable, redundant electrical systems with ETAP to simulate power-block efficiency and reliability.
  • Networking topology and logic — Fine-tuning high-bandwidth infrastructure with NVIDIA Spectrum-X networking and the NVIDIA Air platform.

Breaking Down Engineering Silos With Omniverse

One of the biggest challenges in AI factory construction is that different teams — power, cooling and networking — operate in silos, leading to inefficiencies and potential failures.

Using the blueprint, engineers can now:

  • Collaborate in full context — Multiple disciplines can iterate in parallel, sharing live simulations that reveal how changes in one domain affect another.
  • Optimize energy usage — Real-time simulation updates enable teams to find the most efficient designs for AI workloads.
  • Eliminate failure points — By validating redundancy configurations before deployment, organizations reduce the risk of costly downtime.
  • Model real-world conditions — Predict and test how different AI workloads will impact cooling, power stability and network congestion.

By integrating real-time simulation across disciplines, the blueprint allows engineering teams to explore various configurations to model cost of ownership and optimize power utilization.

Real-Time Simulations for Faster Decision-Making

In Huang’s demo, engineers adjust AI factory configurations in real time — and instantly see the impact.

For example, a small tweak in cooling layout significantly improved efficiency — a detail that could have been missed on paper. And instead of waiting hours for simulation results, teams could test and refine strategies in just seconds.

Once an optimal design was finalized, Omniverse streamlined communication with suppliers and construction teams — ensuring that what gets built matches the model, down to the last detail.

Future-Proofing AI Factories

AI workloads aren’t static. The next wave of AI applications will push power, cooling and networking demands even further. The Omniverse Blueprint for AI factory design and operations helps ensure AI factories are ready by offering:

  • Workload-aware simulation — Predict how changes in AI workloads will affect power and cooling at data center scale.
  • Failure scenario testing — Model grid failures, cooling leaks and power spikes to ensure resilience.
  • Scalable upgrades — Plan for AI factory expansions and estimate infrastructure needs years ahead.

And when planning for retrofits and upgrades, users can easily test and simulate cost and downtime — delivering a future-proof AI factory.

For AI factory operators, staying ahead isn’t just about efficiency — it’s about preventing infrastructure failures that could cost millions of dollars per day.

For a 1 gigawatt AI factory, every day of downtime can cost over $100 million. By solving infrastructure challenges in advance, the blueprint reduces both risk and time to deployment.

Road to Agentic AI for AI Factory Operation

NVIDIA is working on the next evolution of the blueprint to expand into AI-enabled operations, working with key companies such as Vertech and Phaidra.

Vertech is collaborating with the NVIDIA data center engineering team on NVIDIA’s advanced AI factory control system, which integrates IT and operational technology data to enhance resiliency and operational visibility.

Phaidra is working with NVIDIA to integrate reinforcement-learning AI agents into Omniverse. These agents optimize thermal stability and energy efficiency through real-time scenario simulation, creating digital twins that continuously adapt to changing hardware and environmental conditions.

The AI Data Center Boom

AI is reshaping the global data center landscape. With $1 trillion projected for AI-driven data center upgrades, digital twin technology is no longer optional — it’s essential.

The NVIDIA Omniverse Blueprint for AI factory design and operations is poised to help NVIDIA and its ecosystem of partners lead this transformation — letting AI factory operators stay ahead of ever-evolving AI workloads, minimize downtime and maximize efficiency.

Learn more about NVIDIA Omniverse, watch the GTC keynote, register for Cadence’s GTC session to see the Omniverse Blueprint in action and read more about AI factories.

See notice regarding software product information.

Read More

Amazon Bedrock Guardrails announces IAM Policy-based enforcement to deliver safe AI interactions

As generative AI adoption accelerates across enterprises, maintaining safe, responsible, and compliant AI interactions has never been more critical. Amazon Bedrock Guardrails provides configurable safeguards that help organizations build generative AI applications with industry-leading safety protections. With Amazon Bedrock Guardrails, you can implement safeguards in your generative AI applications that are customized to your use cases and responsible AI policies. You can create multiple guardrails tailored to different use cases and apply them across multiple foundation models (FMs), improving user experiences and standardizing safety controls across generative AI applications. Beyond Amazon Bedrock models, the service offers the flexible ApplyGuardrails API that enables you to assess text using your pre-configured guardrails without invoking FMs, allowing you to implement safety controls across generative AI applications—whether running on Amazon Bedrock or on other systems—at both input and output levels.
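
As a minimal sketch (the guardrail ID, version, region and sample text are placeholders), evaluating text against a pre-configured guardrail without invoking a model uses the apply_guardrail method on the boto3 bedrock-runtime client:

# Evaluate text against an existing guardrail without calling a foundation model.
# Guardrail ID, version, region and the sample text are placeholders.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="exampleguardrail",  # or the full guardrail ARN
    guardrailVersion="1",                    # or "DRAFT"
    source="INPUT",                          # use "OUTPUT" to check model responses
    content=[{"text": {"text": "Tell me about our refund policy."}}],
)
print(response["action"])  # e.g., "GUARDRAIL_INTERVENED" or "NONE"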

Today, we’re announcing a significant enhancement to Amazon Bedrock Guardrails: AWS Identity and Access Management (IAM) policy-based enforcement. This powerful capability enables security and compliance teams to establish mandatory guardrails for every model inference call, making sure organizational safety policies are consistently enforced across AI interactions. This feature enhances AI governance by enabling centralized control over guardrail implementation.

Challenges with building generative AI applications

Organizations deploying generative AI face critical governance challenges: content appropriateness, where models might produce undesirable responses to problematic prompts; safety concerns, with potential generation of harmful content even from innocent prompts; privacy protection requirements for handling sensitive information; and consistent policy enforcement across AI deployments.

Perhaps most challenging is making sure that appropriate safeguards are applied consistently across AI interactions within an organization, regardless of which team or individual is developing or deploying applications.

Amazon Bedrock Guardrails capabilities

Amazon Bedrock Guardrails enables you to implement safeguards in generative AI applications customized to your specific use cases and responsible AI policies. Guardrails currently supports six types of policies, summarized below (a brief configuration sketch follows the list):

  • Content filters – Configurable thresholds across six harmful categories: hate, insults, sexual, violence, misconduct, and prompt injections
  • Denied topics – Definition of specific topics to be avoided in the context of an application
  • Sensitive information filters – Detection and removal of personally identifiable information (PII) and custom regex entities to protect user privacy
  • Word filters – Blocking of specific words in generative AI applications, such as harmful words, profanity, or competitor names and products
  • Contextual grounding checks – Detection and filtering of hallucinations in model responses by verifying if the response is properly grounded in the provided reference source and relevant to the user query
  • Automated reasoning – Prevention of factual errors from hallucinations using sound mathematical, logic-based algorithmic verification and reasoning processes to verify the information generated by a model, so outputs align with known facts and aren’t based on fabricated or inconsistent data
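
The sketch below shows how a couple of these policies might be configured with boto3. It is a minimal example, not a complete policy set: the guardrail name, blocked messages, filter strengths and denied topic are illustrative assumptions, and the parameter names reflect the CreateGuardrail API as exposed by boto3.

# Minimal sketch: create a guardrail with a content filter and a denied topic.
# Name, messages, filter strengths and the topic definition are illustrative.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.create_guardrail(
    name="exampleguardrail",
    blockedInputMessaging="Sorry, I can't help with that request.",
    blockedOutputsMessaging="Sorry, I can't provide that response.",
    contentPolicyConfig={
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
        ]
    },
    topicPolicyConfig={
        "topicsConfig": [
            {
                "name": "InvestmentAdvice",
                "definition": "Recommendations about specific investments or portfolios.",
                "type": "DENY",
            }
        ]
    },
)
print(response["guardrailId"], response["version"])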

Policy-based enforcement of guardrails

Security teams often have organizational requirements to enforce the use of Amazon Bedrock Guardrails for every inference call to Amazon Bedrock. To support this requirement, Amazon Bedrock Guardrails provides the new IAM condition key bedrock:GuardrailIdentifier, which can be used in IAM policies to enforce the use of a specific guardrail for model inference. The condition key applies to model inference APIs such as InvokeModel and InvokeModelWithResponseStream, which are used in the policy examples that follow.

The following diagram illustrates the policy-based enforcement workflow.

If the guardrail configured in your IAM policy doesn’t match the guardrail specified in the request, the request will be rejected with an access denied exception, enforcing compliance with organizational policies.
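
For instance, a compliant request attaches the enforced guardrail to the inference call. The following boto3 sketch uses a placeholder model ID, account ID and guardrail ARN; the guardrailIdentifier and guardrailVersion request fields must match the values required by the IAM condition.

# Invoke a model with the guardrail required by the IAM policy. If the guardrail
# fields don't match the policy's condition, the request is rejected with an
# access denied exception.
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    guardrailIdentifier="arn:aws:bedrock:us-east-1:111122223333:guardrail/exampleguardrail",
    guardrailVersion="1",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Summarize our travel policy."}],
    }),
)
print(json.loads(response["body"].read()))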

Policy examples

In this section, we present several policy examples demonstrating how to enforce guardrails for model inference.

Example 1: Enforce the use of a specific guardrail and its numeric version

The following example illustrates the enforcement of exampleguardrail and its numeric version 1 during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:1"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

The added explicit deny rejects requests that call the listed actions with any other GuardrailIdentifier or GuardrailVersion value, regardless of other permissions the user might have.

Example 2: Enforce the use of a specific guardrail and its draft version

The following example illustrates the enforcement of exampleguardrail and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 3: Enforce the use of a specific guardrail and its numeric versions

The following example illustrates the enforcement of exampleguardrail and its numeric versions during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail:*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 4: Enforce the use of a specific guardrail and its versions, including the draft

The following example illustrates the enforcement of exampleguardrail and its versions, including the draft, during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotLike": {
                    "bedrock:GuardrailIdentifier": "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail*"
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail"
            ]
        }
    ]
}

Example 5: Enforce the use of a specific guardrail and version pair from a list of guardrail and version pairs

The following example illustrates the enforcement of exampleguardrail1 and its version 1, or exampleguardrail2 and its version 2, or exampleguardrail3 and its draft version during model inference:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeFoundationModelStatement1",
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "InvokeFoundationModelStatement2",
            "Effect": "Deny",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": [
                "arn:aws:bedrock:region::foundation-model/*"
            ],
            "Condition": {
                "StringNotEquals": {
                    "bedrock:GuardrailIdentifier": [
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1:1",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2:2",
                        "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
                    ]
                }
            }
        },
        {
            "Sid": "ApplyGuardrail",
            "Effect": "Allow",
            "Action": [
                "bedrock:ApplyGuardrail"
            ],
            "Resource": [
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail1",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail2",
                "arn:aws:bedrock:<region>:<account-id>:guardrail/exampleguardrail3"
            ]
        }
    ]
}

Known limitations

When implementing policy-based guardrail enforcement, be aware of these limitations:

  • At the time of this writing, Amazon Bedrock Guardrails doesn’t support resource-based policies for cross-account access.
  • If a user assumes a role that has a specific guardrail configured using the bedrock:GuardrailIdentifier condition key, the user can strategically use input tags to help avoid having guardrail checks applied to certain parts of their prompt. Input tags allow users to mark specific sections of text that should be processed by guardrails, leaving other sections unprocessed. For example, a user could intentionally leave sensitive or potentially harmful content outside of the tagged sections, preventing those portions from being evaluated against the guardrail policies. However, regardless of how the prompt is structured or tagged, the guardrail will still be fully applied to the model’s response.
  • If a user has a role configured with a specific guardrail requirement (using the bedrock:GuardrailIdentifier condition), they shouldn’t use that same role to access services like Amazon Bedrock Knowledge Bases RetrieveAndGenerate or Amazon Bedrock Agents InvokeAgent. These higher-level services work by making multiple InvokeModel calls behind the scenes on the user’s behalf. Although some of these calls might include the required guardrail, others don’t. When the system attempts to make these guardrail-free calls using a role that requires guardrails, it results in AccessDenied errors, breaking the functionality of these services. To help avoid this issue, organizations should separate permissions—using different roles for direct model access with guardrails versus access to these composite Amazon Bedrock services.

Conclusion

The new IAM policy-based guardrail enforcement in Amazon Bedrock represents a crucial advancement in AI governance as generative AI becomes integrated into business operations. By enabling centralized policy enforcement, security teams can maintain consistent safety controls across AI applications regardless of who develops or deploys them, effectively mitigating risks related to harmful content, privacy violations, and bias. This approach offers significant advantages: it scales efficiently as organizations expand their AI initiatives without creating administrative bottlenecks, helps prevent technical debt by standardizing safety implementations, and enhances the developer experience by allowing teams to focus on innovation rather than compliance mechanics.

This capability demonstrates organizational commitment to responsible AI practices through comprehensive monitoring and audit mechanisms. Organizations can use model invocation logging in Amazon Bedrock to capture complete request and response data in Amazon CloudWatch Logs or Amazon Simple Storage Service (Amazon S3) buckets, including specific guardrail trace documentation showing when and how content was filtered. Combined with AWS CloudTrail integration that records guardrail configurations and policy enforcement actions, businesses can confidently scale their generative AI initiatives with appropriate safety mechanisms protecting their brand, customers, and data—striking the essential balance between innovation and ethical responsibility needed to build trust in AI systems.

Get started today with Amazon Bedrock Guardrails and implement configurable safeguards that balance innovation with responsible AI governance across your organization.


About the Authors

Shyam Srinivasan is on the Amazon Bedrock Guardrails product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.

Antonio Rodriguez is a Principal Generative AI Specialist Solutions Architect at AWS. He helps companies of all sizes solve their challenges, embrace innovation, and create new business opportunities with Amazon Bedrock. Apart from work, he loves to spend time with his family and play sports with his friends.

Satveer Khurpa is a Sr. WW Specialist Solutions Architect, Amazon Bedrock at Amazon Web Services. In this role, he uses his expertise in cloud-based architectures to develop innovative generative AI solutions for clients across diverse industries. Satveer’s deep understanding of generative AI technologies allows him to design scalable, secure, and responsible applications that unlock new business opportunities and drive tangible value.

Read More

NVIDIA Launches NVIDIA Halos, a Full-Stack, Comprehensive Safety System for Autonomous Vehicles

Physical AI is unlocking new possibilities at the intersection of autonomy and robotics — accelerating, in particular, the development of autonomous vehicles (AVs). The right technology and frameworks are crucial to ensuring the safety of drivers, passengers and pedestrians.

That’s why NVIDIA today announced NVIDIA Halos — a comprehensive safety system bringing together NVIDIA’s lineup of automotive hardware and software safety solutions with its cutting-edge AI research in AV safety.

Halos spans chips and software to tools and services to help ensure safe development of AVs from the cloud to the car, with a focus on AI-based, end-to-end AV stacks.

“With the launch of Halos, we’re empowering partners and developers to choose the state-of-the-art technology elements they need to build their own unique offerings, driving forward a shared mission to create safe and reliable autonomous vehicles,” said Riccardo Mariani, vice president of industry safety at NVIDIA. “Halos complements existing safety practices and can potentially accelerate standardization and regulatory compliance.”

At the Heart of Halos

Halos is a holistic safety system on three different but complementary levels.

At the technology level, it spans platform, algorithmic and ecosystem safety. At the development level, it includes design-time, deployment-time and validation-time guardrails. And at the computational level, it spans AI training to deployment, using three powerful computers — NVIDIA DGX for AI training, NVIDIA Omniverse and NVIDIA Cosmos running on NVIDIA OVX for simulation, and NVIDIA DRIVE AGX for deployment.

“Halos’ holistic approach to safety is particularly critical in a setting where companies want to harness the power of generative AI for increasingly capable AV systems developed end to end, which preclude traditional compositional design and verification,” said Marco Pavone, lead AV researcher at NVIDIA.

AI Systems Inspection Lab

Serving as an entry point to Halos is the NVIDIA AI Systems Inspection Lab, which allows automakers and developers to verify the safe integration of their products with NVIDIA technology.

The AI Systems Inspection Lab, announced at the CES trade show earlier this year, is the first worldwide program to be accredited by the ANSI National Accreditation Board for an inspection plan integrating functional safety, cybersecurity, AI safety and regulations into a unified safety framework.

Inaugural members of the AI Systems Inspection Lab include Ficosa, OMNIVISION, onsemi and Continental.

“Being a member of the AI Systems Inspection Lab means working at the forefront of automotive systems innovation and integrity,” said Cristian Casorran Hontiyuelo, advanced driver-assistance system engineering and product manager at Ficosa.

“Cars are so much more than just transportation,” said Paul Wu, head of product marketing for automotive at OMNIVISION. “They’ve also become our entertainment and information hubs. Vehicles must continually evolve in their ability to keep us safe. We are pleased to join NVIDIA’s new AI Systems Safety Lab as a demonstration of our commitment to achieving the highest levels of safety in our product offerings.”

“We are delighted to be working with NVIDIA and included in the launch of the NVIDIA AI Systems Inspection Lab,” said Geoff Ballew, general manager of the automotive sensing division at onsemi. “This unique initiative will improve road safety in an innovative way. We look forward to the advancements it will bring.”

“We are pleased to participate in the newly launched NVIDIA Drive AI Systems Inspection Lab and to further intensify the fruitful, ongoing collaboration between our two companies,” said Norbert Hammerschmidt, head of components business at Continental.

Key Elements of Halos

Halos is built on three focus areas: platform safety, algorithmic safety and ecosystem safety.

Platform Safety

Halos features a safety-assessed system-on-a-chip (SoC) with hundreds of built-in safety mechanisms.

It also includes NVIDIA DriveOS software, a safety-certified operating system that extends from CPU to GPU; a safety-assessed base platform that delivers the foundational computer needed to enable safe systems for all types of applications; and DRIVE AGX Hyperion, a hardware platform that connects SoC, DriveOS and sensors in an electronic control unit architecture.

Algorithmic Safety

Halos includes libraries for safety data loading and accelerators, and application programming interfaces for safety data creation, curation and reconstruction to filter out, for example, undesirable behaviors and biases before training.

It also features rich training, simulation and validation environments harnessing the NVIDIA Omniverse Blueprint for AV simulation with NVIDIA Cosmos world foundation models to train, test and validate AVs. In addition, it boasts a diverse AV stack combining modular components with end-to-end AI models to ensure safety with cutting-edge AI models in the loop.

Ecosystem Safety

Halos includes safety datasets with diverse, unbiased data, as well as safe deployment workflows, comprising triaging workflows and automated safety evaluations, along with a data flywheel for continual safety improvements — demonstrating leadership in AV safety standardization and regulation.

Safety Track Record

Halos brings together a vast amount of safety-focused technology research, development, deployment, partnerships and collaborations by NVIDIA, including:

  • 15,000+ engineering years invested in vehicle safety
  • 10,000+ hours of contributions to international standards committees
  • 1,000+ AV-safety patents filed
  • 240+ AV-safety research papers published
  • 30+ safety and cybersecurity certificates

It also dovetails with recent significant safety certifications and assessments of NVIDIA automotive products, including:

  • The NVIDIA DriveOS 6.0 operating system conforms with ISO 26262 automotive safety integrity level (ASIL D) standards.
  • TÜV SÜD granted the ISO/SAE 21434 Cybersecurity Process certification to NVIDIA for its automotive SoC, platform and software engineering processes.
  • TÜV Rheinland performed an independent safety assessment of NVIDIA DRIVE AV for the United Nations Economic Commission for Europe related to safety requirements for complex electronic systems.

To learn more about NVIDIA’s approach to automotive safety, attend AV Safety Day today at NVIDIA GTC, a global AI conference running through Friday, March 21.

See notice regarding software product information.

Read More

NVIDIA Accelerates Science and Engineering With CUDA-X Libraries Powered by GH200 and GB200 Superchips

Scientists and engineers of all kinds are equipped to solve tough problems a lot faster with NVIDIA CUDA-X libraries powered by NVIDIA GB200 and GH200 superchips.

As announced today at the NVIDIA GTC global AI conference, developers can now take advantage of tighter automatic integration and coordination between CPU and GPU resources — enabled by CUDA-X working with these latest superchip architectures — resulting in up to 11x speedups for computational engineering tools and 5x larger calculations compared with traditional accelerated computing architectures.

This greatly accelerates and improves workflows in engineering simulation, design optimization and more, helping scientists and researchers reach groundbreaking results faster.

NVIDIA released CUDA in 2006, opening up a world of applications to the power of accelerated computing. Since then, NVIDIA has built more than 900 domain-specific NVIDIA CUDA-X libraries and AI models, making it easier to adopt accelerated computing and driving incredible scientific breakthroughs. Now, CUDA-X brings accelerated computing to a broad new set of engineering disciplines, including astronomy, particle physics, quantum physics, automotive, aerospace and semiconductor design.

The NVIDIA Grace CPU architecture delivers a significant boost to memory bandwidth while reducing power consumption. And NVIDIA NVLink-C2C interconnects provide such high bandwidth that the GPU and CPU can share memory, allowing developers to write less-specialized code, run larger problems and improve application performance.
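
One common way to exploit this kind of CPU-GPU memory coherency from Python is CUDA managed (unified) memory. The CuPy sketch below is a general illustration under that assumption, not a CUDA-X example: on coherent platforms such as GH200, oversubscribed arrays can spill into CPU memory over NVLink-C2C instead of failing outright.

# Route CuPy allocations through CUDA managed memory so arrays can exceed GPU
# memory; coherent CPU-GPU platforms keep the overflow usable at high bandwidth.
import cupy as cp

cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)
n = 1 << 28                       # increase to oversubscribe GPU memory if desired
x = cp.ones(n, dtype=cp.float64)
y = 2.0 * x                       # ordinary kernels run unchanged on managed memory
print(float(y.sum()))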

Accelerating Engineering Solvers With NVIDIA cuDSS

NVIDIA’s superchip architectures allow users to extract greater performance from the same underlying GPU by making more efficient use of CPU and GPU processing capabilities.

The NVIDIA cuDSS library is used to solve large engineering simulation problems involving sparse matrices for applications such as design optimization, electromagnetic simulation workflows and more. cuDSS uses Grace CPU memory and the high-bandwidth NVLink-C2C interconnect to factorize and solve large matrices that normally wouldn’t fit in device memory. This enables users to solve extremely large problems in a fraction of the time.

The coherent shared memory between the GPU and the Grace CPU minimizes data movement, significantly reducing overhead for large systems. For a range of large computational engineering problems, tapping the Grace CPU memory and superchip architecture with cuDSS hybrid memory accelerated the most heavy-duty solution steps by up to 4x on the same GPU.

Ansys has integrated cuDSS into its HFSS solver, delivering significant performance enhancements for electromagnetic simulations. With cuDSS, HFSS software achieves up to an 11x speed improvement for the matrix solver.

Altair OptiStruct has also adopted the cuDSS Direct Sparse Solver library, substantially accelerating its finite element analysis workloads.

These performance gains are achieved by optimizing key operations on the GPU while intelligently using CPUs for shared memory and heterogeneous CPU and GPU execution. cuDSS automatically detects areas where CPU utilization provides additional benefits, further enhancing efficiency.

Scaling Up at Warp Speed With Superchip Memory

Scaling memory-limited applications on a single GPU becomes possible with the GB200 and GH200 architectures’ NVLink-C2C interconnects, which provide CPU and GPU memory coherency.

Many engineering simulations are limited by scale and require massive simulations to produce the resolution necessary to design equipment with intricate components, such as aircraft engines. By tapping into the ability to seamlessly read and write between CPU and GPU memories, engineers can easily implement out-of-core solvers to process larger data.

For example, using NVIDIA Warp — a Python-based framework for accelerating data generation and spatial computing applications — Autodesk performed simulations of up to 48 billion cells using eight GH200 nodes. This is more than 5x larger than the simulations possible using eight NVIDIA H100 nodes.
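
Warp itself is open source, and while a 48-billion-cell simulation is far beyond a snippet, a minimal kernel shows the programming model; this toy sketch assumes only that the warp-lang package is installed and is not Autodesk's simulation.

# Minimal NVIDIA Warp sketch: define and launch a GPU kernel from Python.
import warp as wp

wp.init()

@wp.kernel
def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
    i = wp.tid()
    y[i] = a * x[i] + y[i]

n = 1_000_000
x = wp.ones(n, dtype=float)
y = wp.zeros(n, dtype=float)
wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
print(y.numpy()[:4])  # -> [2. 2. 2. 2.]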

Powering Quantum Computing Research With NVIDIA cuQuantum

Quantum computers promise to accelerate problems that are core to many science and industry disciplines. Shortening the time to useful quantum computing rests heavily on the ability to simulate extremely complex quantum systems.

Simulations allow researchers to develop new algorithms today that will run at scales suitable for tomorrow’s quantum computers. They also play a key role in improving quantum processors, running complex simulations of performance and noise characteristics of new qubit designs.

So-called state vector simulations of quantum algorithms require matrix operations to be performed on exponentially large vector objects that must be stored in memory. Tensor network simulations, on the other hand, simulate quantum algorithms through tensor contractions and can enable hundreds or thousands of qubits to be simulated for certain important classes of applications.
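
To make the memory pressure concrete: a full state vector stores 2^n complex amplitudes, so requirements double with every added qubit. A quick back-of-the-envelope check, assuming 16-byte complex128 amplitudes:

# Memory footprint of a full state-vector simulation, assuming one complex128
# amplitude (16 bytes) per basis state.
def statevector_bytes(num_qubits: int, bytes_per_amp: int = 16) -> int:
    return (2 ** num_qubits) * bytes_per_amp

for q in (30, 34, 40):
    print(f"{q} qubits ~ {statevector_bytes(q) / 2**30:,.0f} GiB")
# 30 qubits ~ 16 GiB, 34 qubits ~ 256 GiB, 40 qubits ~ 16,384 GiB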

The NVIDIA cuQuantum library accelerates these workloads. cuQuantum is integrated with every leading quantum computing framework, so all quantum researchers can tap into simulation performance with no code changes.
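
As a small illustration of that integration surface (assuming the cuquantum-python and cupy packages are installed), the tensor network path can also be exercised directly through cuQuantum's einsum-style contract API:

# Contract a tiny tensor network on the GPU with cuQuantum. A real circuit
# simulation builds a much larger contraction; this only shows the API surface.
import cupy as cp
from cuquantum import contract

a = cp.random.rand(2, 2, 2)
b = cp.random.rand(2, 2, 2)
c = contract("ijk,kjl->il", a, b)  # contract over the shared indices j and k
print(c)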

Simulations of quantum algorithms are generally limited in scale by memory requirements. The GB200 and GH200 architectures provide an ideal platform for scaling up quantum simulations, as they enable large CPU memory to be used without bottlenecking performance. A GH200 system is up to 3x faster than an H100 system with x86 on quantum computing benchmarks.

Learn more about CUDA-X libraries, attend the GTC session on how math libraries can help accelerate applications on NVIDIA Blackwell GPUs and watch NVIDIA founder and CEO Jensen Huang’s GTC keynote.

Read More

NVIDIA Open-Sources cuOpt, Ushering in New Era of Decision Optimization

Every second, businesses worldwide are making critical decisions. A logistics company decides which trucks to send where. A retailer figures out how to stock its shelves. An airline scrambles to reroute flights after a storm. These aren’t just routing choices — they’re high-stakes puzzles with millions of variables, and getting them wrong costs money and, sometimes, customers.

That’s changing.

NVIDIA today announced it will open-source cuOpt, an AI-powered decision optimization engine — making the powerful software free for developers to unlock real-time optimization at an unprecedented scale.

Optimization ecosystem leaders COPT, the Xpress team at FICO, HiGHS, IBM and SimpleRose are integrating or evaluating cuOpt, accelerating decision-making across industries.

Gurobi Optimization is evaluating and testing cuOpt solvers to refine first-order algorithms for next-level performance.

NVIDIA is working with the COIN-OR Foundation to open-source cuOpt through COIN-OR, which is widely regarded as the oldest and largest repository of open-source operations research software.

Meanwhile, a team of researchers at Arizona State University, Cornell Tech, Princeton University, University of Pavia and Zuse Institute of Berlin are exploring its capabilities, developing next-generation solvers and tackling complex optimization problems with exceptional speed.

With the technology, airlines can reconfigure flight schedules mid-air to prevent cascading delays, power grids can rebalance in real time to avoid blackouts and financial institutions can manage portfolios with up-to-the-moment risk analysis.

Faster Optimization, Smarter Decisions

The best-known AI applications are all about predictions — whether forecasting weather or generating the next word in a sentence. But prediction is only half the challenge. The real power comes from acting on information in real time.

That’s where cuOpt comes in.

cuOpt dynamically evaluates billions of variables — inventory levels, factory output, shipping delays, fuel costs, risk factors and regulations — and delivers the best move in near real time.

As AI agents and large language model-driven simulations take on more decision-making tasks, the need for instant optimization has never been greater. cuOpt, powered by NVIDIA GPUs, accelerates these computations by orders of magnitude.

Unlike traditional optimization methods that navigate solution spaces sequentially or with limited parallelism, cuOpt taps into GPU acceleration to evaluate millions of possibilities simultaneously — finding optimal solutions exponentially faster for specific instances.

It doesn’t replace existing techniques — it enhances them. By working alongside traditional solvers, cuOpt rapidly identifies high-quality solutions, helping CPU-based models discard bad paths faster.

Why Optimization Is So Hard — and How cuOpt Does It Better

Every decision — where to send a truck, how to schedule workers and when to rebalance power grids — is a puzzle with an exponential number of possible answers.

To put this into perspective, the number of possible ways to schedule 100 nurses in a hospital for the next month is greater than the number of atoms in the observable universe.
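
A quick back-of-the-envelope calculation makes that claim tangible. The toy model below is our own illustrative assumption (each nurse receives one of four assignments per day over a 30-day month) rather than the exact scenario above, but it shows how quickly the search space blows past the roughly 10^80 atoms in the observable universe.

# Illustrative toy model: each of 100 nurses gets one of 4 assignments
# (day, evening, night or off) on each of 30 days.
nurses, days, options = 100, 30, 4
schedules = options ** (nurses * days)      # 4 ** 3000 possible schedules
digits = len(str(schedules)) - 1            # order of magnitude
print(f"~10^{digits} possible schedules")   # roughly 10^1806
print("atoms in the observable universe: ~10^80")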

Many traditional solvers search for solutions sequentially or with limited parallelism — like navigating a vast maze with a flashlight, one corridor at a time. cuOpt rewrites the rules by evaluating millions of possibilities intelligently, accelerating optimization exponentially.

For years, workforce scheduling, logistics routing and supply-chain planning all took hours — sometimes days — to compute.

NVIDIA cuOpt changes that — the numbers tell the story:

  • Linear programming acceleration: 70x faster on average than a CPU-based PDLP solver on large-scale benchmarks, with a 10x to 3,000x speedup range.
  • Mixed-integer programming (MIP): 60x faster MIP solves, as demonstrated by SimpleRose.
  • Vehicle routing: 240x speedup in dynamic routing, enabling cost-to-serve insights and near-real-time route adjustments, as demonstrated by Lyric.

Decisions that once took hours or days now take seconds.
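
To ground the linear programming numbers above, it helps to see what a (tiny) LP looks like in code. The snippet below solves a toy two-variable production-planning problem with SciPy’s CPU-based linprog; the coefficients are invented for illustration, and this shows only the problem class that GPU solvers such as cuOpt’s PDLP implementation target at vastly larger scale, not cuOpt’s own API.

from scipy.optimize import linprog

# Toy production plan: maximize profit 3x + 5y, written as minimizing -(3x + 5y),
# subject to limited machine-hours and material (illustrative numbers).
c = [-3, -5]                    # objective coefficients, negated for minimization
A_ub = [[1, 2],                 # machine-hours used per unit of x and y
        [3, 2]]                 # material used per unit of x and y
b_ub = [14, 18]                 # available machine-hours and material
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)          # optimal plan (x=2, y=6) and profit (36)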

Optimizing for a Better World

Better optimization doesn’t just make businesses more efficient — it makes the world more sustainable, resilient and equitable.

Smarter decision-making leads to less waste. Energy grids can distribute power more efficiently, reducing blackouts and seamlessly integrating renewables like wind and solar. Supply chains can adjust dynamically to minimize excess inventory, cutting both costs and emissions.

Hospitals in underserved regions can allocate beds, doctors and medicine in real time, helping lifesaving treatments reach patients faster. Humanitarian aid groups responding to disasters can instantly recalculate the best way to distribute food, water and medicine, reducing delays in critical moments. And public transit systems can adjust dynamically to demand, reducing congestion and travel times for millions of people.

cuOpt isn’t just about more hardware — it’s about smarter search. Instead of going through every possibility, cuOpt intelligently navigates massive search spaces, focusing on constraint edges to converge faster. By using GPU acceleration, it evaluates multiple solutions in parallel, delivering real-time, high-efficiency optimization.

Industry Support — a New Era for Decision Intelligence

Optimization leaders such as FICO, Gurobi Optimization, IBM and SimpleRose are among the companies exploring the benefits of GPU acceleration and evaluating cuOpt for integration into their workflows, in use cases spanning industrial planning, supply chain management and scheduling.

Smarter Decisions, Stronger Systems, Better Outcomes

cuOpt redefines optimization at scale.

For businesses, as described, it means AI-powered optimization can reconfigure schedules, route fleets and reallocate resources in real time — cutting costs and boosting agility.

For developers, it provides a high-performance AI toolkit that can solve decision problems up to 3,000x faster than CPU solvers in complex optimization challenges such as network data routing — optimizing the flow of video, voice, and web traffic to reduce congestion and improve efficiency — or electricity distribution, balancing supply and demand across power grids while minimizing losses and ensuring stable transmission.

For researchers, it’s an open playground for pushing AI-driven decision-making to new frontiers.

cuOpt will be released as open source and freely available for developers, researchers and enterprises later this year.

See cuOpt in Action

Explore real-world applications of cuOpt at these NVIDIA GTC sessions:

For enterprise production deployments, cuOpt is supported as part of the NVIDIA AI Enterprise software platform and can be deployed as an NVIDIA NIM microservice — making it easy to integrate, scale and deploy across cloud, on-premises and edge environments.

With its open-source release, developers will be able to easily access, modify and integrate the cuOpt source code into their own solutions.

Learn more about how companies are already transforming their operations with cuOpt and sign up to be notified when the open-source software is available.

See notice regarding software product information.

Read More

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

Build your gen AI–based text-to-SQL application using RAG, powered by Amazon Bedrock (Claude 3 Sonnet and Amazon Titan for embedding)

SQL is one of the key languages widely used across businesses, and it requires an understanding of databases and table metadata. This can be overwhelming for nontechnical users who lack proficiency in SQL. Today, generative AI can help bridge this knowledge gap for nontechnical users to generate SQL queries by using a text-to-SQL application. This application allows users to ask questions in natural language and then generates a SQL query for the user’s request.

Large language models (LLMs) are trained to generate accurate SQL queries from natural language instructions. However, off-the-shelf LLMs can’t be used without some modification. First, LLMs don’t have access to enterprise databases, so the models need to be customized to understand the specific database of an enterprise. Additionally, column synonyms and internal metrics add complexity that a general-purpose model can’t resolve on its own.

The limitation of LLMs in understanding enterprise datasets and human context can be addressed using Retrieval Augmented Generation (RAG). In this post, we explore using Amazon Bedrock to create a text-to-SQL application using RAG. We use Anthropic’s Claude 3.5 Sonnet model to generate SQL queries, Amazon Titan in Amazon Bedrock for text embedding and Amazon Bedrock to access these models.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

Solution overview

This solution is primarily based on the following services:

  1. Foundational model – We use Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock as our LLM to generate SQL queries for user inputs.
  2. Vector embeddings – We use Amazon Titan Text Embeddings v2 on Amazon Bedrock for embeddings. Embedding is the process by which text, images, and audio are given a numerical representation in a vector space, and it is usually performed by a machine learning (ML) model (a short code sketch after this list shows one way to generate an embedding). The following diagram provides more details about embeddings.
  3. RAG – We use RAG for providing more context about table schema, column synonyms, and sample queries to the FM. RAG is a framework for building generative AI applications that can make use of enterprise data sources and vector databases to overcome knowledge limitations. RAG works by using a retriever module to find relevant information from an external data store in response to a user’s prompt. This retrieved data is used as context, combined with the original prompt, to create an expanded prompt that is passed to the LLM. The language model then generates a SQL query that incorporates the enterprise knowledge. The following diagram illustrates the RAG framework.
  4. Streamlit – This open source Python library makes it straightforward to create and share beautiful, custom web apps for ML and data science. In just a few minutes you can build powerful data apps using only Python.
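
To make the embeddings step concrete, the following minimal sketch (our illustration, not one of the solution files) calls Amazon Titan Text Embeddings v2 directly through the Bedrock Runtime API. The Region, example sentence, and the 1,024-dimension default reflect common settings and may differ in your account.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str) -> list:
    """Return the Titan Text Embeddings v2 vector for a piece of text."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]  # list of floats placing the text in vector space

vector = embed_text("Count of orders cancelled by customer id: 978226")
print(len(vector))  # 1,024 dimensions by default for Titan v2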

The following diagram shows the solution architecture.


We need to update the LLM with enterprise-specific database information. This makes sure the model can correctly understand the database and generate responses tailored to the enterprise’s data schema and tables. There are multiple file formats available for storing this information, such as JSON, PDF, TXT, and YAML. In our case, we created JSON files to store the table schema, table descriptions, columns with synonyms, and sample queries. JSON’s inherently structured format allows for a clear and organized representation of complex data such as table schemas, column definitions, synonyms, and sample queries. This structure facilitates quick parsing and manipulation of data in most programming languages, reducing the need for custom parsing logic.

There can be multiple tables with similar information, which can lower the model’s accuracy. To increase accuracy, we categorized the tables into four different types based on the schema and created four JSON files to store the different tables. We added a dropdown menu with four choices, each representing one of these categories and linked to its own JSON file. After the user selects a value from the dropdown menu, the relevant JSON file is passed to Amazon Titan Text Embeddings v2, which converts the text into embeddings. These embeddings are stored in a vector database for faster retrieval.

We added the prompt template to the FM to define the roles and responsibilities of the model. You can add additional information such as which SQL engine should be used to generate the SQL queries.

When the user provides input through the chat prompt, we use similarity search to find the relevant table metadata from the vector database for the user’s query. The user input, the relevant table metadata, and the prompt template are then combined and passed to the FM as a single input. The FM generates the SQL query based on this final input.

To evaluate the model’s accuracy and keep a record for later review, we store every user input and output in Amazon Simple Storage Service (Amazon S3).

Prerequisites

To create this solution, complete the following prerequisites:

  1. Sign up for an AWS account if you don’t already have one.
  2. Enable model access for Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet on Amazon Bedrock.
  3. Create an S3 bucket named simplesql-logs-****, replacing **** with your unique identifier. Bucket names must be globally unique across Amazon S3.
  4. Choose your testing environment. We recommend that you test in Amazon SageMaker Studio, although you can use other available local environments.
  5. Install the following libraries to execute the code:
    pip install streamlit
    pip install jq
    pip install openpyxl
    pip install "faiss-cpu"
    pip install langchain

Procedure

There are three main components in this solution:

  1. JSON files store the table schema and configure the LLM
  2. Vector indexing using Amazon Bedrock
  3. Streamlit for the front-end UI

You can download all three components and code snippets provided in the following section.

Generate the table schema

We use the JSON format to store the table schema. To provide more inputs to the model, we added a table name and its description, columns and their synonyms, and sample queries in our JSON files. Create a JSON file as Table_Schema_A.json by copying the following code into it:

{
  "tables": [
    {
      "separator": "table_1",
      "name": "schema_a.orders",
      "schema": "CREATE TABLE schema_a.orders (order_id character varying(200), order_date timestamp without time zone, customer_id numeric(38,0), order_status character varying(200), item_id character varying(200) );",
      "description": "This table stores information about orders placed by customers.",
      "columns": [
        {
          "name": "order_id",
          "description": "unique identifier for orders.",
          "synonyms": ["order id"]
        },
        {
          "name": "order_date",
          "description": "timestamp when the order was placed",
          "synonyms": ["order time", "order day"]
        },
        {
          "name": "customer_id",
          "description": "Id of the customer associated with the order",
          "synonyms": ["customer id", "userid"]
        },
        {
          "name": "order_status",
          "description": "current status of the order, sample values are: shipped, delivered, cancelled",
          "synonyms": ["order status"]
        },
        {
          "name": "item_id",
          "description": "item associated with the order",
          "synonyms": ["item id"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(order_id) as total_orders from schema_a.orders where customer_id = '9782226' and order_status = 'cancelled'",
          "user_input": "Count of orders cancelled by customer id: 978226"
        }
      ]
    },
    {
      "separator": "table_2",
      "name": "schema_a.customers",
      "schema": "CREATE TABLE schema_a.customers (customer_id numeric(38,0), customer_name character varying(200), registration_date timestamp without time zone, country character varying(200) );",
      "description": "This table stores the details of customers.",
      "columns": [
        {
          "name": "customer_id",
          "description": "Id of the customer, unique identifier for customers",
          "synonyms": ["customer id"]
        },
        {
          "name": "customer_name",
          "description": "name of the customer",
          "synonyms": ["name"]
        },
        {
          "name": "registration_date",
          "description": "registration timestamp when customer registered",
          "synonyms": ["sign up time", "registration time"]
        },
        {
          "name": "country",
          "description": "customer's original country",
          "synonyms": ["location", "customer's region"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(customer_id) as total_customers from schema_a.customers where country = 'India' and to_char(registration_date, 'YYYY') = '2024'",
          "user_input": "The number of customers registered from India in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id where c.customer_name = 'john' and to_char(o.order_date, 'YYYY-MM') = '2024-01'",
          "user_input": "Total orders placed in January 2024 by customer name john"
        }
      ]
    },
    {
      "separator": "table_3",
      "name": "schema_a.items",
      "schema": "CREATE TABLE schema_a.items (item_id character varying(200), item_name character varying(200), listing_date timestamp without time zone );",
      "description": "This table stores the complete details of items listed in the catalog.",
      "columns": [
        {
          "name": "item_id",
          "description": "Id of the item, unique identifier for items",
          "synonyms": ["item id"]
        },
        {
          "name": "item_name",
          "description": "name of the item",
          "synonyms": ["name"]
        },
        {
          "name": "listing_date",
          "description": "listing timestamp when the item was registered",
          "synonyms": ["listing time", "registration time"]
        }
      ],
      "sample_queries": [
        {
          "query": "select count(item_id) as total_items from schema_a.items where to_char(listing_date, 'YYYY') = '2024'",
          "user_input": "how many items are listed in 2024"
        },
        {
          "query": "select count(o.order_id) as order_count from schema_a.orders o join schema_a.customers c on o.customer_id = c.customer_id join schema_a.items i on o.item_id = i.item_id where c.customer_name = 'john' and i.item_name = 'iphone'",
          "user_input": "how many orders are placed for item 'iphone' by customer name john"
        }
      ]
    }
  ]
}

Configure the LLM and initialize vector indexing using Amazon Bedrock

Create a Python file as library.py by following these steps:

  1. Add the following import statements to add the necessary libraries:
    import boto3  # AWS SDK for Python
    from langchain_community.document_loaders import JSONLoader  # Utility to load JSON files
    from langchain.llms import Bedrock  # Bedrock LLM wrapper (kept for reference; BedrockChat is used below)
    from langchain_community.chat_models import BedrockChat  # Chat interface for Bedrock LLM
    from langchain.embeddings import BedrockEmbeddings  # Embeddings for Titan model
    from langchain.memory import ConversationBufferWindowMemory  # Memory to store chat conversations
    from langchain.indexes import VectorstoreIndexCreator  # Create vector indexes
    from langchain.vectorstores import FAISS  # Vector store using FAISS library
    from langchain.text_splitter import RecursiveCharacterTextSplitter  # Split text into chunks
    from langchain.chains import ConversationalRetrievalChain  # Conversational retrieval chain
    from langchain.callbacks.manager import CallbackManager

  2. Initialize the Amazon Bedrock client and configure Anthropic’s Claude 3.5 Sonnet. You can limit the number of output tokens to optimize cost:
    # Create a Boto3 client for Bedrock Runtime
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    
    # Function to get the LLM (Large Language Model)
    def get_llm():
        model_kwargs = {  # Configuration for Anthropic model
            "max_tokens": 512,  # Maximum number of tokens to generate
            "temperature": 0.2,  # Sampling temperature for controlling randomness
            "top_k": 250,  # Consider the top k tokens for sampling
            "top_p": 1,  # Consider the top p probability tokens for sampling
            "stop_sequences": ["nnHuman:"]  # Stop sequence for generation
        }
        # Create a callback manager with a default callback handler
        callback_manager = CallbackManager([])
        
        llm = BedrockChat(
            client=bedrock_runtime,  # Reuse the Bedrock Runtime client created above
            model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",  # Set the foundation model
            model_kwargs=model_kwargs,  # Pass the configuration to the model
            callback_manager=callback_manager
        )
    
        return llm

  3. Create and return an index for the given schema type. This approach is an efficient way to filter tables and provide relevant input to the model:
    # Function to load the schema file based on the schema type
    def load_schema_file(schema_type):
        if schema_type == 'Schema_Type_A':
            schema_file = "Table_Schema_A.json"  # Path to Schema Type A
        elif schema_type == 'Schema_Type_B':
            schema_file = "Table_Schema_B.json"  # Path to Schema Type B
        elif schema_type == 'Schema_Type_C':
            schema_file = "Table_Schema_C.json"  # Path to Schema Type C
        return schema_file
    
    # Function to get the vector index for the given schema type
    def get_index(schema_type):
        embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0",
                                       client=bedrock_runtime)  # Initialize embeddings
    
        db_schema_loader = JSONLoader(
            file_path=load_schema_file(schema_type),  # Load the schema file
            # file_path="Table_Schema_RP.json",  # Uncomment to use a different file
            jq_schema='.',  # Select the entire JSON content
            text_content=False)  # Treat the content as text
    
        db_schema_text_splitter = RecursiveCharacterTextSplitter(  # Create a text splitter
            separators=["separator"],  # Split chunks at the "separator" string
            chunk_size=10000,  # Divide into 10,000-character chunks
            chunk_overlap=100  # Allow 100 characters to overlap with previous chunk
        )
    
        db_schema_index_creator = VectorstoreIndexCreator(
            vectorstore_cls=FAISS,  # Use FAISS vector store
            embedding=embeddings,  # Use the initialized embeddings
            text_splitter=db_schema_text_splitter  # Use the text splitter
        )
    
        db_index_from_loader = db_schema_index_creator.from_loaders([db_schema_loader])  # Create index from loader
    
        return db_index_from_loader

  4. Use the following function to create and return memory for the chat session:
    # Function to get the memory for storing chat conversations
    def get_memory():
        memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)  # Create memory
    
        return memory

  5. Use the following prompt template to generate SQL queries based on user input:
    # Template for the question prompt
    template = """ Read table information from the context. Each table contains the following information:
    - Name: The name of the table
    - Description: A brief description of the table
    - Columns: The columns of the table, listed under the 'columns' key. Each column contains:
      - Name: The name of the column
      - Description: A brief description of the column
      - Type: The data type of the column
      - Synonyms: Optional synonyms for the column name
    - Sample Queries: Optional sample queries for the table, listed under the 'sample_queries' key
    
    Given this structure, your task is to provide the SQL query using Amazon Redshift syntax that would retrieve the data for the following question. The produced query should be functional, efficient, and adhere to best practices in SQL query optimization.
    
    Question: {}
    """

  6. Use the following function to get a response from the RAG chat model:
    # Function to get the response from the conversational retrieval chain
    def get_rag_chat_response(input_text, memory, index):
        llm = get_llm()  # Get the LLM
    
        conversation_with_retrieval = ConversationalRetrievalChain.from_llm(
            llm, index.vectorstore.as_retriever(), memory=memory, verbose=True)  # Create conversational retrieval chain
    
        chat_response = conversation_with_retrieval.invoke({"question": template.format(input_text)})  # Invoke the chain
    
        return chat_response['answer']  # Return the answer
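
Before wiring these functions into Streamlit, you can exercise library.py from a plain Python script. This minimal usage sketch assumes library.py and Table_Schema_A.json sit in the same folder and that Bedrock model access has been enabled as described in the prerequisites.

import library as lib

# Build the vector index for one schema category and create fresh chat memory
index = lib.get_index("Schema_Type_A")
memory = lib.get_memory()

# Ask a natural-language question and print the generated SQL
question = "Count of orders cancelled by customer id: 978226"
sql = lib.get_rag_chat_response(input_text=question, memory=memory, index=index)
print(sql)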

Configure Streamlit for the front-end UI

Create the file app.py by following these steps:

  1. Import the necessary libraries:
    import streamlit as st
    import library as lib
    from io import StringIO
    import boto3
    from datetime import datetime
    import csv
    import pandas as pd
    from io import BytesIO

  2. Initialize the S3 client:
    s3_client = boto3.client('s3')
    bucket_name = 'simplesql-logs-****'
    # Replace 'simplesql-logs-****' with your S3 bucket name
    log_file_key = 'logs.xlsx'

  3. Configure Streamlit for UI:
    st.set_page_config(page_title="Your App Name")
    st.title("Your App Name")
    
    # Define the available menu items for the sidebar
    menu_items = ["Home", "How To", "Generate SQL Query"]
    
    # Create a sidebar menu using radio buttons
    selected_menu_item = st.sidebar.radio("Menu", menu_items)
    
    # Home page content
    if selected_menu_item == "Home":
        # Display introductory information about the application
        st.write("This application allows you to generate SQL queries from natural language input.")
        st.write("")
        st.write("**Get Started** by selecting the button Generate SQL Query !")
        st.write("")
        st.write("")
        st.write("**Disclaimer :**")
        st.write("- Model's response depends on user's input (prompt). Please visit How-to section for writing efficient prompts.")
               
    # How-to page content
    elif selected_menu_item == "How To":
        # Provide guidance on how to use the application effectively
        st.write("The model's output completely depends on the natural language input. Below are some examples which you can keep in mind while asking the questions.")
        st.write("")
        st.write("")
        st.write("")
        st.write("")
        st.write("**Case 1 :**")
        st.write("- **Bad Input :** Cancelled orders")
        st.write("- **Good Input :** Write a query to extract the cancelled order count for the items which were listed this year")
        st.write("- It is always recommended to add required attributes, filters in your prompt.")
        st.write("**Case 2 :**")
        st.write("- **Bad Input :** I am working on XYZ project. I am creating a new metric and need the sales data. Can you provide me the sales at country level for 2023 ?")
        st.write("- **Good Input :** Write an query to extract sales at country level for orders placed in 2023 ")
        st.write("- Every input is processed as tokens. Do not provide un-necessary details as there is a cost associated with every token processed. Provide inputs only relevant to your query requirement.") 

  4. Generate the query:
    # SQL-AI page content
    elif selected_menu_item == "Generate SQL Query":
        # Define the available schema types for selection
        schema_types = ["Schema_Type_A", "Schema_Type_B", "Schema_Type_C"]
        schema_type = st.sidebar.selectbox("Select Schema Type", schema_types)

  5. Use the following for SQL generation:
    if schema_type:
            # Initialize or retrieve conversation memory from session state
            if 'memory' not in st.session_state:
                st.session_state.memory = lib.get_memory()
    
            # Initialize or retrieve chat history from session state
            if 'chat_history' not in st.session_state:
                st.session_state.chat_history = []
    
            # Initialize or update vector index based on selected schema type
            if 'vector_index' not in st.session_state or 'current_schema' not in st.session_state or st.session_state.current_schema != schema_type:
                with st.spinner("Indexing document..."):
                    # Create a new index for the selected schema type
                    st.session_state.vector_index = lib.get_index(schema_type)
                    # Update the current schema in session state
                    st.session_state.current_schema = schema_type
    
            # Display the chat history
            for message in st.session_state.chat_history:
                with st.chat_message(message["role"]):
                    st.markdown(message["text"])
    
            # Get user input through the chat interface, set the max limit to control the input tokens.
            input_text = st.chat_input("Chat with your bot here", max_chars=100)
            
            if input_text:
                # Display user input in the chat interface
                with st.chat_message("user"):
                    st.markdown(input_text)
    
                # Add user input to the chat history
                st.session_state.chat_history.append({"role": "user", "text": input_text})
    
                # Generate chatbot response using the RAG model
                chat_response = lib.get_rag_chat_response(
                    input_text=input_text, 
                    memory=st.session_state.memory,
                    index=st.session_state.vector_index
                )
                
                # Display chatbot response in the chat interface
                with st.chat_message("assistant"):
                    st.markdown(chat_response)
    
                # Add chatbot response to the chat history
                st.session_state.chat_history.append({"role": "assistant", "text": chat_response})

  6. Log the conversations to the S3 bucket:
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # Runs inside the if input_text: block from the previous step
    
                try:
                    # Attempt to download the existing log file from S3
                    log_file_obj = s3_client.get_object(Bucket=bucket_name, Key=log_file_key)
                    log_file_content = log_file_obj['Body'].read()
                    df = pd.read_excel(BytesIO(log_file_content))
    
                except s3_client.exceptions.NoSuchKey:
                    # If the log file doesn't exist, create a new DataFrame
                    df = pd.DataFrame(columns=["User Input", "Model Output", "Timestamp", "Schema Type"])
    
                # Create a new row with the current conversation data
                new_row = pd.DataFrame({
                    "User Input": [input_text], 
                    "Model Output": [chat_response], 
                    "Timestamp": [timestamp],
                    "Schema Type": [schema_type]
                })
                # Append the new row to the existing DataFrame
                df = pd.concat([df, new_row], ignore_index=True)
                
                # Prepare the updated DataFrame for S3 upload
                output = BytesIO()
                df.to_excel(output, index=False)
                output.seek(0)
                
                # Upload the updated log file to S3
                s3_client.put_object(Body=output.getvalue(), Bucket=bucket_name, Key=log_file_key)
    

Test the solution

Open your terminal and invoke the following command to run the Streamlit application.

streamlit run app.py

To open the application in your browser, navigate to http://localhost:8501 (the default Streamlit port).

To open the application from SageMaker Studio, copy your notebook URL and replace default/lab in the URL with default/proxy/8501/. It should look something like the following:

https://your_sagemaker_lab_url.studio.us-east-1.sagemaker.aws/jupyterlab/default/proxy/8501/

Choose Generate SQL Query to open the chat window. Test your application by asking questions in natural language. We tested the application with the following questions, and it generated accurate SQL queries.

Count of orders placed from India last month?
Write a query to extract the canceled order count for the items that were listed this year.
Write a query to extract the top 10 item names having highest order for each country.

Troubleshooting tips

Use the following solutions to address errors:

Error – An error occurred (AccessDeniedException) when calling the InvokeModel operation: you don’t have access to the model with the specified model ID.
Solution – Make sure you have enabled access in Amazon Bedrock to both Amazon Titan Text Embeddings v2 and Anthropic’s Claude 3.5 Sonnet.

Error – app.py does not exist
Solution – Make sure your JSON file and Python files are in the same folder and you’re invoking the command in the same folder.

Error – No module named streamlit
Solution – Open the terminal and install the streamlit module by running the command pip install streamlit

Error – An error occurred (NoSuchBucket) when calling the GetObject operation. The specified bucket doesn’t exist.
Solution – Verify your bucket name in the app.py file and update the name based on your S3 bucket name.

Clean up

Clean up the resources you created to avoid incurring charges. To clean up your S3 bucket, refer to Emptying a bucket.

Conclusion

In this post, we showed how Amazon Bedrock can be used to create a text-to-SQL application based on enterprise-specific datasets. We used Amazon S3 to store the outputs generated by the model for corresponding inputs. These logs can be used to test the accuracy and enhance the context by providing more details in the knowledge base. With the aid of a tool like this, you can create automated solutions that are accessible to nontechnical users, empowering them to interact with data more efficiently.

Ready to get started with Amazon Bedrock? Start learning with these interactive workshops.

For more information on SQL generation, refer to these posts:

We recently launched a managed NL2SQL module to retrieve structured data in Amazon Bedrock Knowledge Bases. To learn more, visit Amazon Bedrock Knowledge Bases now supports structured data retrieval.


About the Author

Rajendra Choudhary is a Sr. Business Analyst at Amazon. With 7 years of experience in developing data solutions, he possesses profound expertise in data visualization, data modeling, and data engineering. He is passionate about supporting customers by leveraging generative AI–based solutions. Outside of work, Rajendra is an avid foodie and music enthusiast, and he enjoys swimming and hiking.

Read More

Unleash AI innovation with Amazon SageMaker HyperPod

Unleash AI innovation with Amazon SageMaker HyperPod

The rise of generative AI has significantly increased the complexity of building, training, and deploying machine learning (ML) models. It now demands deep expertise, access to vast datasets, and the management of extensive compute clusters. Customers also face the challenges of writing specialized code for distributed training, continuously optimizing models, addressing hardware issues, and keeping projects on track and within budget. To simplify this process, AWS introduced Amazon SageMaker HyperPod during AWS re:Invent 2023, and it has emerged as a pioneering solution, revolutionizing how companies approach AI development and deployment.

As Amazon CEO Andy Jassy recently shared, “One of the most exciting innovations we’ve introduced is SageMaker HyperPod. HyperPod accelerates the training of machine learning models by distributing and parallelizing workloads across numerous powerful processors like AWS’s Trainium chips or GPUs. HyperPod also constantly monitors your infrastructure for problems, automatically repairing them when detected. During repair, your work is automatically saved, ensuring seamless resumption. This innovation is widely adopted, with most SageMaker AI customers relying on HyperPod for their demanding training needs.”

In this post, we show how SageMaker HyperPod, and its new features introduced at AWS re:Invent 2024, is designed to meet the demands of modern AI workloads, offering a persistent and optimized cluster tailored for distributed training and accelerated inference at cloud scale and attractive price-performance.

Customers using SageMaker HyperPod

Leading startups like Writer, Luma AI, and Perplexity, as well as major enterprises such as Thomson Reuters and Salesforce, are accelerating model development with SageMaker HyperPod. Amazon itself used SageMaker HyperPod to train its new Amazon Nova models, significantly reducing training costs, enhancing infrastructure performance, and saving months of manual effort that would have otherwise been spent on cluster setup and end-to-end process management.

Today, more organizations are eager to fine-tune popular publicly available models or train their own specialized models to revolutionize their businesses and applications with generative AI. To support this demand, SageMaker HyperPod continues to evolve, introducing new innovations that make it straightforward, faster, and more cost-effective for customers to build, train, and deploy these models at scale.

Deep infrastructure control

SageMaker HyperPod offers persistent clusters with deep infrastructure control, enabling builders to securely connect using SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances for advanced model training, infrastructure management, and debugging. To maximize availability, HyperPod maintains a pool of dedicated and spare instances (at no additional cost), minimizing downtime during critical node replacements.

You can use familiar orchestration tools such as Slurm or Amazon Elastic Kubernetes Service (Amazon EKS), along with the libraries built on these tools, to enable flexible job scheduling and compute sharing. Integrating SageMaker HyperPod clusters with Slurm also allows the use of NVIDIA’s Enroot and Pyxis for efficient container scheduling in performant, unprivileged sandboxes. The underlying operating system and software stack are based on the Deep Learning AMI, preconfigured with NVIDIA CUDA, NVIDIA cuDNN, and the latest versions of PyTorch and TensorFlow. SageMaker HyperPod also is integrated with Amazon SageMaker AI distributed training libraries, optimized for AWS infrastructure, enabling automatic workload distribution across thousands of accelerators for efficient parallel training.

Builders can use built-in ML tools within SageMaker HyperPod to enhance model performance. For example, Amazon SageMaker with TensorBoard helps visualize model architecture and address convergence issues, as shown in the following screenshot. Integration with observability tools like Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana offers deeper insights into cluster performance, health, and utilization, streamlining development time.

SageMaker HyperPod allows you to implement custom libraries and frameworks, enabling the service to be tailored to specific AI project needs. This level of personalization is essential in the rapidly evolving AI landscape, where innovation often requires experimenting with cutting-edge techniques and technologies. The adaptability of SageMaker HyperPod means that businesses are not constrained by infrastructure limitations, fostering creativity and technological advancement.

Intelligent resource management

As organizations increasingly provision large amounts of accelerated compute capacity for model training, they face challenges in effectively governing resource usage. These compute resources are both expensive and finite, making it crucial to prioritize critical model development tasks and avoid waste or underutilization. Without proper controls over task prioritization and resource allocation, some projects stall due to insufficient resources, while others leave resources underused. This creates a significant burden for administrators, who must constantly reallocate resources, and for data scientists, who struggle to maintain progress. These inefficiencies delay AI innovation and drive up costs.

SageMaker HyperPod addresses these challenges with its task governance capabilities, enabling you to maximize accelerator utilization for model training, fine-tuning, and inference. With just a few clicks, you can define task priorities and set limits on compute resource usage for teams. Once configured, SageMaker HyperPod automatically manages the task queue, making sure the most critical work receives the necessary resources. This reduction in operational overhead allows organizations to reallocate valuable human resources toward more innovative and strategic initiatives, and it can reduce model development costs by up to 40%.

For instance, if an inference task powering a customer-facing service requires urgent compute capacity but all resources are currently in use, SageMaker HyperPod reallocates underutilized or non-urgent resources to prioritize the critical task. Non-urgent tasks are automatically paused, checkpoints are saved to preserve progress, and these tasks resume seamlessly when resources become available. This makes sure you maximize your compute investments without compromising ongoing work.

As a fast-growing generative AI startup, Articul8 AI constantly optimizes their compute environment to allocate accelerated compute resources as efficiently as possible. With automated task prioritization and resource allocation in SageMaker HyperPod, they have seen a dramatic improvement in GPU utilization, reducing idle time and accelerating their model development process by optimizing tasks ranging from training and fine-tuning to inference. The ability to automatically shift resources to high-priority tasks has increased their team’s productivity, allowing them to bring new generative AI innovations to market faster than ever before.

At its core, SageMaker HyperPod represents a paradigm shift in AI infrastructure, moving beyond the traditional emphasis on raw computational power to focus on intelligent and adaptive resource management. By prioritizing optimized resource allocation, SageMaker HyperPod minimizes waste, maximizes efficiency, and accelerates innovation—all while reducing costs. This makes AI development more accessible and scalable for organizations of all sizes.

Get started faster with SageMaker HyperPod recipes

Many customers want to customize popular publicly available models, like Meta’s Llama and Mistral, for their specific use cases using their organization’s data. However, optimizing training performance often requires weeks of iterative testing—experimenting with algorithms, fine-tuning parameters, monitoring training impact, debugging issues, and benchmarking performance.

To simplify this process, SageMaker HyperPod now offers over 30 curated model training recipes for some of today’s most popular models, including DeepSeek R1, DeepSeek R1 Distill Llama, DeepSeek R1 Distill Qwen, Llama, Mistral, and Mixtral. These recipes enable you to get started in minutes by automating key steps like loading training datasets, applying distributed training techniques, and configuring systems for checkpointing and recovery from infrastructure failures. This empowers users of all skill levels to achieve better price-performance for model training on AWS infrastructure from the outset, eliminating weeks of manual evaluation and testing.
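
The recipes handle checkpointing and recovery for you; as a rough illustration of the underlying idea, the PyTorch sketch below periodically saves a checkpoint to shared storage and resumes from it after a restart, which is what lets a job continue after a faulty node is replaced. The path, model, and save interval are illustrative assumptions, not values taken from the recipes.

import os
import torch
import torch.nn as nn

CKPT_PATH = "/fsx/checkpoints/latest.pt"   # hypothetical shared-filesystem path

model = nn.Linear(1024, 1024)              # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
start_step = 0

# Resume from the last checkpoint if one exists (for example, after node replacement)
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(32, 1024)).pow(2).mean()   # dummy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0:   # periodic checkpoint so little work is lost on failure
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)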

You can browse the GitHub repo to explore available training recipes, customize parameters to fit your needs, and deploy in minutes. With a simple one-line change, you can seamlessly switch between GPU or AWS Trainium based instances to further optimize price-performance.

Researchers at Salesforce were looking for ways to quickly get started with foundation model (FM) training and fine-tuning, without having to worry about the infrastructure, or spend weeks optimizing their training stack for each new model. With SageMaker HyperPod recipes, researchers at Salesforce can conduct rapid prototyping when customizing FMs. Now, Salesforce’s AI Research teams are able to get started in minutes with a variety of pre-training and fine-tuning recipes, and can operationalize frontier models with high performance.

Integrating Kubernetes with SageMaker HyperPod

Though the standalone capabilities of SageMaker HyperPod are impressive, its integration with Amazon EKS takes AI workloads to new levels of power and flexibility. Amazon EKS simplifies the deployment, scaling, and management of containerized applications, making it an ideal solution for orchestrating complex AI/ML infrastructure.

By running SageMaker HyperPod on Amazon EKS, organizations can use Kubernetes’s advanced scheduling and orchestration features to dynamically provision and manage compute resources for AI/ML workloads, providing optimal resource utilization and scalability.

“We were able to meet our large language model training requirements using Amazon SageMaker HyperPod,” says John Duprey, Distinguished Engineer, Thomson Reuters Labs. “Using Amazon EKS on SageMaker HyperPod, we were able to scale up capacity and easily run training jobs, enabling us to unlock benefits of LLMs in areas such as legal summarization and classification.”

This integration also enhances fault tolerance and high availability. With self-healing capabilities, HyperPod automatically replaces failed nodes, maintaining workload continuity. Automated GPU health monitoring and seamless node replacement provide reliable execution of AI/ML workloads with minimal downtime, even during hardware failures.

Additionally, running SageMaker HyperPod on Amazon EKS enables efficient resource isolation and sharing using Kubernetes namespaces and resource quotas. Organizations can isolate different AI/ML workloads or teams while maximizing resource utilization across the cluster.
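
For teams managing the cluster through Amazon EKS, that isolation is expressed with ordinary Kubernetes objects. The sketch below uses the Kubernetes Python client to create a team namespace with a GPU resource quota; the namespace name and quota values are illustrative assumptions, and many organizations would apply the equivalent YAML through their existing GitOps tooling instead.

from kubernetes import client, config

config.load_kube_config()            # use the cluster credentials from kubeconfig
core = client.CoreV1Api()

# Dedicated namespace for one ML team (illustrative name)
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="ml-team-a"))
)

# Cap the GPUs and pods the team can request across the shared cluster
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="ml-team-a-quota"),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.nvidia.com/gpu": "8",   # at most 8 GPUs requested at once
        "pods": "50",
    }),
)
core.create_namespaced_resource_quota(namespace="ml-team-a", body=quota)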

Flexible training plans help meet timelines and budgets

Although infrastructure innovations help reduce costs and improve training efficiency, customers still face challenges in planning and managing the compute capacity needed to complete training tasks on time and within budget. To address this, AWS is introducing flexible training plans for SageMaker HyperPod.

With just a few clicks, you can specify your desired completion date and the maximum amount of compute resources needed. SageMaker HyperPod then helps acquire capacity and sets up clusters, saving teams weeks of preparation time. This eliminates much of the uncertainty customers encounter when acquiring large compute clusters for model development tasks.


SageMaker HyperPod training plans are now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and SageMaker pricing page.

Hippocratic AI is an AI company that develops the first safety-focused large language model (LLM) for healthcare. To train its primary LLM and the supervisor models, Hippocratic AI required powerful compute resources, which were in high demand and difficult to obtain. SageMaker HyperPod flexible training plans made it straightforward for them to gain access to EC2 P5 instances.

Developers and data scientists at OpenBabylon, an AI company that customizes LLMs for underrepresented languages, have been using SageMaker HyperPod flexible training plans for a few months to streamline their access to GPU resources and run large-scale experiments. Using the multi-node SageMaker HyperPod distributed training capabilities, they conducted 100 large-scale model training experiments, achieving state-of-the-art results in English-to-Ukrainian translation. This breakthrough was achieved on time and cost-effectively, demonstrating the ability of SageMaker HyperPod to deliver complex projects on time and within budget.

Integrating training and inference infrastructures

A key focus area is integrating next-generation AI accelerators like the anticipated AWS Trainium2 release. These advanced accelerators promise unparalleled computational performance, offering 30–40% better price-performance than the current generation of GPU-based EC2 instances, significantly boosting AI model training and deployment efficiency and speed. This will be crucial for real-time applications and processing vast datasets simultaneously. The seamless accelerator integration with SageMaker HyperPod enables businesses to harness cutting-edge hardware advancements, driving AI initiatives forward.

Another pivotal aspect is that SageMaker HyperPod, through its integration with Amazon EKS, enables scalable inference solutions. As real-time data processing and decision-making demands grow, the SageMaker HyperPod architecture efficiently handles these requirements. This capability is essential across sectors like healthcare, finance, and autonomous systems, where timely, accurate AI inferences are critical. Offering scalable inference enables deploying high-performance AI models under varying workloads, enhancing operational effectiveness.

Moreover, integrating training and inference infrastructures represents a significant advancement, streamlining the AI lifecycle from development to deployment and providing optimal resource utilization throughout. Bridging this gap facilitates a cohesive, efficient workflow, reducing transition complexities from development to real-world applications. This holistic integration supports continuous learning and adaptation, which is key for next-generation, self-evolving AI models (continuously learning models, which possess the ability to adapt and refine themselves in real time based on their interactions with the environment).

SageMaker HyperPod uses established open source technologies, including MLflow integration through SageMaker, container orchestration through Amazon EKS, and Slurm workload management, providing users with familiar and proven tools for their ML workflows. By engaging the global AI community and encouraging knowledge sharing, SageMaker HyperPod continuously evolves, incorporating the latest research advancements. This collaborative approach helps SageMaker HyperPod remain at the forefront of AI technology, providing the tools to drive transformative change.

Conclusion

SageMaker HyperPod represents a fundamental change in AI infrastructure, offering a future-fit solution that empowers organizations to unlock the full potential of AI technologies. With its intelligent resource management, versatility, scalability, and forward-thinking design, SageMaker HyperPod enables businesses to accelerate innovation, reduce operational costs, and stay ahead of the curve in the rapidly evolving AI landscape.

Whether it’s optimizing the training of LLMs, processing complex datasets for medical imaging inference, or exploring novel AI architectures, SageMaker HyperPod provides a robust and flexible foundation for organizations to push the boundaries of what is possible in AI.

As AI continues to reshape industries and redefine what is possible, SageMaker HyperPod stands at the forefront, enabling organizations to navigate the complexities of AI workloads with unparalleled agility, efficiency, and innovation. With its commitment to continuous improvement, strategic partnerships, and alignment with emerging technologies, SageMaker HyperPod is poised to play a pivotal role in shaping the future of AI, empowering organizations to unlock new realms of possibility and drive transformative change.

Take the first step towards revolutionizing your AI initiatives by scheduling a consultation with our experts. Let us guide you through the process of harnessing the power of SageMaker HyperPod and unlock a world of possibilities for your business.


About the authors

Ilan Gleiser is a Principal GenAI Specialist at AWS WWSO Frameworks team focusing on developing scalable Artificial General Intelligence architectures and optimizing foundation model training and inference. With a rich background in AI and machine learning, Ilan has published over 20 blogs and delivered 100+ prototypes globally over the last 5 years. Ilan holds a Master’s degree in mathematical economics.

Trevor Harvey is a Principal Specialist in Generative AI at Amazon Web Services and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.

Shubha Kumbadakone is a Sr. Mgr on the AWS WWSO Frameworks team focusing on Foundation Model Builders and self-managed machine learning with a focus on open-source software and tools. She has more than 19 years of experience in cloud infrastructure and machine learning and is helping customers build their distributed training and inference at scale for their ML models on AWS. She also holds a patent on a caching algorithm for rapid resume from hibernation for mobile systems.

Matt Nightingale is a Solutions Architect Manager on the AWS WWSO Frameworks team focusing on Generative AI Training and Inference. Matt specializes in distributed training architectures with a focus on hardware performance and reliability. Matt holds a bachelors degree from University of Virginia and is based in Boston, Massachusetts.

Read More

Revolutionizing clinical trials with the power of voice and AI

Revolutionizing clinical trials with the power of voice and AI

In the rapidly evolving healthcare landscape, patients often find themselves navigating a maze of complex medical information, seeking answers to their questions and concerns. However, accessing accurate and comprehensible information can be a daunting task, leading to confusion and frustration. This is where the integration of cutting-edge technologies, such as audio-to-text translation and large language models (LLMs), holds the potential to revolutionize the way patients receive, process, and act on vital medical information.

As the healthcare industry continues to embrace digital transformation, solutions that combine advanced technologies like audio-to-text translation and LLMs will become increasingly valuable in addressing key challenges, such as patient education, engagement, and empowerment. By taking advantage of these innovative technologies, healthcare providers can deliver more personalized, efficient, and effective care, ultimately improving patient outcomes and driving progress in the life sciences domain.

For instance, envision a voice-enabled virtual assistant that not only understands your spoken queries, but also transcribes them into text with remarkable accuracy. This transcription then serves as the input for a powerful LLM, which draws upon its vast knowledge base to provide personalized, context-aware responses tailored to your specific situation. This solution can transform the patient education experience, empowering individuals to make informed decisions about their healthcare journey.

In this post, we discuss possible use cases for combining speech recognition technology with LLMs, and how the solution can revolutionize clinical trials.

By combining speech recognition technology with LLMs, the solution can accurately transcribe a patient’s spoken queries into text, enabling the LLM to understand and analyze the context of the question. The LLM can then use its extensive knowledge base, which can be regularly updated with the latest medical research and clinical trial data, to provide relevant and trustworthy responses tailored to the patient’s specific situation.

Some of the potential benefits of this integrated approach are that patients can receive instant access to reliable information, empowering them to make more informed decisions about their healthcare. Additionally, the solution can help alleviate the burden on healthcare professionals by providing patients with a convenient and accessible source of information, freeing up valuable time for more critical tasks. Furthermore, the voice-enabled interface can enhance accessibility for patients with disabilities or those who prefer verbal communication, making sure that no one is left behind in the pursuit of better health outcomes.

Use cases overview

In this section, we discuss several possible use cases for this solution.

Use case 1: Audio-to-text translation and LLM integration for clinical trial patient interactions

In the domain of clinical trials, effective communication between patients and physicians is crucial for gathering accurate data, enforcing patient adherence, and maintaining study integrity. This use case demonstrates how audio-to-text translation combined with LLM capabilities can streamline and enhance the process of capturing and analyzing patient-physician interactions during clinical trial visits and telemedicine sessions.


The process flow consists of the following steps:

  1. Audio capture – During patient visits or telemedicine sessions, the audio of the patient-physician interaction is recorded securely, with appropriate consent and privacy measures in place.
  2. Audio-to-text translation – The recorded audio is processed through an automatic speech recognition (ASR) system, which converts the audio into text transcripts (see the code sketch after this list). This step provides an accurate and efficient conversion of spoken words into a format suitable for further analysis.
  3. Text preprocessing – The transcribed text undergoes preprocessing steps, such as removing identifying information, formatting the data, and enforcing compliance with relevant data privacy regulations.
  4. LLM integration – The preprocessed text is fed into a powerful LLM tailored for the healthcare and life sciences (HCLS) domain. The LLM analyzes the text, identifying key information relevant to the clinical trial, such as patient symptoms, adverse events, medication adherence, and treatment responses.
  5. Intelligent insights and recommendations – Using its large knowledge base and advanced natural language processing (NLP) capabilities, the LLM provides intelligent insights and recommendations based on the analyzed patient-physician interaction. These insights can include:
    1. Potential adverse event detection and reporting.
    2. Identification of protocol deviations or non-compliance.
    3. Recommendations for personalized patient care or adjustments to treatment regimens.
    4. Extraction of relevant data points for electronic health records (EHRs) and clinical trial databases.
  6. Data integration and reporting – The extracted insights and recommendations are integrated into the relevant clinical trial management systems, EHRs, and reporting mechanisms. This streamlines the process of data collection, analysis, and decision-making for clinical trial stakeholders, including investigators, sponsors, and regulatory authorities.
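Step 3 calls for removing identifying information before the transcript reaches the LLM. The post doesn’t prescribe a specific de-identification method, so the following is only a minimal sketch, assuming Amazon Comprehend Medical is used to detect protected health information (PHI) and mask it; any production preprocessing would need to be validated against the applicable privacy regulations.

import boto3

comprehend_medical = boto3.client("comprehendmedical")

def redact_phi(transcript: str) -> str:
    # Illustrative only: mask PHI entities detected by Amazon Comprehend Medical.
    response = comprehend_medical.detect_phi(Text=transcript)

    # Replace each detected PHI span with a placeholder, working backwards so
    # earlier character offsets stay valid.
    redacted = transcript
    entities = sorted(response["Entities"], key=lambda e: e["BeginOffset"], reverse=True)
    for entity in entities:
        placeholder = "[" + entity["Type"] + "]"
        redacted = redacted[: entity["BeginOffset"]] + placeholder + redacted[entity["EndOffset"]:]
    return redacted

# Hypothetical example:
# print(redact_phi("Patient John Smith reported mild nausea on March 3rd."))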

The solution offers the following potential benefits:

  • Improved data accuracy – By accurately capturing and analyzing patient-physician interactions, this approach minimizes the risks of manual transcription errors and provides high-quality data for clinical trial analysis and decision-making.
  • Enhanced patient safety – The LLM’s ability to detect potential adverse events and protocol deviations can help identify and mitigate risks, improving patient safety and study integrity.
  • Personalized patient care – Using the LLM’s insights, physicians can provide personalized care recommendations, tailored treatment plans, and better manage patient adherence, leading to improved patient outcomes.
  • Streamlined data collection and analysis – Automating the process of extracting relevant data points from patient-physician interactions can significantly reduce the time and effort required for manual data entry and analysis, enabling more efficient clinical trial management.
  • Regulatory compliance – By integrating the extracted insights and recommendations into clinical trial management systems and EHRs, this approach facilitates compliance with regulatory requirements for data capture, adverse event reporting, and trial monitoring.

This use case demonstrates the potential of combining audio-to-text translation and LLM capabilities to enhance patient-physician communication, improve data quality, and support informed decision-making in the context of clinical trials. By using advanced technologies, this integrated approach can contribute to more efficient, effective, and patient-centric clinical research processes.

Use case 2: Intelligent site monitoring with audio-to-text translation and LLM capabilities

In the HCLS domain, site monitoring plays a crucial role in maintaining the integrity and compliance of clinical trials. Site monitors conduct on-site visits, interview personnel, and verify documentation to assess adherence to protocols and regulatory requirements. However, this process can be time-consuming and prone to errors, particularly when dealing with extensive audio recordings and voluminous documentation.

By integrating audio-to-text translation and LLM capabilities, we can streamline and enhance the site monitoring process, leading to improved efficiency, accuracy, and decision-making support.


The process flow consists of the following steps:

  1. Audio capture and transcription – During site visits, monitors record interviews with site personnel, capturing valuable insights and observations. These audio recordings are then converted into text using ASR and audio-to-text translation technologies.
  2. Document ingestion – Relevant site documents, such as patient records, consent forms, and protocol manuals, are digitized and ingested into the system.
  3. LLM-powered data analysis – The transcribed interviews and ingested documents are fed into a powerful LLM, which can understand and correlate the information from multiple sources. The LLM can identify key insights, potential issues, and areas of non-compliance by analyzing the content and context of the data.
  4. Case report form generation – Based on the LLM’s analysis, a comprehensive case report form (CRF) is generated, summarizing the site visit findings, identifying potential risks or deviations, and providing recommendations for corrective actions or improvements.
  5. Decision support and site selection – The CRFs and associated data can be further analyzed by the LLM to identify patterns, trends, and potential risks across multiple sites. This information can be used to support decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and data analysis processes, site monitors can save significant time and effort, allowing them to focus on more critical tasks and cover more sites within the same time frame.
  • Enhanced accuracy – LLMs can identify and correlate subtle patterns and nuances within the data, reducing the risk of overlooking critical information or making erroneous assumptions.
  • Comprehensive documentation – The generated CRFs provide a standardized and detailed record of site visits, facilitating better communication and collaboration among stakeholders.
  • Regulatory compliance – The LLM-powered analysis can help identify potential areas of non-compliance, enabling proactive measures to address issues and mitigate risks.
  • Informed decision-making – The insights derived from the LLM’s analysis can support data-driven decision-making processes, such as site selection for future clinical trials, based on historical performance and compliance data.

By combining audio-to-text translation and LLM capabilities, this integrated approach offers a powerful solution for intelligent site monitoring in the HCLS domain, supporting improved efficiency, accuracy, and decision-making while providing regulatory compliance and quality assurance.

Use case 3: Enhancing adverse event reporting in clinical trials with audio-to-text and LLMs

Clinical trials are crucial for evaluating the safety and efficacy of investigational drugs and therapies. Accurate and comprehensive adverse event reporting is essential for identifying potential risks and making informed decisions. By combining audio-to-text translation with LLM capabilities, we can streamline and augment the adverse event reporting process, leading to improved patient safety and more efficient clinical research.


The process flow consists of the following steps:

  1. Audio data collection – During clinical trial visits or follow-ups, audio recordings of patient-doctor interactions are captured, including detailed descriptions of adverse events or symptoms experienced by the participants. These audio recordings can be obtained through various channels, such as in-person visits, telemedicine consultations, or dedicated voice reporting systems.
  2. Audio-to-text transcription – The audio recordings are processed through an audio-to-text translation system, converting the spoken words into written text format. ASR and NLP techniques provide accurate transcription, accounting for factors like accents, background noise, and medical terminology.
  3. Text data integration – The transcribed text data is integrated with other sources of adverse event reporting, such as electronic case report forms (eCRFs), patient diaries, and medication logs. This comprehensive dataset provides a holistic view of the adverse events reported across multiple data sources.
  4. LLM analysis – The integrated dataset is fed into an LLM specifically trained on medical and clinical trial data. The LLM analyzes the textual data, identifying patterns, extracting relevant information, and generating insights related to adverse event occurrences, severity, and potential causal relationships (a structured extraction sketch follows this list).
  5. Intelligent reporting and decision support – The LLM generates detailed adverse event reports, highlighting key findings, trends, and potential safety signals. These reports can be presented to clinical trial teams, regulatory bodies, and safety monitoring committees, supporting informed decision-making processes. The LLM can also provide recommendations for further investigation, protocol modifications, or risk mitigation strategies based on the identified adverse event patterns.
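Steps 4 and 5 imply machine-readable output that downstream safety systems can consume. As a minimal sketch (the prompt wording and field names below are assumptions, not part of a deployed system), the LLM can be asked to return adverse events as JSON so they can be parsed programmatically:

import json

# Hypothetical prompt asking the model for machine-readable adverse event records.
ADVERSE_EVENT_PROMPT = """You are a clinical safety reviewer.

From the transcript below, list every adverse event the patient reports.
Return only a JSON array in which each element has the keys "event",
"severity" (mild, moderate, or severe), and "quote" (the supporting text).

<transcript>{transcript}</transcript>"""

def parse_adverse_events(model_output: str) -> list:
    # Parse the model's JSON answer; json.loads raises if the output is malformed.
    return json.loads(model_output)

# Illustrative shape of the expected result:
# [{"event": "headache", "severity": "mild", "quote": "I had a slight headache after the second dose."}]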

The solution offers the following potential benefits:

  • Improved data capture – By using audio-to-text translation, valuable information from patient-doctor interactions can be captured and included in adverse event reporting, reducing the risk of missed or incomplete data.
  • Enhanced accuracy and completeness – The integration of multiple data sources, combined with the LLM’s analysis capabilities, provides a comprehensive and accurate understanding of adverse events, reducing the potential for errors or omissions.
  • Efficient data analysis – The LLM can rapidly process large volumes of textual data, identifying patterns and insights that might be difficult or time-consuming for human analysts to detect manually.
  • Timely decision support – Real-time adverse event reporting and analysis enable clinical trial teams to promptly identify and address potential safety concerns, mitigating risks and providing participant well-being.
  • Regulatory compliance – Comprehensive adverse event reporting and detailed documentation facilitate compliance with regulatory requirements and support transparent communication with regulatory agencies.

By integrating audio-to-text translation with LLM capabilities, this approach addresses the critical need for accurate and timely adverse event reporting in clinical trials, ultimately enhancing patient safety, improving research efficiency, and supporting informed decision-making in the HCLS domain.

Use case 4: Audio-to-text and LLM integration for enhanced patient care

In the healthcare domain, effective communication and accurate data capture are crucial for providing personalized and high-quality care. By integrating audio-to-text translation capabilities with LLM technology, we can streamline processes and unlock valuable insights, ultimately improving patient outcomes.


The process flow consists of the following steps:

  1. Audio input collection – Caregivers or healthcare professionals can record audio updates on a patient’s condition, mood, or relevant observations using a secure and user-friendly interface. This could be done through mobile devices, dedicated recording stations, or during virtual consultations.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a speech-to-text engine, which converts the spoken words into text format. Advanced NLP techniques provide accurate transcription, handling accents, medical terminology, and background noise.
  3. Text processing and contextualization – The transcribed text is then fed into an LLM trained on various healthcare datasets, including medical literature, clinical guidelines, and deidentified patient records. The LLM processes the text, identifies key information, and extracts relevant context and insights.
  4. LLM-powered analysis and recommendations – Using its sizeable knowledge base and natural language understanding capabilities, the LLM can perform various tasks, such as:
    1. Identifying potential health concerns or risks based on the reported symptoms and observations.
    2. Suggesting personalized care plans or treatment options aligned with evidence-based practices.
    3. Providing recommendations for follow-up assessments, diagnostic tests, or specialist consultations.
    4. Flagging potential drug interactions or contraindications based on the patient’s medical history.
    5. Generating summaries or reports in a structured format for efficient documentation and communication.
  5. Integration with EHRs – The analyzed data and recommendations from the LLM can be seamlessly integrated into the patient’s EHR, providing a comprehensive and up-to-date medical profile. This enables healthcare professionals to access relevant information promptly and make informed decisions during consultations or treatment planning.

The solution offers the following potential benefits:

  • Improved efficiency – By automating the transcription and analysis process, healthcare professionals can save time and focus on providing personalized care, rather than spending extensive hours on documentation and data entry.
  • Enhanced accuracy – ASR and NLP techniques provide accurate transcription, reducing errors and improving data quality.
  • Comprehensive patient insights – The LLM’s ability to process and contextualize unstructured audio data provides a more holistic understanding of the patient’s condition, enabling better-informed decision-making.
  • Personalized care plans – By using the LLM’s knowledge base and analytical capabilities, healthcare professionals can develop tailored care plans aligned with the patient’s specific needs and medical history.
  • Streamlined communication – Structured reports and summaries generated by the LLM facilitate efficient communication among healthcare teams, making sure everyone has access to the latest patient information.
  • Continuous learning and improvement – As more data is processed, the LLM can continuously learn and refine its recommendations, improving its performance over time.

By integrating audio-to-text translation and LLM capabilities, healthcare organizations can unlock new efficiencies, enhance patient-provider communication, and ultimately deliver superior care while staying at the forefront of technological advancements in the industry.

Use case 5: Audio-to-text translation and LLM integration for clinical trial protocol design

Efficient and accurate protocol design is crucial for successful study execution and regulatory compliance. By combining audio-to-text translation capabilities with the power of LLMs, we can streamline the protocol design process, using diverse data sources and AI-driven insights to create high-quality protocols in a timely manner.


The process flow consists of the following steps:

  1. Audio input collection – Clinical researchers, subject matter experts, and stakeholders provide audio inputs, such as recorded meetings, discussions, or interviews, related to the proposed clinical trial. These audio files can capture valuable insights, requirements, and domain-specific knowledge.
  2. Audio-to-text transcription – Using ASR technology, the audio inputs are converted into text transcripts with high accuracy. This step makes sure that valuable information is captured and transformed into a format suitable for further processing by LLMs.
  3. Data integration – Relevant data sources, such as previous clinical trial protocols, regulatory guidelines, scientific literature, and medical databases, are integrated into the workflow. These data sources provide contextual information and serve as a knowledge base for the LLM.
  4. LLM processing – The transcribed text, along with the integrated data sources, is fed into a powerful LLM. The LLM uses its knowledge base and NLP capabilities to analyze the inputs, identify key elements, and generate a draft clinical trial protocol.
  5. Protocol refinement and review – The draft protocol generated by the LLM is reviewed by clinical researchers, medical experts, and regulatory professionals. They provide feedback, make necessary modifications, and enforce compliance with relevant guidelines and best practices.
  6. Iterative improvement – As the AI system receives feedback and correlated outcomes from completed clinical trials, it continuously learns and refines its protocol design capabilities. This iterative process enables the LLM to become more accurate and efficient over time, leading to higher-quality protocol designs.

The solution offers the following potential benefits:

  • Efficiency – By automating the initial protocol design process, researchers can save valuable time and resources, allowing them to focus on more critical aspects of clinical trial execution.
  • Accuracy and consistency – LLMs can use vast amounts of data and domain-specific knowledge, reducing the risk of errors and providing consistency across protocols.
  • Knowledge integration – The ability to seamlessly integrate diverse data sources, including audio recordings, scientific literature, and regulatory guidelines, enhances the quality and comprehensiveness of the protocol design.
  • Continuous improvement – The iterative learning process allows the AI system to adapt and improve its protocol design capabilities based on real-world outcomes, leading to increasingly accurate and effective protocols over time.
  • Decision-making support – By providing well-structured and comprehensive protocols, the AI-driven approach enables better-informed decision-making for clinical researchers, sponsors, and regulatory bodies.

This integrated approach using audio-to-text translation and LLM capabilities has the potential to revolutionize the clinical trial protocol design process, ultimately contributing to more efficient and successful clinical trials, accelerating the development of life-saving treatments, and improving patient outcomes.

Use case 6: Voice-enabled clinical trial and disease information assistant

In the HCLS domain, effective communication and access to accurate information are crucial for patients, caregivers, and healthcare professionals. This use case demonstrates how audio-to-text translation combined with LLM capabilities can address these needs by providing an intelligent, voice-enabled assistant for clinical trial and disease information.


The process flow consists of the following steps:

  1. Audio input – The user, whether a patient, caregiver, or healthcare professional, can initiate the process by providing a voice query related to a specific disease or clinical trial. This could include questions about the disease itself, treatment options, ongoing trials, eligibility criteria, or other relevant information.
  2. Audio-to-text translation – The audio input is converted into text using state-of-the-art speech recognition technology. This step makes sure that the user’s query is accurately transcribed and ready for further processing by the LLM.
  3. Data integration – The system integrates various data sources, including clinical trial data, disease-specific information from reputable sources (such as PubMed or WebMD), and other relevant third-party resources. This comprehensive data integration makes sure that the LLM has access to a large knowledge base for generating accurate and comprehensive responses.
  4. LLM processing – The transcribed query is fed into the LLM, which uses its natural language understanding capabilities to comprehend the user’s intent and extract relevant information from the integrated data sources. The LLM can provide intelligent responses, insights, and recommendations based on the query and the available data.
  5. Response generation – The LLM generates a detailed, context-aware response addressing the user’s query. This response can be presented in various formats, such as text, audio (using text-to-speech technology), or a combination of both, depending on the user’s preferences and accessibility needs (a minimal text-to-speech sketch follows this list).
  6. Feedback and continuous improvement – The system can incorporate user feedback mechanisms to improve its performance over time. This feedback can be used to refine the LLM’s understanding, enhance the data integration process, and make sure that the system remains up to date with the latest clinical trial and disease information.
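Step 5 mentions returning the answer as audio using text-to-speech. The post doesn’t name a specific text-to-speech service, so the sketch below is only one hedged possibility, assuming Amazon Polly is used to synthesize the LLM’s text response into an MP3 file.

import boto3

polly = boto3.client("polly")

def synthesize_response(text: str, output_path: str = "response.mp3") -> str:
    # Illustrative only: convert an LLM text response into speech with Amazon Polly.
    result = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",  # assumption: any available Polly voice would work
    )
    with open(output_path, "wb") as audio_file:
        audio_file.write(result["AudioStream"].read())
    return output_path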

The solution offers the following potential benefits:

  • Improved access to information – By using voice input and NLP capabilities, the system empowers patients, caregivers, and healthcare professionals to access accurate and comprehensive information about diseases and clinical trials, regardless of their technical expertise or literacy levels.
  • Enhanced communication – The voice-enabled interface facilitates seamless communication between users and the system, enabling them to ask questions and receive responses in a conversational manner, mimicking human-to-human interaction.
  • Personalized insights – The LLM can provide personalized insights and recommendations based on the user’s specific query and context, enabling more informed decision-making and tailored support for individuals.
  • Time and efficiency gains – By automating the process of information retrieval and providing intelligent responses, the system can significantly reduce the time and effort required for healthcare professionals to manually search and synthesize information from multiple sources.
  • Improved patient engagement – By offering accessible and user-friendly access to disease and clinical trial information, the system can empower patients and caregivers to actively participate in their healthcare journey, fostering better engagement and understanding.

This use case highlights the potential of integrating audio-to-text translation with LLM capabilities to address real-world challenges in the HCLS domain. By using cutting-edge technologies, this solution can improve information accessibility, enhance communication, and support more informed decision-making for all stakeholders involved in clinical trials and disease management.

For demonstration purposes, we focus on the following use case:

Use case overview: Patient reporting and analysis in clinical trials

In clinical trials, it’s crucial to gather accurate and comprehensive patient data to assess the safety and efficacy of investigational drugs or therapies. Traditional methods of collecting patient reports can be time-consuming, prone to errors, and might result in incomplete or inconsistent data. By combining audio-to-text translation with LLM capabilities, we can streamline the patient reporting process and unlock valuable insights to support decision-making.


The process flow consists of the following steps:

  1. Audio input – Patients participating in clinical trials can provide their updates, symptoms, and feedback through voice recordings using a mobile application or a dedicated recording device.
  2. Audio-to-text transcription – The recorded audio files are securely transmitted to a cloud-based infrastructure, where they undergo automated transcription using ASR technology. The audio is converted into text, providing accurate and verbatim transcripts.
  3. Data consolidation – The transcribed patient reports are consolidated into a structured database, enabling efficient storage, retrieval, and analysis.
  4. LLM processing – The consolidated textual data is then processed by an LLM trained on biomedical and clinical trial data. The LLM can perform various tasks, including:
    1. Natural language processing – Extracting relevant information and identifying key symptoms, adverse events, or treatment responses from the patient reports.
    2. Sentiment analysis – Analyzing the emotional and psychological state of patients based on their language and tone, which can provide valuable insights into their overall well-being and treatment experience.
    3. Pattern recognition – Identifying recurring themes, trends, or anomalies across multiple patient reports, enabling early detection of potential safety concerns or efficacy signals.
    4. Knowledge extraction – Using the LLM’s understanding of biomedical concepts and clinical trial protocols to derive meaningful insights and recommendations from the patient data.
  5. Insights and reporting – The processed data and insights derived from the LLM are presented through interactive dashboards, visualizations, and reports. These outputs can be tailored to different stakeholders, such as clinical researchers, medical professionals, and regulatory authorities.

The solution offers the following potential benefits:

  • Improved data quality – By using audio-to-text transcription, the risk of errors and inconsistencies associated with manual data entry is minimized, providing high-quality patient data.
  • Time and cost-efficiency – Automated transcription and LLM-powered analysis can significantly reduce the time and resources required for data collection, processing, and analysis, leading to faster decision-making and cost savings.
  • Enhanced patient experience – Patients can provide their updates conveniently through voice recordings, reducing the burden of manual data entry and enabling more natural communication.
  • Comprehensive analysis – The combination of NLP, sentiment analysis, and pattern recognition capabilities offered by LLMs allows for a holistic understanding of patient experiences, treatment responses, and potential safety signals.
  • Regulatory compliance – Accurate and comprehensive patient data, coupled with robust analysis, can support compliance with regulatory requirements for clinical trial reporting and data documentation.

By integrating audio-to-text translation and LLM capabilities, clinical trial sponsors and research organizations can benefit from streamlined patient reporting, enhanced data quality, and powerful insights to support informed decision-making throughout the clinical development process.

Solution overview

The following diagram illustrates the solution architecture.

Solution overview: patient reporting and analysis in clinical trials

Key AWS services used in this solution include Amazon Simple Storage Service (Amazon S3), AWS HealthScribe, Amazon Transcribe, and Amazon Bedrock.

Prerequisites

This solution requires the following prerequisites:

Data samples

To illustrate the concept and provide a practical understanding, we have curated a collection of audio samples. These samples serve as representative examples, simulating site interviews conducted by researchers at clinical trial sites with patient participants.

The audio recordings offer a glimpse into the type of data typically encountered during such interviews. We encourage you to listen to these samples to gain a better appreciation of the data and its context.

These samples are for demonstration purposes only and don’t contain any real patient information or sensitive data. They are intended solely to provide a sample structure and format for the audio recordings used in this particular use case.

The sample data includes the following audio files:

  • Site interview 1
  • Site interview 2
  • Site interview 3
  • Site interview 4
  • Site interview 5

Prompt templates

Before deploying and running this solution, it’s essential to understand the input prompts and the anticipated output from the LLM. Although this is merely a sample, the range of possible outcomes can be expanded considerably by crafting creative prompts.

We use the following input prompt template:

You are an expert medical research analyst for clinical trials of medicines.

You will be provided with a dictionary containing text transcriptions of clinical trial interviews conducted between patients and interviewers.

The dictionary keys represent the interview_id, and the values contain the interview transcripts.

<interview_transcripts>add_interview_transcripts</interview_transcripts>

Your task is to analyze all the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.
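As a minimal sketch, the add_interview_transcripts placeholder can be filled in from a Python dictionary keyed by interview_id; the file names and transcript text below are hypothetical.

import json

# Hypothetical transcripts keyed by interview_id, as the template describes.
interview_transcripts = {
    "site-interview-1": "Transcribed text of the first interview...",
    "site-interview-2": "Transcribed text of the second interview...",
}

PROMPT_TEMPLATE = (
    "You are an expert medical research analyst for clinical trials of medicines.\n\n"
    "You will be provided with a dictionary containing text transcriptions of clinical trial "
    "interviews conducted between patients and interviewers.\n\n"
    "The dictionary keys represent the interview_id, and the values contain the interview transcripts.\n\n"
    "<interview_transcripts>{transcripts}</interview_transcripts>\n\n"
    "Your task is to analyze all the transcripts and generate a comprehensive report "
    "summarizing the key findings and conclusions from the clinical trial."
)

prompt = PROMPT_TEMPLATE.format(transcripts=json.dumps(interview_transcripts))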

The response from Amazon Bedrock will look like the following:

Based on the interview transcripts provided, here is a comprehensive report summarizing the key findings and conclusions from the clinical trial:

Introduction:

This report analyzes transcripts from interviews conducted with patients participating in a clinical trial for a new investigational drug. The interviews cover various aspects of the trial, including the informed consent process, randomization procedures, dosing schedules, follow-up visits, and patient experiences with potential side effects.

Key Findings:

1. Informed Consent Process:

– The informed consent process was thorough, with detailed explanations provided to patients about the trial’s procedures, potential risks, and benefits (Transcript 5).

– Patients were given ample time to review the consent documents, discuss them with family members, and have their questions addressed satisfactorily by the study team (Transcript 5).

– Overall, patients felt they fully understood the commitments and requirements of participating in the trial (Transcript 5).

2. Randomization and Blinding:

– Patients were randomized to either receive the investigational drug or a placebo, as part of a placebo-controlled study design (Transcript 2).

– The randomization process was adequately explained to patients, and they understood the rationale behind blinding, which is to prevent bias in the results (Transcript 2).

– Patients expressed acceptance of the possibility of receiving a placebo, recognizing its importance for the research (Transcript 2).

3. Dosing Schedule and Adherence:

– The dosing schedule involved taking the medication twice daily, in the morning and evening (Transcript 4).

– Some patients reported occasional difficulties in remembering the evening dose but implemented strategies like setting reminders on their phones to improve adherence (Transcript 4).

4. Follow-up Visits and Assessments:

– Follow-up visits were scheduled at specific intervals, such as 30 days, 3 months, and 6 months after the last dose (Transcripts 1 and 3).

– During these visits, various assessments were conducted, including blood tests, physical exams, ECGs, and evaluation of patient-reported outcomes like pain levels (Transcripts 1 and 3).

– Patients were informed that they would receive clinically significant findings from these assessments (Transcript 3).

5. Patient-Reported Side Effects:

– Some patients reported experiencing mild side effects, such as headaches, nausea, and joint pain improvement (Transcripts 3 and 4).

– The study team diligently documented and monitored these side effects, noting them in case report forms for further evaluation (Transcript 4).

6. Study Conduct and Communication:

– The study team provided 24/7 contact information, allowing patients to reach out with concerns between scheduled visits (Transcript 1).

– Patients were informed that they would receive information about the overall study results once available (Transcript 1).

– Patients were made aware of their ability to withdraw from the study at any time if they became uncomfortable (Transcript 2).

Conclusions:

Based on the interview transcripts, the clinical trial appears to have been conducted in a thorough and ethical manner, adhering to principles of informed consent, randomization, and blinding. Patients were adequately informed about the trial procedures, potential risks, and their rights as participants. The study team diligently monitored patient safety, documented adverse events, and maintained open communication channels. Overall, the transcripts suggest a well-managed clinical trial with a focus on patient safety, data integrity, and adherence to research protocols.
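The report above is generated by the deployed hcls_clinical_trial_analysis function, which you run in the walkthrough that follows. For readers who want to call the model directly, the following is a minimal sketch of invoking Anthropic’s Claude 3 Sonnet through the Amazon Bedrock Runtime API; the model ID and token limit are assumptions that may need adjusting for your account and Region.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def generate_report(prompt: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> str:
    # Minimal sketch: send the analysis prompt to Claude 3 Sonnet on Amazon Bedrock.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": prompt}]}
        ],
    }
    response = bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(body))
    payload = json.loads(response["body"].read())

    # The messages API returns a list of content blocks; join the text blocks.
    return "".join(block["text"] for block in payload["content"] if block["type"] == "text")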

Deploy resources with AWS CloudFormation

To deploy the solution, use the provided AWS CloudFormation template.

Test the application

To test the application, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Locate your bucket starting with blog-hcls-assets-*.
  3. Navigate to the S3 prefix hcls-framework/samples-input-audio/. You will see sample audio files, which we reviewed earlier in this post.
  4. Select these files, and on the Actions menu, choose Copy.
  5. For Destination, choose Browse S3 and navigate to the S3 path for hcls-framework/input-audio/.

Copying these sample files will trigger an S3 event invoking the AWS Lambda function audio-to-text. To review the invocations of the Lambda function on the AWS Lambda console, navigate to the audio-to-text function and then the Monitor tab, which contains detailed logs.

Review AWS Lambda execution logs
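The source of the audio-to-text function is provisioned by the CloudFormation template rather than shown in this post. For orientation only, an S3-triggered Lambda function that starts one Amazon Transcribe job per uploaded file could look roughly like the following sketch; the media format, job naming, and output prefix are assumptions.

import time
import urllib.parse
import boto3

transcribe = boto3.client("transcribe")

def lambda_handler(event, context):
    # Sketch: start an Amazon Transcribe job for each audio object in the S3 event.
    for index, record in enumerate(event["Records"]):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        transcribe.start_transcription_job(
            TranscriptionJobName=f"hcls-audio-to-text-{int(time.time())}-{index}",
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            MediaFormat="mp3",  # assumption: the sample recordings are MP3 files
            LanguageCode="en-US",
            OutputBucketName=bucket,
            OutputKey="hcls-framework/input-text/",
        )
    return {"statusCode": 200}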

You can review the status of the Amazon Transcribe jobs on the Amazon Transcribe console.

At this point, the interview transcripts are ready. They should be available in Amazon S3 under the prefix hcls-framework/input-text/.

You can download a sample file and review its contents. You will notice that the file is JSON, with the text transcript available under the transcripts key, along with other metadata.
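If you prefer to inspect a transcript programmatically instead of downloading it from the console, a short sketch like the following works, assuming the standard Amazon Transcribe output layout; the bucket suffix and file name are placeholders.

import json
import boto3

s3 = boto3.client("s3")

def load_transcript(bucket: str, key: str) -> str:
    # Sketch: read one Amazon Transcribe output file from S3 and return its text.
    obj = s3.get_object(Bucket=bucket, Key=key)
    document = json.loads(obj["Body"].read())
    # Standard Transcribe output keeps the full text under results -> transcripts.
    return document["results"]["transcripts"][0]["transcript"]

# Hypothetical usage against the bucket created by the CloudFormation template:
# text = load_transcript("blog-hcls-assets-<suffix>", "hcls-framework/input-text/site-interview-1.json")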

Now let’s run Anthropic’s Claude 3 Sonnet using the Lambda function hcls_clinical_trial_analysis to analyze the transcripts and generate a comprehensive report summarizing the key findings and conclusions from the clinical trial.

  1. On the Lambda console, navigate to the function named hcls_clinical_trial_analysis.
  2. Choose Test.
  3. If the console prompts you to create a new test event, do so with default or no input to the test event.
  4. Run the test event.

To review the output, open the Lambda console, navigate to the function named hcls_clinical_trial_analysis, and on the Monitor tab, choose View CloudWatch Logs for detailed logs. In the logs, you will see your comprehensive report on the clinical trial.

So far, we have completed a process involving:

  • Collecting audio interviews from clinical trials
  • Transcribing the audio to text
  • Compiling transcripts into a dictionary
  • Using Amazon Bedrock (Anthropic’s Claude 3 Sonnet) to generate a comprehensive summary

Although we focused on summarization, this approach can be extended to other applications such as sentiment analysis, extracting key learnings, identifying common complaints, and more.
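As a small illustration of that flexibility, only the prompt needs to change; the examples below are hypothetical prompts that could be swapped into the same Amazon Bedrock call used for summarization.

# Hypothetical alternative prompts for the same transcript dictionary.
ALTERNATIVE_PROMPTS = {
    "sentiment_analysis": (
        "For each interview in <interview_transcripts>, classify the patient's overall "
        "sentiment as positive, neutral, or negative, and quote the phrases that support "
        "your classification."
    ),
    "common_complaints": (
        "Across all interviews in <interview_transcripts>, list the most frequently "
        "reported complaints or side effects, ranked by how many patients mention them."
    ),
}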

Summary

Healthcare patients often find themselves in need of reliable information about their conditions, clinical trials, or treatment options. However, accessing accurate and up-to-date medical knowledge can be a daunting task. Our innovative solution integrates cutting-edge audio-to-text translation and LLM capabilities to revolutionize how patients receive vital healthcare information. By using speech recognition technology, we can accurately transcribe patients’ spoken queries, allowing our LLM to comprehend the context and provide personalized, evidence-based responses tailored to their specific needs. This empowers patients to make informed decisions, enhances accessibility for those with disabilities or preferences for verbal communication, and alleviates the workload on healthcare professionals, ultimately improving patient outcomes and driving progress in the HCLS domain.

Take charge of your healthcare journey with our innovative voice-enabled virtual assistant. Empower yourself with accurate and personalized information by simply asking your questions aloud. Our cutting-edge solution integrates speech recognition and advanced language models to provide reliable, context-aware responses tailored to your specific needs. Embrace the future of healthcare today and experience the convenience of instantaneous access to vital medical information.


About the Authors

Vrinda Dabke leads AWS Professional Services North America Delivery. Prior to joining AWS, Vrinda held a variety of leadership roles in Fortune 100 companies like UnitedHealth Group, The Hartford, Aetna, and Pfizer. Her work has focused on business intelligence, analytics, and AI/ML. She is a motivational people leader with experience in leading and managing high-performing global teams in complex matrix organizations.

Kannan Raman leads the North America Delivery for the AWS Professional Services Healthcare and Life Sciences practice. He has over 24 years of healthcare and life sciences experience and provides thought leadership in digital transformation. He works with C-level customer executives to help them with their digital transformation agenda.

Rushabh Lokhande is a Senior Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and playing golf.

Bruno Klein is a Senior Machine Learning Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.
