NVIDIA Honors Americas Partners Advancing Agentic and Physical AI

NVIDIA this week recognized 14 partners leading the way across the Americas for their work advancing agentic and physical AI across industries.

The 2025 Americas NVIDIA Partner Network awards — announced at the GTC 2025 global AI conference — represent key efforts by industry leaders to help customers become experts in using AI to solve many of today’s greatest challenges. The awards honor the diverse contributions of NPN members fostering AI-driven innovation and growth.

This year, NPN introduced three new award categories that reflect how AI is driving economic growth and opportunities, including:

  • Trailblazer, which honors a visionary partner spearheading AI adoption and setting new industry standards.
  • Rising Star, which celebrates an emerging talent helping industries harness AI to drive transformation.
  • Innovation, which recognizes a partner that’s demonstrated exceptional creativity and forward thinking.

This year’s NPN ecosystem winners have helped companies across industries use AI to adapt to new challenges and prioritize energy-efficient accelerated computing. NPN partners help customers implement a broad range of AI technologies, including NVIDIA-accelerated AI factories, as well as large language models and generative AI chatbots, to transform business operations.

The 2025 NPN award winners for the Americas are:

  • Global Consulting Partner of the Year — Accenture is recognized for its impact and depth of engineering with its AI Refinery platform for industries, simulation and robotics, marketing and sovereignty, which helps organizations enhance innovation and growth with custom-built approaches to AI-driven enterprise reinvention.
  • Trailblazer Partner of the Year — Advizex is recognized for its commitment to driving innovation in AI and high-performance computing, helping industries like healthcare, manufacturing, retail and government seamlessly integrate advanced AI technologies into existing business frameworks. This enables organizations to achieve significant operational efficiencies, enhanced decision-making and accelerated digital transformation.
  • Rising Star Partner of the Year — AHEAD is recognized for its leadership, technical expertise and deployment of NVIDIA software, NVIDIA DGX systems, NVIDIA HGX and networking technologies to advance AI, benefitting customers across healthcare, financial services, life sciences and higher education.
  • Networking Partner of the Year — Computacenter is recognized for advancing high-performance computing and data centers with NVIDIA networking technologies. The company achieved this by using the NVIDIA AI Enterprise software platform, DGX platforms and NVIDIA networking to drive innovation and growth throughout industries with efficient, accelerated data centers.
  • Solution Integration Partner of the Year — EXXACT is recognized for its efforts in helping research institutions and businesses tap into generative AI, large language models and high-performance computing. The company harnesses NVIDIA GPUs and networking technologies to deliver powerful computing platforms that accelerate innovation and tackle complex computational challenges across various industries.
  • Enterprise Partner of the Year — World Wide Technology (WWT) is recognized for its leadership in advancing AI adoption among customers across industry verticals worldwide. The company expanded its end-to-end AI capabilities by integrating NVIDIA Blueprints into its AI Proving Ground and has made a $500 million commitment to AI development over three years to help speed enterprise generative AI deployments.
  • Software Partner of the Year — Mark III is recognized for the work of its cross-functional team spanning data scientists, developers, 3D artists, systems engineers, and HPC and AI architects, as well as its close collaborations with enterprises and institutions, to deploy NVIDIA software, including NVIDIA AI Enterprise and NVIDIA Omniverse, across industries. These efforts have helped many customers build software-powered pipelines and data flywheels with machine learning, generative AI, high-performance computing and digital twins.
  • Higher Education Research Partner of the Year — Mark III is recognized for its close engagement with universities, academic institutions and research organizations to cultivate the next generation of leaders across AI, machine learning, generative AI, high-performance computing and digital twins.
  • Healthcare Partner of the Year — Lambda is recognized for empowering healthcare and biotech organizations with AI training, fine-tuning and inferencing solutions to speed innovation and drive breakthroughs in AI-driven drug discovery. The company provides AI training, fine-tuning and inferencing solutions at every scale — from individual workstations to comprehensive AI factories — that help healthcare providers seamlessly integrate NVIDIA accelerated computing and software into their infrastructure.
  • Financial Services Partner of the Year — WWT is recognized for driving the digital transformation of the world’s largest banks and financial institutions. The company harnesses NVIDIA AI technologies to optimize data management, enhance cybersecurity and deliver transformative generative AI solutions, helping financial services clients navigate rapid technological changes and evolving customer expectations.
  • Innovation Partner of the Year — Cambridge Computer is recognized for supporting customers deploying transformative technologies, including NVIDIA Grace Hopper, NVIDIA Blackwell and the NVIDIA Omniverse platform for physical AI.
  • Service Delivery Partner of the Year — SoftServe is recognized for its impact in driving enterprise adoption of NVIDIA AI and Omniverse with custom NVIDIA Blueprints that tap into NVIDIA NIM microservices and NVIDIA NeMo and Riva software. SoftServe helps customers create generative AI services for industries spanning manufacturing, retail, financial services, auto, healthcare and life sciences.
  • Distribution Partner of the Year — TD SYNNEX has been recognized for the second consecutive year for supporting customers in accelerating AI growth through rapid delivery of NVIDIA accelerated computing and software, as part of its Destination AI initiative.
  • Rising Star Consulting Partner of the Year — Tata Consultancy Services (TCS) is recognized for its growth and commitment to providing industry-specific solutions that help customers adopt AI faster and at scale. Through its recently launched business unit and center of excellence built on NVIDIA AI Enterprise and Omniverse, TCS is poised to accelerate adoption of agentic AI and physical AI solutions to speed innovation for customers worldwide.
  • Canadian Partner of the Year — Hypertec is recognized for its advancement of high-performance computing and generative AI across Canada. The company has employed the full-stack NVIDIA platform to accelerate AI for financial services, higher education and research.
  • Public Sector Partner of the Year — Government Acquisitions (GAI) is recognized for its rapid AI deployment and robust customer relationships, helping serve the unique needs of the federal government by adding AI to operations to improve public safety and efficiency.

Learn more about the NPN program.

Read More

NVIDIA Blackwell Powers Real-Time AI for Entertainment Workflows

AI has been shaping the media and entertainment industry for decades, from early recommendation engines to AI-driven editing and visual effects automation. Real-time AI — which lets companies actively drive content creation, personalize viewing experiences and rapidly deliver data insights — marks the next wave of that transformation.

With the NVIDIA RTX PRO Blackwell GPU series, announced yesterday at the NVIDIA GTC global AI conference, media companies can now harness real-time AI for media workflows with unprecedented speed, efficiency and creative potential.

NVIDIA Blackwell serves as the foundation of NVIDIA Media2, an initiative that enables real-time AI by bringing together NVIDIA technologies — including NVIDIA NIM microservices, NVIDIA AI Blueprints, accelerated computing platforms and generative AI software — to transform all aspects of production workflows and experiences, starting with content creation, streaming and live media.

Powering Intelligent Content Creation

Accelerated computing enables AI-driven workflows to process massive datasets in real time, unlocking faster rendering, simulation and content generation.

The NVIDIA RTX PRO Blackwell GPU series includes new features that enable unprecedented graphics and AI performance. The new NVIDIA Streaming Multiprocessor offers up to 1.5x faster throughput than the NVIDIA Ada generation, along with neural shaders that integrate AI inside programmable shaders for advanced content creation.

Fourth-generation RT Cores deliver up to 2x the performance of the previous generation, enabling the creation of massive photoreal and physically accurate animated scenes. Fifth-generation Tensor Cores deliver up to 4,000 trillion AI operations per second and add support for FP4 precision. And up to 96GB of GDDR7 memory boosts GPU bandwidth and capacity, allowing applications to run faster and work with larger, more complex datasets for massive 3D and AI projects, large-scale virtual-reality environments and more.

Elio © Disney/Pixar

“One of the most exciting aspects of new technology is how it empowers our artists with tools to enhance their creative workflows,” said Steve May, chief technology officer of Pixar Animation Studios. “With Pixar’s next-generation renderer, RenderMan XPU — optimized for the NVIDIA Blackwell platform — 99% of Pixar shots can now fit within the 96GB of memory on the NVIDIA RTX PRO 6000 Blackwell GPUs. This breakthrough will fundamentally improve the way we make movies.”

© Lucasfilm Ltd.

“Our artists were frequently maxing out our 48GB cards with ILM StageCraft environments and having to battle performance issues on set for 6K and 8K real-time renders,” said Stephen Hill, principal rendering engineer at Lucasfilm. “The new NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU lifts these limitations — we’re seeing upwards of a 2.5x performance increase over our current production GPUs, and with 96GB of VRAM we now have twice as much memory to play with.”

In addition, neural rendering with NVIDIA RTX Kit brings cinematic-quality ray tracing and AI-enhanced graphics to real-time engines, elevating visual fidelity in film, TV and interactive media. Including neural texture compression, neural shaders, RTX Global Illumination and Mega Geometry, RTX Kit is a suite of neural rendering technologies that enhance graphics for games, animation, virtual production scenes and immersive experiences.

Fueling the Future of Streaming and Data Analytics

Data analytics is transforming raw audience insights into actionable intelligence faster than ever. NVIDIA accelerated computing and AI-powered frameworks enable studios to analyze viewer behavior, predict engagement patterns and optimize content in real time, driving hyper-personalized experiences and smarter creative decisions.

With the new GPUs, users can achieve real-time ingestion and data transformation with GPU-accelerated data loading and cleansing at scale.

The NVIDIA technologies accelerating streaming and data analytics include a suite of NVIDIA CUDA-X data processing libraries that enable immediate insights from continuous data streams and reduce latency, such as:

  • NVIDIA cuML: Enables GPU-accelerated training and inference for recommendation models using scikit-learn algorithms, providing real-time personalization capabilities and up-to-date relevant content recommendations that boost viewer engagement while reducing churn.
  • NVIDIA cuDF: Offers pandas DataFrame operations on GPUs, enabling faster and more efficient NVIDIA-accelerated extract, transform and load operations and analytics. cuDF helps optimize content delivery by analyzing user data to predict demand and adjust content distribution in real time, improving overall user experiences. A brief usage sketch of cuDF and cuML follows this list.
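
To make the role of these libraries more concrete, the following is a minimal sketch, assuming a hypothetical viewing-events dataset, that pairs cuDF for GPU-accelerated loading and cleansing with a cuML nearest-neighbors model for "viewers like you" recommendations. The file name, column names and parameters are illustrative, not part of any NVIDIA blueprint.

# Minimal sketch: GPU-accelerated ETL with cuDF plus a cuML nearest-neighbors
# recommender. The file path, column names and parameters are illustrative only.
import cudf
from cuml.neighbors import NearestNeighbors

# Load and clean raw viewing events on the GPU (hypothetical schema)
events = cudf.read_parquet("viewing_events.parquet")
events = events.dropna(subset=["user_id", "title_id", "watch_seconds"])

# Aggregate per-user engagement features
features = (
    events.groupby("user_id")
    .agg({"watch_seconds": "sum", "title_id": "count"})
    .rename(columns={"title_id": "titles_watched"})
    .reset_index()
)

# Fit a simple "viewers like you" model entirely on the GPU
nn = NearestNeighbors(n_neighbors=5)
nn.fit(features[["watch_seconds", "titles_watched"]])

# Find the most similar viewers for the first user in the batch
distances, indices = nn.kneighbors(features[["watch_seconds", "titles_watched"]].head(1))
print(indices)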

Along with cuML and cuDF, NVIDIA's accelerated data science libraries integrate seamlessly with the open-source Dask library for multi-GPU or multi-node clusters. The large GPU memory of NVIDIA RTX PRO Blackwell GPUs can further assist with handling massive datasets and spikes in usage without sacrificing performance.

In addition, the video search and summarization blueprint integrates vision language models and large language models, providing cloud-native building blocks for building video analytics, search and summarization applications.

Breathing Life Into Live Media 

With NVIDIA RTX PRO Blackwell GPUs, broadcasters can achieve higher performance than ever in high-resolution video processing, real-time augmented reality and AI-driven content production and video analytics.

New features include:

  • Ninth-Generation NVIDIA NVENC: Adds support for 4:2:2 encoding, accelerating video encoding speed and improving quality for broadcast and live media applications while reducing costs of storing uncompressed video.
  • Sixth-Generation NVIDIA NVDEC: Provides up to double the H.264 decoding throughput and adds support for 4:2:2 H.264 and HEVC decoding, so professionals benefit from high-quality video playback, faster video data ingestion and advanced AI-powered video editing features.
  • Fifth-Generation PCIe: Provides double the bandwidth over the previous generation, improving data transfer speeds from CPU memory and unlocking faster performance for data-intensive tasks.
  • DisplayPort 2.1: Drives high-resolution displays at up to 8K at 240Hz and 16K at 60Hz. Increased bandwidth enables seamless multi-monitor setups, while high dynamic range and higher color depth support deliver more precise color accuracy for tasks like video editing and live broadcasting.

“The NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU is a transformative force in Cosm’s mission to redefine immersive entertainment,” said Devin Poolman, chief product and technology officer at Cosm, a global immersive technology, media and entertainment company. “With its unparalleled performance, we can push the boundaries of real-time rendering, unlocking the ultra-high resolution and fluid frame rates needed to make our live, immersive experiences feel nearly indistinguishable from reality.”

As a key component of Cosm’s CX System 12K LED dome displays, RTX PRO 6000 Max-Q enables seamless merging of the physical and digital worlds to deliver shared reality experiences, enabling audiences to engage with sports, live events and cinematic content in entirely new ways.

Cosm’s shared reality experience features its 87-foot-diameter LED dome display in stunning 12K resolution, with millions of pixels shining 10x brighter than the brightest cinematic display. Image courtesy of Cosm.

To learn more about NVIDIA Media2, watch the GTC keynote and register to attend sessions from NVIDIA and industry leaders at the show, which runs through Friday, March 21. 

Try NVIDIA NIM microservices and AI Blueprints on build.nvidia.com.
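
For developers who want to prototype against NIM programmatically, the microservices hosted on build.nvidia.com expose an OpenAI-compatible API. The following is a minimal sketch; the model name is just one example from the catalog, and the API key placeholder must be replaced with a key generated on build.nvidia.com.

# Minimal sketch: calling a hosted NIM microservice through its OpenAI-compatible
# endpoint. The model name is an example; replace the API key placeholder with
# a key generated on build.nvidia.com.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example model; any catalog NIM works
    messages=[{"role": "user", "content": "Explain 4:2:2 chroma subsampling in two sentences."}],
    temperature=0.2,
    max_tokens=128,
)
print(response.choices[0].message.content)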

Read More

Amazon Q Business now available in Europe (Ireland) AWS Region

Today, we are excited to announce that Amazon Q Business—a fully managed, generative AI-powered assistant that you can configure to answer questions, provide summaries and generate content based on your enterprise data—is now generally available in the Europe (Ireland) AWS Region.

Since its launch, Amazon Q Business has been helping customers find information, gain insights, and take action at work. The general availability of Amazon Q Business in the Europe (Ireland) Region will help customers across Ireland and the EU transform how their employees work and access information, while maintaining data security and privacy requirements.

AWS customers and partners innovate using Amazon Q Business in Europe

Organizations across the EU are using Amazon Q Business for a wide variety of use cases, including answering questions about company data, summarizing documents, and providing business insights.

Katya Dunets, the AWS Lead Sales Engineer for Adastra, noted,

Adastra stands at the forefront of technological innovation, specializing in artificial intelligence, data, cloud, digital, and governance services. Our team was facing the daunting challenge of sifting through hundreds of documents on SharePoint, searching for content and information critical for market research and RFP generation. This process was not only time-consuming but also impeded our agility and responsiveness. Recognizing the need for a transformative solution, we turned to Amazon Q Business for its prowess in answering queries, summarizing documents, generating content, and executing tasks, coupled with its direct SharePoint integration. Amazon Q Business became the catalyst for unprecedented efficiency within Adastra, dramatically streamlining document retrieval, enhancing cross-team collaboration through shared insights from past projects, and accelerating our RFP development process by 70%. Amazon Q Business has not only facilitated a smoother exchange of knowledge within our teams but has also empowered us to maintain our competitive edge by focusing on innovation rather than manual tasks. Adastra’s journey with Amazon Q exemplifies our commitment to harnessing cutting-edge technology to better serve both our clients and their customers.

AllCloud is a cloud solutions provider specializing in cloud stack, infrastructure, platform, and Software-as-a-Service. Their CTO, Peter Nebel stated,

“AllCloud faces the common challenge of information sprawl. Critical knowledge for sales and delivery teams is scattered across various tools—Salesforce for customer and marketing data, Google Drive for documents, Bamboo for HR and internal information, and Confluence for internal wikis. This fragmented approach wastes valuable time as employees hunt and peck for the information they need, hindering productivity and potentially impacting client satisfaction. Amazon Q Business provides AllCloud a solution to increase productivity by streamlining information access. By leveraging Amazon Q’s natural language search capabilities, AllCloud can empower its personnel with a central hub to find answers to their questions across all their existing information sources. This drives efficiency and accuracy by eliminating the need for time-consuming searches across multiple platforms and ensures all teams have access to the most up-to-date information. Amazon Q will significantly accelerate productivity, across all lines of business, allowing AllCloud’s teams to focus on delivering exceptional service to their clients.”

Lars Ritter, Senior Manager at Woodmark Consulting noted,

“Amazon Bedrock and Amazon Q Business have been game-changers for Woodmark. Employees struggled with time-consuming searches across various siloed systems, leading to reduced productivity and slower operations. To solve for the inefficient retrieval of corporate knowledge from unstructured data sources we turned to Amazon Bedrock and Amazon Q Business for help. With this innovative solution, Woodmark has been able to revolutionize data accessibility, empowering our teams to effortlessly retrieve insights using simple natural language queries, and to make informed decisions without relying on specialized data teams, which was not feasible before. These solutions have dramatically increased efficiency, fostered a data-driven culture, and positioned us for scalable growth, driving our organization toward unparalleled success.”

Scott Kumono, Product Manager for Kinectus at Siemens Healthineers adds,

“Amazon Q Business has enhanced the delivery of service and clinical support for our ultrasound customers. Previously, finding specific information meant sifting through a 1,000-page manual or waiting for customer support to respond. Now, customers have instant access to answers and specifications right at their fingertips, using Kinectus Remote Service. With Amazon Q Business we were able to significantly reduce manual work and wait times to find the right information, allowing our customers to focus on what really matters – patient care.”

Till Gloger, Head of Digital Production Platform Region Americas at Volkswagen Group of America states,

“Volkswagen innovates not only on its products, but also on how to boost employee productivity and increase production throughput. Volkswagen is testing the use of Amazon Q to streamline employee workflows by potentially integrating it with existing processes. This integration has the possibility to help employees save time during the assembly process, reducing some processes from minutes to seconds, ultimately leading to more throughput.”

Pricing

With Amazon Q Business, enterprise customers pay for user subscriptions and index capacity. For more details, see Amazon Q Business pricing.

Get started with Amazon Q Business today

To get started with Amazon Q Business, users first need to configure an application environment and create a knowledge base using over 40 data source connectors that index documents (for example, text, PDF, images, and tables). Organizations then set up user authentication through AWS IAM Identity Center or other SAML-based identity providers like Okta, Ping Identity, and Microsoft Entra ID. After configuring access permissions, application users can navigate to their organization’s Amazon Q Business web interface using their credentials to begin interacting with Amazon Q Business and the data they have access to. Amazon Q Business enables natural language interactions where users can ask questions and receive answers based on their indexed documents, uploaded content, and world knowledge – this may include getting details, generating content or insights. Users can access Amazon Q Business through multiple channels including web applications, Slack, Microsoft Teams, Microsoft 365 for Word and Outlook, or through browser extensions for generative AI assistance directly where they work. Additionally, customers can securely share their data with verified independent software vendors (ISVs) like Asana, Miro, PagerDuty, and Zoom using the data accessors feature, which maintains security and compliance while respecting user-level permissions.
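
Beyond the web interface and integrations above, you can also query an existing Amazon Q Business application programmatically. The following is a minimal sketch using the ChatSync API from the AWS SDK for Python (Boto3); the application ID and user ID are placeholders, and you should verify the parameters against the current Boto3 documentation for your identity configuration.

# Minimal sketch: querying an Amazon Q Business application with the ChatSync
# API. The application ID and user ID are placeholders; parameter requirements
# depend on how identity is configured for the application.
import boto3

qbusiness = boto3.client("qbusiness", region_name="eu-west-1")  # Europe (Ireland)

response = qbusiness.chat_sync(
    applicationId="YOUR_APPLICATION_ID",
    userId="employee@example.com",
    userMessage="Summarize the key points of our latest RFP response.",
)
print(response["systemMessage"])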

Learn more about how to get started with Amazon Q Business here. Read about other Amazon Q Business customers’ success stories here. Certain Amazon Q Business features already available in US East (N. Virginia) and US West (Oregon), including Q Apps, Q Actions, and audio/video file support, will become available in Europe (Ireland) soon.


About the Authors

Jose Navarro is an AI/ML Specialist Solutions Architect at AWS, based in Spain. Jose helps AWS customers—from small startups to large enterprises—architect and take their end-to-end machine learning use cases to production.

Morgan Dutton is a Senior Technical Program Manager at AWS, Amazon Q Business based in Seattle.

Eva Pagneux is a Principal Product Manager at AWS, Amazon Q Business, based in San Francisco.

Wesleigh Roeca is a Senior Worldwide Gen AI/ML Specialist at AWS, Amazon Q Business, based in Santa Monica.

Read More

PyTorch Day China 2025 Call for Proposals Open

We’re excited to announce the first-ever PyTorch Day China! This new event, hosted by the PyTorch Foundation, will take place on June 7 in Beijing, China, bringing together AI practitioners, researchers, and industry professionals to explore the latest advancements in open source AI and machine learning. Co-located with the BAAI Conference, PyTorch Day China is a chance to connect with the community, share knowledge, and help shape the future of deep learning.

Why Submit a Proposal?

PyTorch Day China offers a platform for AI practitioners and researchers to showcase their work, exchange ideas, and connect with others in the community. If you’re working on innovative applications, tools, or research in the PyTorch ecosystem, we encourage you to share your expertise.

Topics for Submission:

  • AI Applications and Use Cases
  • Core PyTorch Framework
  • DL Compilers and Kernel Authoring
  • Edge AI and On-Device
  • Ethical AI, Governance, and Regulation
  • Generative AI and Large Language Models (LLMs) with PyTorch
  • Open Source Collaboration, Education, and Community Building
  • Optimization for Training and Inference
  • PyTorch on Accelerator Hardware
  • PyTorch Ecosystem and Tools
  • PyTorch in Research and Academia
  • Performance Measurement and Benchmarking
  • Scaling Training and Inference

The submission deadline is April 13. Submit and learn more here: https://www.lfasiallc.com/pytorch-day-china/call-for-proposals-cfp/

Why Attend?

PyTorch Day China will feature technical talks, discussions, and poster sessions that highlight real-world applications and developments in AI and machine learning. Attendees will have the opportunity to learn from experts, contribute to the open source community, and engage with fellow PyTorch users. Registration information will be available in April.

Event Details

  • Date: June 7, 2025
  • Location: Zhongguancun Exhibition Center, Beijing, China
  • Address: 索家坟 (Suojiafen), Haidian District, Beijing, China, 100080
  • Co-located with: BAAI Conference

Travel Information

The venue, Zhongguancun Exhibition Center, is approximately 39 km from Beijing International Airport. More details on travel and accommodation will be available on the BAAI Conference website and updated here as they become available.

Have Questions?

For inquiries, please contact pytorchevents@linuxfoundation.org.

Submit your proposal by April 13 and join the conversation shaping the future of PyTorch.

Read More

SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine

We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs.

To view the PyTorch Ecosystem, see the PyTorch Landscape and learn more about how projects can join the PyTorch Ecosystem.

About SGLang

SGLang is a fast serving engine for large language models and vision language models. It makes interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

The core features include:

  • Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
  • Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions (see the brief sketch after this list).
  • Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
  • Active Community: SGLang is open source and backed by an active community with industry adoption.
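
To give a sense of the frontend language mentioned above, the following is a minimal sketch of a structured generation program. It assumes an SGLang server is already running locally on port 30000, such as the one launched in the Docker example below; the function and prompt are illustrative.

# Minimal sketch of SGLang's frontend language, run against a local SGLang
# server on port 30000. The function and prompt are illustrative.
import sglang as sgl

@sgl.function
def capital_qa(s, country):
    s += sgl.system("You are a concise geography assistant.")
    s += sgl.user(f"What is the capital of {country}?")
    s += sgl.assistant(sgl.gen("answer", max_tokens=32, temperature=0))

# Point the frontend at the backend runtime
sgl.set_default_backend(sgl.RuntimeEndpoint("http://127.0.0.1:30000"))

state = capital_qa.run(country="Japan")
print(state["answer"])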

SGLang is known for its speed and can often significantly outperform other state-of-the-art frameworks in serving throughput and latency. You can learn more about the underlying techniques from the past release blog posts: v0.2 blog, v0.3 blog, v0.4 blog.

SGLang has been widely adopted by leading industry companies and frontier research labs. For example, xAI uses SGLang to serve its flagship model, Grok 3, which is currently the best model according to the Chatbot Arena leaderboard. Microsoft Azure uses SGLang to serve DeepSeek R1 on AMD GPUs, which is currently the best open source model.

Serving DeepSeek Models

You can easily launch a Docker container to serve a DeepSeek model with the following command:

# Pull the latest image
docker pull lmsysorg/sglang:latest

# Launch a server
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

Then you can query the server with the OpenAI-compatible API:

import openai
client = openai.Client(base_url=f"http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)

The server launch command above works for 8xH200. You can find detailed instructions for other hardware (MI300X, H100, A100, H20, L40S) at https://docs.sglang.ai/references/deepseek.html.

SGLang integrates DeepSeek-specific optimizations, such as MLA throughput optimizations, MLA-optimized kernels, data-parallel attention, multi-token prediction, and DeepGemm, making it the top choice for serving DeepSeek models by dozens of companies, including AMD, NVIDIA, and many cloud providers. The team is actively working on integrating more optimizations following the 2025 H1 roadmap below.

Serving Llama Models

Similarly, you can launch the server for a Llama 3.1 text model with:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

Or a Llama 3.2 multimodal model with:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct  --chat-template=llama_3_vision

Roadmap

This year, the SGLang team will continue to push the boundaries of system efficiency. You can find the 2025 H1 roadmap here. The focus areas are:

  • Throughput-oriented large-scale deployment similar to the DeepSeek inference system
  • Long context optimizations
  • Low latency speculative decoding
  • Reinforcement learning training framework integration
  • Kernel optimizations

Community

SGLang has been deployed to large-scale production, generating trillions of tokens every day. It has an active community with over three hundred contributors on GitHub. It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, iFlytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.

Conclusion

We’re excited to welcome SGLang to the PyTorch ecosystem. SGLang accelerates the serving of large language and vision language models. It’s widely adopted by industry, powering the large-scale online serving of frontier models like Grok and DeepSeek.

We invite you to explore the SGLang GitHub repo, join the community on Slack, and reach out to contact@sglang.ai for inquiries or collaboration opportunities. Together, we can make powerful AI models accessible to everyone.

Read More

Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post is cowritten with Abdullahi Olaoye, Akshit Arora and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with:

  • Comprehensive development tools: A complete ecosystem of tools, scripts, and proven recipes that guide users through every phase of the LLM lifecycle, from initial data preparation to final deployment.
  • Advanced customization: Flexible customization options that teams can use to tailor models to their specific use cases while maintaining peak performance.
  • Optimized infrastructure: Sophisticated multi-GPU and multi-node configurations that maximize computational efficiency for both language and image applications.
  • Enterprise-grade features with built-in capabilities including:
    • Advanced parallelism techniques
    • Memory optimization strategies
    • Distributed checkpointing
    • Streamlined deployment pipelines

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent Python-based framework that enables flexible integration into each developer’s workflow. The framework provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.

The end-to-end NeMo Framework includes the following key features that streamline and accelerate AI development:

  • Data curation: NeMo Curator is a Python library that includes a suite of modules for data-mining and synthetic data generation. They are scalable and optimized for GPUs, making them ideal for curating natural language data to train or fine-tune LLMs. With NeMo Curator, you can efficiently extract high-quality text from extensive raw web data sources.
  • Training and customization: NeMo Framework provides tools for efficient training and customization of LLMs and multimodal models. It includes default configurations for compute cluster setup, data downloading, and model hyperparameter autotuning, which can be adjusted to train on new datasets and models. In addition to pre-training, NeMo supports both supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) techniques such as LoRA, P-tuning, and more (a minimal recipe-launch sketch follows this list).
  • Alignment: NeMo Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, reinforcement learning from human feedback (RLHF), and much more. By using these algorithms, you can align language models to be safer, more harmless, and more helpful.
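
To make the training and customization workflow concrete, the following is a minimal sketch of launching one of NeMo 2.0's predefined pre-training recipes with NeMo-Run on a single local node. Recipe names and arguments can differ between framework releases, and the output path is a placeholder, so treat this as illustrative rather than a drop-in script.

# Minimal sketch: launching a predefined NeMo 2.0 pre-training recipe locally
# with NeMo-Run. Recipe names/arguments may vary by release; the checkpoint
# directory is a placeholder.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_pretrain_demo",
    dir="/checkpoints/llama3_8b",  # placeholder output path
    num_nodes=1,
    num_gpus_per_node=8,
)

# Shrink the run for a quick smoke test
recipe.trainer.max_steps = 20

# Execute on the local machine; swap in a Slurm executor for cluster runs
run.run(recipe, executor=run.LocalExecutor())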

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include:

  • Setting up SageMaker HyperPod prerequisites: Configuring networking, storage, and permissions management (AWS Identity and Access Management (IAM) roles).
  • Launching the SageMaker HyperPod cluster: Using lifecycle scripts and a predefined cluster configuration to deploy compute resources.
  • Configuring the environment: Setting up NeMo Framework and installing the required dependencies.
  • Building a custom container: Creating a Docker image that packages NeMo Framework and installs the required AWS networking dependencies.
  • Running NeMo model training: Using NeMo-Run with a Slurm-based execution setup to train an example LLaMA (180M) model efficiently.

Architecture diagram

The architecture, shown in the preceding diagram, consists of an Amazon SageMaker HyperPod cluster.

Prerequisites

First, you deploy a SageMaker HyperPod cluster before running the job. But to deploy the cluster, you need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

  1. Sign in to the AWS Management Console using the AWS account you want to deploy the SageMaker HyperPod cluster in. You will create a VPC, subnets, an FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role as prerequisites, so make sure that your IAM role or user for console access has permissions to create these resources.
  2. Use the CloudFormation template: go to the AWS CloudFormation console and launch the solution template.
  3. Template parameters:
    • Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region.
    • All other parameters can be left as default or changed as needed for your use case.
  4. Select the acknowledgement box in the Capabilities section and create the stack.

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. For the model training job, you will use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

  1. Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
  2. Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
  3. Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle script manages include setting up Slurm and mounting the FSx Lustre filesystem.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
  4. Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
    1. Cluster name
    2. Three instance groups:
      1. Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring and debugging.
      2. Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
      3. Worker-group: The group of nodes that executes the actual model training workload.
    3. VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json 
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/$ROLE/${ROLE}/g" cluster-config.json 
$ sed -i "s/$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
  5. Create a config file with the cluster provisioning parameters, based on the following example, and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 Bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  6. Create the SageMaker HyperPod cluster:
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
  7. Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command showing the cluster status as InService.

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using the AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

  1. Install the AWS SSM Session Manager Plugin.
  2. Create a local key pair that can be added to the cluster by the helper script for easier SSH access and run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the command is properly configured by running several commands. See Get to know your Cluster for more information.

  1. View the existing partition and nodes per partition
$ sinfo
  2. List the jobs that are in the queue or running.
$ squeue
  3. SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster

#SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)

#Exit to the head node
$ exit

#Exit again to cancel the srun job above
$ exit
  4. Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, OFI plugin, update NCCL, and NCCL tests) to the NeMo Framework container from NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following command to build the Docker file and create a SquashFS image.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

  1. NeMo-Run requires Python 3.10; verify that it is installed on the head node before proceeding.
  2. Use the following steps to set up NeMo-Run dependencies in a virtual environment. The steps create and activate a virtual environment and then execute the venv.sh script to install the dependencies. The dependencies installed include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
  3. To prepare for the pre-training of the LLaMA model in an offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

  1. After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .

  2. After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, where the reduced_train_loss metric shows a decreasing loss curve over the training steps.

Troubleshooting

  • If some of the nodes appear in a down or down* state, as shown in the following example where both nodes are in down* status:

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. As shown in the following example, the nodes then return to an idle state.

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

  1. Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster

  2. Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.

About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, probabilistic design optimization and has completed his doctoral studies from Mechanical Engineering at Rice University and post-doctoral research from Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon AI MLOps, DevOps, Scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Read More

NeMo Retriever Llama 3.2 text embedding and reranking NVIDIA NIM microservices now available in Amazon SageMaker JumpStart

Today, we are excited to announce that the NeMo Retriever Llama3.2 Text Embedding and Reranking NVIDIA NIM microservices are available in Amazon SageMaker JumpStart. With this launch, you can now deploy NVIDIA’s optimized reranking and embedding models to build, experiment, and responsibly scale your generative AI ideas on AWS.

In this post, we demonstrate how to get started with these models on SageMaker JumpStart.

About NVIDIA NIM on AWS

NVIDIA NIM microservices integrate closely with AWS managed services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to enable the deployment of generative AI models at scale. As part of NVIDIA AI Enterprise available in AWS Marketplace, NIM is a set of user-friendly microservices designed to streamline and accelerate the deployment of generative AI. These prebuilt containers support a broad spectrum of generative AI models, from open source community models to NVIDIA AI foundation models (FMs) and custom models. NIM microservices provide straightforward integration into generative AI applications using industry-standard APIs and can be deployed with just a few lines of code, or with a few clicks on the SageMaker JumpStart console. Engineered to facilitate seamless generative AI inferencing at scale, NIM helps you deploy your generative AI applications.

Overview of NVIDIA NeMo Retriever NIM microservices

In this section, we provide an overview of the NVIDIA NeMo Retriever NIM microservices discussed in this post.

NeMo Retriever text embedding NIM

The NVIDIA NeMo Retriever Llama3.2 embedding NIM is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. In addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35-fold through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.

NeMo Retriever text reranking NIM

The NVIDIA NeMo Retriever Llama3.2 reranking NIM is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens). This model was evaluated on the same 26 languages mentioned earlier.

SageMaker JumpStart overview

SageMaker JumpStart is a fully managed service that offers state-of-the-art FMs for various use cases such as content writing, code generation, question answering, copywriting, summarization, classification, and information retrieval. It provides a collection of pre-trained models that you can deploy quickly, accelerating the development and deployment of ML applications. One of the key components of SageMaker JumpStart is model hubs, which offer a vast catalog of pre-trained models, such as Mistral, for a variety of tasks.

Solution overview

You can now discover and deploy the NeMo Retriever text embedding and reranking NIM microservices in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in a secure AWS environment and in your virtual private cloud (VPC), helping to support data security for enterprise security needs.
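
If you prefer a programmatic path, the following is a minimal sketch of deploying one of these microservices with the SageMaker Python SDK's JumpStart interface. The model ID shown is a placeholder; look up the exact ID for the NeMo Retriever NIM you want in SageMaker Studio or the JumpStart model catalog.

# Minimal sketch: programmatic JumpStart deployment with the SageMaker Python
# SDK. The model ID is a placeholder; look it up in the JumpStart catalog.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="nvidia-nim-llama-3-2-nv-embedqa")  # placeholder ID

# Marketplace-backed models require accepting the EULA at deployment time
predictor = model.deploy(accept_eula=True)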

In the following sections, we demonstrate how to deploy these microservices and run real-time and batch inference.

Make sure your SageMaker AWS Identity and Access Management (IAM) service role has the AmazonSageMakerFullAccess permission policy attached.

To deploy the NeMo Retriever Llama3.2 embedding and reranking microservices successfully, confirm one of the following:

  • Make sure your IAM role has the following AWS Marketplace permissions and that you have the authority to make AWS Marketplace subscriptions in the AWS account used (an optional programmatic check is sketched after this list):
    • aws-marketplace:ViewSubscriptions
    • aws-marketplace:Unsubscribe
    • aws-marketplace:Subscribe
  • Alternatively, confirm your AWS account already has a subscription to the model. If so, you can skip the following deployment instructions and start at the Subscribe to the model package section.
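If you want to check the IAM permissions programmatically before deploying, the IAM policy simulator can evaluate your role against the three AWS Marketplace actions. This is an optional sketch; the role ARN below is a placeholder that you would replace with the SageMaker execution role you plan to use.

import boto3

iam = boto3.client("iam")

# Placeholder ARN -- substitute your own SageMaker execution role
role_arn = "arn:aws:iam::123456789012:role/YourSageMakerExecutionRole"

response = iam.simulate_principal_policy(
    PolicySourceArn=role_arn,
    ActionNames=[
        "aws-marketplace:ViewSubscriptions",
        "aws-marketplace:Subscribe",
        "aws-marketplace:Unsubscribe",
    ],
)

for result in response["EvaluationResults"]:
    print(f'{result["EvalActionName"]}: {result["EvalDecision"]}')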

Deploy NeMo Retriever microservices on SageMaker JumpStart

For those new to SageMaker JumpStart, we demonstrate using SageMaker Studio to access models on SageMaker JumpStart. The following screenshot shows the NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

NeMo Retriever text embedding and reranking microservices available on SageMaker JumpStart.

Deployment starts when you choose Deploy. You might be prompted to subscribe to this model through AWS Marketplace. If you're already subscribed, choose Deploy again to proceed. After deployment finishes, an endpoint is created. You can test the endpoint by passing a sample inference request payload or by using the testing option in the SDK.

Deploy the NeMo Retriever microservice.

Subscribe to the model package

To subscribe to the model package, complete the following steps:

  1. Depending on the model you want to deploy, open the model package listing page for Llama-3.2-NV-EmbedQA-1B-v2 or Llama-3.2-NV-RerankQA-1B-v2.
  2. On the AWS Marketplace listing, choose Continue to subscribe.
  3. On the Subscribe to this software page, review and choose Accept Offer if you and your organization agree with the EULA, pricing, and support terms.
  4. Choose Continue to configuration and then choose an AWS Region.

A product Amazon Resource Name (ARN) will be displayed. This is the model package ARN that you need to specify while creating a deployable model using Boto3.

Deploy NeMo Retriever microservices using the SageMaker SDK

In this section, we walk through deploying the NeMo Retriever text embedding NIM through the SageMaker SDK. A similar process can be followed for deploying the NeMo Retriever text reranking NIM as well.

Define the SageMaker model using the model package ARN

To deploy the model using the SDK, copy the product ARN from the previous step and specify it as the model_package_arn in the following code:

# Set up the SageMaker client and execution role (shown here for completeness)
import boto3
from sagemaker import get_execution_role

sm = boto3.client("sagemaker")
role = get_execution_role()

# Define the model details
model_package_arn = "Specify the model package ARN here"
sm_model_name = "nim-llama-3-2-nv-embedqa-1b-v2"

# Create the SageMaker model from the Marketplace model package
create_model_response = sm.create_model(
    ModelName=sm_model_name,
    PrimaryContainer={
        "ModelPackageName": model_package_arn
    },
    ExecutionRoleArn=role,
    EnableNetworkIsolation=True
)
print("Model Arn: " + create_model_response["ModelArn"])

Create the endpoint configuration

Next, we create an endpoint configuration specifying the instance type; in this case, an ml.g5.2xlarge instance accelerated by an NVIDIA A10G GPU. Make sure your account-level service quota allows at least one ml.g5.2xlarge instance for endpoint usage. To request a service quota increase, refer to AWS service quotas. For further performance improvements, you can use NVIDIA Hopper GPUs (P5 instances) on SageMaker.

# Create the endpoint configuration
endpoint_config_name = sm_model_name

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": sm_model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.g5.2xlarge",
            "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-2",
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
            "ModelDataDownloadTimeoutInSeconds": 3600,  # Model download timeout in seconds
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,  # Container health check timeout in seconds
        }
    ]
)
print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Create the endpoint

Using the preceding endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to InService after the deployment is successful.

# Create the endpoint
endpoint_name = endpoint_config_name
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Deploy the NIM microservice

The create_endpoint call in the previous step starts the NIM deployment. The following code polls the endpoint status until deployment completes:

import time

resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

# Poll until the endpoint leaves the Creating state
while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

We get the following output:

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:611951037680:endpoint/nim-llama-3-2-nv-embedqa-1b-v2
Status: InService
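As an alternative to the manual polling loop above, you can let Boto3 wait for you. The SageMaker client exposes an endpoint_in_service waiter that blocks until the endpoint is ready and raises an error if deployment fails; the delay and retry values below are illustrative.

# Alternative to manual polling: use the built-in Boto3 waiter
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=endpoint_name,
    WaiterConfig={"Delay": 60, "MaxAttempts": 60},  # check every 60 seconds, for up to 1 hour
)
print("Endpoint is in service")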

After you deploy the model, your endpoint is ready for inference. In the following section, we use sample text to make an inference request. For the inference request format, NIM on SageMaker supports the OpenAI API inference protocol (at the time of writing). For an explanation of supported parameters, see Create an embedding vector from the input text.

Inference example with NeMo Retriever text embedding NIM microservice

The NVIDIA NeMo Retriever Llama3.2 embedding model is optimized for multilingual and cross-lingual text question-answering retrieval with support for long documents (up to 8,192 tokens) and dynamic embedding size (Matryoshka Embeddings). In this section, we provide examples of running real-time inference and batch inference.

Real-time inference example

The following code example illustrates how to perform real-time inference using the NeMo Retriever Llama3.2 embedding model:

import json
import pprint

import boto3

# SageMaker Runtime client used to invoke the endpoint
client = boto3.client("sagemaker-runtime")

pp1 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=3)

input_embedding = '''{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}'''

print("Example input data for embedding model endpoint:")
print(input_embedding)

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=input_embedding
)

print("\nEmbedding endpoint response:")
response = json.load(response["Body"])
pp1.pprint(response)

We get the following output:

Example input data for embedding model endpoint:
{
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", 
"input": ["Sample text 1", "Sample text 2"],
"input_type": "query"
}

Embedding endpoint response:
{ 'data': [ {'embedding': [...], 'index': 0, 'object': 'embedding'},
            {'embedding': [...], 'index': 1, 'object': 'embedding'}],
  'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
  'object': 'list',
  'usage': {'prompt_tokens': 14, 'total_tokens': 14}}

Batch inference example

When you have many documents, you can vectorize each of them with a for loop, but this results in one request per document. Alternatively, you can send batches of documents in each request to reduce the number of calls to the API endpoint. The following example uses a dataset of 10 documents in different languages:

documents = [
    "El futuro de la computación cuántica en aplicaciones criptográficas.",
    "L’application des réseaux neuronaux dans les systèmes de véhicules autonomes.",
    "Analyse der Rolle von Big Data in personalisierten Gesundheitslösungen.",
    "L’evoluzione del cloud computing nello sviluppo di software aziendale.",
    "Avaliando o impacto da IoT na infraestrutura de cidades inteligentes.",
    "Потенциал граничных вычислений для обработки данных в реальном времени.",
    "评估人工智能在欺诈检测系统中的有效性。",
    "倫理的なAIアルゴリズムの開発における課題と機会。",
    "دمج تقنية الجيل الخامس (5G) في تعزيز الاتصال بالإنترنت للأشياء (IoT).",
    "सुरक्षित लेनदेन के लिए बायोमेट्रिक प्रमाणीकरण विधियों में प्रगति।"
]

The following code demonstrates how to group the documents into batches and invoke the endpoint repeatedly to vectorize the whole dataset. Specifically, the example code loops over the 10 documents in batches of size 5 (batch_size=5).

pp2 = pprint.PrettyPrinter(indent=2, width=80, compact=True, depth=2)

encoded_data = []
batch_size = 5

# Loop over the documents in increments of the batch size
for i in range(0, len(documents), batch_size):
    input = json.dumps({
        "input": documents[i:i+batch_size],
        "input_type": "passage",
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    })

    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=input,
    )

    response = json.load(response["Body"])

    # Concatenate vectors into a single list, preserving each document's original index
    encoded_data.extend(
        {"embedding": data[1]["embedding"], "index": data[0]}
        for data in zip(range(i, i + batch_size), response["data"])
    )

# Print the response data
pp2.pprint(encoded_data)

We get the following output:

[ {'embedding': [...], 'index': 0}, {'embedding': [...], 'index': 1},
  {'embedding': [...], 'index': 2}, {'embedding': [...], 'index': 3},
  {'embedding': [...], 'index': 4}, {'embedding': [...], 'index': 5},
  {'embedding': [...], 'index': 6}, {'embedding': [...], 'index': 7},
  {'embedding': [...], 'index': 8}, {'embedding': [...], 'index': 9}]
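With the document vectors collected in encoded_data, a simple retrieval step is to embed a query (note the input_type of "query" rather than "passage") and rank the documents by cosine similarity. The following sketch reuses the client, endpoint_name, documents, and encoded_data objects from the previous examples; the query string itself is illustrative.

import numpy as np

# Embed the query; the documents were embedded with input_type "passage"
query_payload = json.dumps({
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input": ["How is AI used in smart city infrastructure?"],
    "input_type": "query",
})

query_response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=query_payload,
)
query_vector = np.array(json.load(query_response["Body"])["data"][0]["embedding"])

# Rank the documents by cosine similarity to the query
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = sorted(
    ((item["index"], cosine(query_vector, np.array(item["embedding"]))) for item in encoded_data),
    key=lambda pair: pair[1],
    reverse=True,
)

print("Most similar document:", documents[scores[0][0]])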

Inference example with NeMo Retriever text reranking NIM microservice

The NVIDIA NeMo Retriever Llama3.2 reranking NIM microservice is optimized for providing a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8,192 tokens).

In the following example, we create an input payload for a list of emails in multiple languages:

payload_model = "nvidia/llama-3.2-nv-rerankqa-1b-v2"
query = {"text": "What emails have been about returning items?"}
documents = [
    {"text":"Contraseña incorrecta. Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?"},
    {"text":"Confirmation Email Missed. Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?"},
    {"text":"أسئلة حول سياسة الإرجاع. مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب"},
    {"text":"Customer Support is Busy. Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?"},
    {"text":"Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken."},
    {"text":"Customer Service is Unavailable. Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?"},
    {"text":"Return Policy for Defective Product. Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."},
    {"text":"收到错误物品. 早上好,关于我最近的订单,我有一个问题。我收到了错误的商品,需要退货。"},
    {"text":"Return Defective Product. Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."}
]

payload = {
  "model": payload_model,
  "query": query,
  "passages": documents,
  "truncate": "END"
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(f'Documents: {response}')
print(json.dumps(output, indent=2))

In this example, the reranking endpoint returns raw logit scores rather than values normalized to a fixed range. Higher scores indicate high relevance to the query, and more negative scores indicate low relevance.

Documents: {'ResponseMetadata': {'RequestId': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a3f19e06-f468-4382-a927-3485137ffcf6', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 04 Mar 2025 21:46:39 GMT', 'content-type': 'application/json', 'content-length': '349', 'connection': 'keep-alive'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fbb00ff94b0>}
{
  "rankings": [
    {
      "index": 4,
      "logit": 0.0791015625
    },
    {
      "index": 8,
      "logit": -0.1904296875
    },
    {
      "index": 7,
      "logit": -2.583984375
    },
    {
      "index": 2,
      "logit": -4.71484375
    },
    {
      "index": 6,
      "logit": -5.34375
    },
    {
      "index": 1,
      "logit": -5.64453125
    },
    {
      "index": 5,
      "logit": -11.96875
    },
    {
      "index": 3,
      "logit": -12.2265625
    },
    {
      "index": 0,
      "logit": -16.421875
    }
  ],
  "usage": {
    "prompt_tokens": 513,
    "total_tokens": 513
  }
}

Let’s see the top-ranked document for our query:

# 1. Extract the array of rankings
rankings = output["rankings"]  # or output.get("rankings", [])

# 2. Get the top-ranked entry (highest logit)
top_ranked_entry = rankings[0]
top_index = top_ranked_entry["index"]  # e.g., 4 in this example

# 3. Retrieve the corresponding document
top_document = documents[top_index]

print("Top-ranked document:")
print(top_document)

The following is the top-ranked document based on the provided relevance scores:

Top-ranked document:
{'text': 'Falschen Artikel erhalten. Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.'}

This translates to the following:

"Wrong item received. Hello, I have a question about my last order. I received the wrong item and need to return it."

Based on the preceding results from the model, we see that a higher logit indicates stronger alignment with the query, whereas lower (or more negative) values indicate lower relevance. In this case, the document discussing receiving the wrong item (in German) was ranked first with the highest logit, confirming that the model quickly and effectively identified it as the most relevant passage regarding product returns.
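If you prefer relevance scores on a 0-1 scale rather than raw logits, you can apply a sigmoid on the client side; the relative ordering of the documents is unchanged. This is a presentation convenience applied after the fact, not something the reranking NIM returns for you.

import math

def sigmoid(logit):
    # Map a raw logit to the (0, 1) range; the ordering is preserved
    return 1.0 / (1.0 + math.exp(-logit))

for ranking in output["rankings"]:
    print(f'index {ranking["index"]}: logit {ranking["logit"]:.4f} -> normalized {sigmoid(ranking["logit"]):.4f}')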

Clean up

To clean up your resources, use the following commands:

sm.delete_model(ModelName=sm_model_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_endpoint(EndpointName=endpoint_name)

Conclusion

The NVIDIA NeMo Retriever Llama 3.2 NIM microservices bring powerful multilingual capabilities to enterprise search and retrieval systems. These models excel in diverse use cases, including cross-lingual search applications, enterprise knowledge bases, customer support systems, and content recommendation engines. The text embedding NIM’s dynamic embedding size (Matryoshka Embeddings) reduces storage footprint by 35-fold while supporting 26 languages and documents up to 8,192 tokens. The reranking NIM accurately scores document relevance across languages, enabling precise information retrieval even for multilingual content. For organizations managing global knowledge bases or customer-facing search experiences, these NVIDIA-optimized microservices provide a significant advantage in latency, accuracy, and efficiency, allowing developers to quickly deploy sophisticated search capabilities without compromising on performance or linguistic diversity.

SageMaker JumpStart provides a straightforward way to use state-of-the-art large language FMs for text embedding and reranking. Through the UI or just a few lines of code, you can deploy a highly accurate text embedding model to generate dense vector representations that capture semantic meaning and a reranking model to find semantic matches and retrieve the most relevant information from various data stores at scale and cost-efficiently.


About the Authors

Niithiyn Vijeaswaran is a Generative AI Specialist Solutions Architect with the Third-Party Model Science team at AWS. His area of focus is AWS AI accelerators (AWS Neuron). He holds a Bachelor’s in Computer Science and Bioinformatics.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within cloud platforms and enhancing user experience on accelerated computing.

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Banu Nagasundaram leads product, engineering, and strategic partnerships for Amazon SageMaker JumpStart, the machine learning and generative AI hub provided by Amazon SageMaker. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.

Chase Pinkerton is a Startups Solutions Architect at Amazon Web Services. He holds a Bachelor’s in Computer Science with a minor in Economics from Tufts University. He’s passionate about helping startups grow and scale their businesses. When not working, he enjoys road cycling, hiking, playing volleyball, and photography.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon’s AI MLOps, DevOps, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models spanning from data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.

Read More

Accelerating AI Development With NVIDIA RTX PRO Blackwell Series GPUs and NVIDIA NIM Microservices for RTX

Accelerating AI Development With NVIDIA RTX PRO Blackwell Series GPUs and NVIDIA NIM Microservices for RTX

As generative AI capabilities expand, NVIDIA is equipping developers with the tools to seamlessly integrate AI into creative projects, applications and games to unlock groundbreaking experiences on NVIDIA RTX AI PCs and workstations.

At the NVIDIA GTC global AI conference this week, NVIDIA introduced the NVIDIA RTX PRO Blackwell series, a new generation of workstation and server GPUs built for complex AI-driven workloads, technical computing and high-performance graphics.

Alongside the new hardware, NVIDIA announced a suite of AI-powered tools, libraries and software development kits designed to accelerate AI development on PCs and workstations. With NVIDIA CUDA-X libraries for data science, developers can significantly accelerate data processing and machine learning tasks, enabling faster exploratory data analysis, feature engineering and model development with zero code changes. And with NVIDIA NIM microservices, developers can more seamlessly build AI assistants, productivity plug-ins and advanced content-creation workflows with peak performance.

AI at the Speed of NIM With RTX PRO Series GPUs

The RTX PRO Blackwell series is built to handle the most demanding AI-driven workflows, powering applications like AI agents, simulation, extended reality, 3D design and high-end visual effects. Whether for designing and engineering complex systems or creating sophisticated and immersive content, RTX PRO GPUs deliver the performance, efficiency and scalability professionals need.

The new lineup includes:

  • Desktop GPUs: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, NVIDIA RTX PRO 5000 Blackwell, NVIDIA RTX PRO 4500 Blackwell and NVIDIA RTX PRO 4000 Blackwell
  • Laptop GPUs: NVIDIA RTX PRO 5000 Blackwell, NVIDIA RTX PRO 4000 Blackwell, NVIDIA RTX PRO 3000 Blackwell, NVIDIA RTX PRO 2000 Blackwell, NVIDIA RTX PRO 1000 Blackwell and NVIDIA RTX PRO 500 Blackwell Laptop GPUs
  • Data center GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition

As AI and data science evolve, the ability to rapidly process and analyze massive datasets will become a key differentiator to enable breakthroughs across industries.

NVIDIA CUDA-X, built on CUDA, is a collection of libraries that delivers dramatically higher performance compared with CPU-only alternatives. With cuML 25.02 — now available in open beta — data scientists and researchers can accelerate scikit-learn, UMAP and HDBSCAN algorithms with zero code changes, unlocking new levels of performance and efficiency in machine learning tasks. This release extends the zero-code-change acceleration paradigm established by cuDF-pandas for DataFrame operations to machine learning, reducing iteration times from hours to seconds.
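As a rough illustration of the zero-code-change model, the short script below is ordinary scikit-learn code with no GPU-specific imports. Based on the cuML 25.02 documentation, launching it through the cuml.accel module (for example, python -m cuml.accel script.py, or %load_ext cuml.accel in a notebook) routes supported estimators to the GPU and falls back to the CPU implementation otherwise; the exact launch syntax and estimator coverage should be verified against the cuML release you have installed.

# clustering_example.py -- unmodified scikit-learn code.
# Assumed launch command (per the cuML 25.02 docs; verify for your version):
#   python -m cuml.accel clustering_example.py
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a synthetic dataset large enough for acceleration to matter
X, _ = make_blobs(n_samples=100_000, n_features=50, centers=8, random_state=0)

# Standard scikit-learn estimator; no code changes for GPU execution
model = KMeans(n_clusters=8, random_state=0)
labels = model.fit_predict(X)

print("Cluster sizes:", [int((labels == k).sum()) for k in range(8)])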

Optimized AI software unlocks even greater possibilities. NVIDIA NIM microservices are prepackaged, high-performance AI models optimized across NVIDIA GPUs, from RTX-powered PCs and workstations to the cloud. Developers can use NIM microservices to build AI-powered app assistants, productivity tools and content-creation workflows that seamlessly integrate with RTX PRO GPUs. This makes AI more accessible and powerful than ever.

NIM microservices integrate top community and NVIDIA-built models, spanning capabilities and modalities important for PC and workstation use cases, including large language models (LLMs), images, speech and retrieval-augmented generation (RAG).

Announced at the CES trade show in January, NVIDIA AI Blueprints are advanced AI reference workflows built on NVIDIA NIM. With AI Blueprints, developers can create podcasts from PDF documents, generate stunning 4K images controlled and guided by 3D scenes, and incorporate digital humans into AI-powered use cases.

Coming soon to build.nvidia.com, the blueprints are extensible and provide everything needed to build and customize them for different use cases. These resources include source code, sample data, a demo application and documentation.

From cutting-edge hardware to optimized AI models and reference workflows, the RTX PRO series is redefining AI-powered computing — enabling professionals to push the limits of creativity, productivity and innovation. Learn about all the GTC announcements and the RTX PRO Blackwell series GPUs for laptops and workstations.

Create NIMble AI Chatbots With ChatRTX

AI-powered chatbots are changing how people interact with their content.

ChatRTX is a demo app that personalizes an LLM connected to a user’s content, whether documents, notes, images or other data. Using RAG, the NVIDIA TensorRT-LLM library and RTX acceleration, users can query a custom chatbot to get contextually relevant answers. And because it all runs locally on Windows RTX PCs or RTX PRO workstations, they get fast and private results.

Today, the latest version of ChatRTX introduces support for NVIDIA NIM microservices, giving users access to new foundational models. NIM is expected to soon be available in additional top AI ecosystem apps. Download ChatRTX today.

Game On

Half-Life 2 owners can now download a free Half-Life 2 RTX demo from Steam, built with RTX Remix and featuring the latest neural rendering enhancements. RTX Remix supports a host of AI tools, including NVIDIA DLSS 4, RTX Neural Radiance Cache and the new community-published AI model PBRFusion 3, which upscales textures and generates high-quality normal, roughness and height maps for physically based materials.

The March NVIDIA Studio Driver is also now available for download, supporting recent app updates including last week’s RTX Remix launch. For automatic Studio Driver notifications, download the NVIDIA app.

In addition, NVIDIA RTX Kit, a suite of neural rendering technologies for game developers, is receiving major updates with Unreal Engine 5 support for the RTX Mega Geometry and RTX Hair features.

Learn more about the NVIDIA RTX PRO Blackwell GPUs by watching a replay of NVIDIA founder and CEO Jensen Huang’s GTC keynote and register to attend sessions from NVIDIA and industry leaders at the show, which runs through March 21.

Follow NVIDIA AI PC on Facebook, Instagram, TikTok and X — and stay informed by subscribing to the RTX AI PC newsletter.

Follow NVIDIA Workstation on LinkedIn and X

See notice regarding software product information.

Read More

AI Factories Are Redefining Data Centers and Enabling the Next Era of AI

AI Factories Are Redefining Data Centers and Enabling the Next Era of AI

AI is fueling a new industrial revolution — one driven by AI factories.

Unlike traditional data centers, AI factories do more than store and process data — they manufacture intelligence at scale, transforming raw data into real-time insights. For enterprises and countries around the world, this means dramatically faster time to value — turning AI from a long-term investment into an immediate driver of competitive advantage. Companies that invest in purpose-built AI factories today will lead in innovation, efficiency and market differentiation tomorrow.

While a traditional data center typically handles diverse workloads and is built for general-purpose computing, AI factories are optimized to create value from AI. They orchestrate the entire AI lifecycle — from data ingestion to training, fine-tuning and, most critically, high-volume inference.

For AI factories, intelligence isn’t a byproduct but the primary product. This intelligence is measured by AI token throughput — the real-time predictions that drive decisions, automation and entirely new services.

While traditional data centers aren’t disappearing anytime soon, whether they evolve into AI factories or connect to them depends on the enterprise business model.

Regardless of how enterprises choose to adapt, AI factories powered by NVIDIA are already manufacturing intelligence at scale, transforming how AI is built, refined and deployed.

The Scaling Laws Driving Compute Demand

Over the past few years, AI has revolved around training large models. But with the recent proliferation of AI reasoning models, inference has become the main driver of AI economics. Three key scaling laws highlight why:

  • Pretraining scaling: Larger datasets and model parameters yield predictable intelligence gains, but reaching this stage demands significant investment in skilled experts, data curation and compute resources. Over the last five years, pretraining scaling has increased compute requirements by 50 million times. However, once a model is trained, it significantly lowers the barrier for others to build on top of it.
  • Post-training scaling: Fine-tuning AI models for specific real-world applications requires 30x more compute during AI inference than pretraining. As organizations adapt existing models for their unique needs, cumulative demand for AI infrastructure skyrockets.
  • Test-time scaling (aka long thinking): Advanced AI applications such as agentic AI or physical AI require iterative reasoning, where models explore multiple possible responses before selecting the best one. This consumes up to 100x more compute than traditional inference.

Traditional data centers aren’t designed for this new era of AI. AI factories are purpose-built to optimize and sustain this massive demand for compute, providing an ideal path forward for AI inference and deployment.

Reshaping Industries and Economies With Tokens

Across the world, governments and enterprises are racing to build AI factories to spur economic growth, innovation and efficiency.

The European High Performance Computing Joint Undertaking recently announced plans to build seven AI factories in collaboration with 17 European Union member nations.

This follows a wave of AI factory investments worldwide, as enterprises and countries accelerate AI-driven economic growth across every industry and region.

These initiatives underscore a global reality: AI factories are quickly becoming essential national infrastructure, on par with telecommunications and energy.

Inside an AI Factory: Where Intelligence Is Manufactured

Foundation models, secure customer data and AI tools provide the raw materials for fueling AI factories, where inference serving, prototyping and fine-tuning shape powerful, customized models ready to be put into production.

As these models are deployed into real-world applications, they continuously learn from new data, which is stored, refined and fed back into the system using a data flywheel. This cycle of optimization ensures AI remains adaptive, efficient and always improving — driving enterprise intelligence at an unprecedented scale.

AI factories powered by NVIDIA for manufacturing enterprise intelligence at scale.

An AI Factory Advantage With Full-Stack NVIDIA AI

NVIDIA delivers a complete, integrated AI factory stack where every layer — from the silicon to the software — is optimized for training, fine-tuning, and inference at scale. This full-stack approach ensures enterprises can deploy AI factories that are cost effective, high-performing and future-proofed for the exponential growth of AI.

With its ecosystem partners, NVIDIA has created building blocks for the full-stack AI factory, offering:

  • Powerful compute performance
  • Advanced networking
  • Infrastructure management and workload orchestration
  • The largest AI inference ecosystem
  • Storage and data platforms
  • Blueprints for design and optimization
  • Reference architectures
  • Flexible deployment for every enterprise

Powerful Compute Performance

The heart of any AI factory is its compute power. From NVIDIA Hopper to NVIDIA Blackwell, NVIDIA provides the world’s most powerful accelerated computing for this new industrial revolution. With the NVIDIA Blackwell Ultra-based GB300 NVL72 rack-scale solution, AI factories can achieve up to 50X the output for AI reasoning, setting a new standard for efficiency and scale.

The NVIDIA DGX SuperPOD is the exemplar of the turnkey AI factory for enterprises, integrating the best of NVIDIA accelerated computing. NVIDIA DGX Cloud provides an AI factory that delivers NVIDIA accelerated compute with high performance in the cloud.

Global systems partners are building full-stack AI factories for their customers based on NVIDIA accelerated computing — now including the NVIDIA GB200 NVL72 and GB300 NVL72 rack-scale solutions.

Advanced Networking 

Moving intelligence at scale requires seamless, high-performance connectivity across the entire AI factory stack. NVIDIA NVLink and NVLink Switch enable high-speed, multi-GPU communication, accelerating data movement within and across nodes.

AI factories also demand a robust network backbone. The NVIDIA Quantum InfiniBand, NVIDIA Spectrum-X Ethernet, and NVIDIA BlueField networking platforms reduce bottlenecks, ensuring efficient, high-throughput data exchange across massive GPU clusters. This end-to-end integration is essential for scaling out AI workloads to million-GPU levels, enabling breakthrough performance in training and inference.

Infrastructure Management and Workload Orchestration

Businesses need a way to harness the power of AI infrastructure with the agility, efficiency and scale of a hyperscaler, but without the burdens of cost, complexity and expertise placed on IT.

With NVIDIA Run:ai, organizations can benefit from seamless AI workload orchestration and GPU management, optimizing resource utilization while accelerating AI experimentation and scaling workloads. NVIDIA Mission Control software, which includes NVIDIA Run:ai technology, streamlines AI factory operations from workloads to infrastructure while providing full-stack intelligence that delivers world-class infrastructure resiliency.

NVIDIA Mission Control streamlines workflows across the AI factory stack.

The Largest AI Inference Ecosystem

AI factories need the right tools to turn data into intelligence. The NVIDIA AI inference platform, spanning the NVIDIA TensorRT ecosystem, NVIDIA Dynamo and NVIDIA NIM microservices — all part (or soon to be part) of the NVIDIA AI Enterprise software platform — provides the industry’s most comprehensive suite of AI acceleration libraries and optimized software. It delivers maximum inference performance, ultra-low latency and high throughput.

Storage and Data Platforms

Data fuels AI applications, but the rapidly growing scale and complexity of enterprise data often make it too costly and time-consuming to harness effectively. To thrive in the AI era, enterprises must unlock the full potential of their data.

The NVIDIA AI Data Platform is a customizable reference design to build a new class of AI infrastructure for demanding AI inference workloads. NVIDIA-Certified Storage partners are collaborating with NVIDIA to create customized AI data platforms that can harness enterprise data to reason and respond to complex queries.

Blueprints for Design and Optimization

To design and optimize AI factories, teams can use the NVIDIA Omniverse Blueprint for AI factory design and operations. The blueprint enables engineers to design, test and optimize AI factory infrastructure before deployment using digital twins. By reducing risk and uncertainty, the blueprint helps prevent costly downtime — a critical factor for AI factory operators.

For a 1 gigawatt-scale AI factory, every day of downtime can cost over $100 million. By solving complexity upfront and enabling siloed teams in IT, mechanical, electrical, power and network engineering to work in parallel, the blueprint accelerates deployment and ensures operational resilience.

Reference Architectures

NVIDIA Enterprise Reference Architectures and NVIDIA Cloud Partner Reference Architectures provide a roadmap for partners designing and deploying AI factories. They help enterprises and cloud providers build scalable, high-performance and secure AI infrastructure based on NVIDIA-Certified Systems with the NVIDIA AI software stack and partner ecosystem.

NVIDIA full-stack AI factories, built on NVIDIA reference architectures. (*NVIS = NVIDIA infrastructure specialists)

Every layer of the AI factory stack relies on efficient computing to meet growing AI demands. NVIDIA accelerated computing serves as the foundation across the stack, delivering the highest performance per watt to ensure AI factories operate at peak energy efficiency. With energy-efficient architecture and liquid cooling, businesses can scale AI while keeping energy costs in check.

Flexible Deployment for Every Enterprise

With NVIDIA’s full-stack technologies, enterprises can easily build and deploy AI factories, aligning with customers’ preferred IT consumption models and operational needs.

Some organizations opt for on-premises AI factories to maintain full control over data and performance, while others use cloud-based solutions for scalability and flexibility. Many also turn to their trusted global systems partners for pre-integrated solutions that accelerate deployment.

The NVIDIA DGX GB300 is the highest-performing, largest-scale AI factory infrastructure available for enterprises, built for the era of AI reasoning.

On Premises

NVIDIA DGX SuperPOD is a turnkey AI factory infrastructure solution that provides accelerated infrastructure with scalable performance for the most demanding AI training and inference workloads. It features a design-optimized combination of AI compute, network fabric, storage and NVIDIA Mission Control software, empowering enterprises to get AI factories up and running in weeks instead of months — and with best-in-class uptime, resiliency and utilization.

AI factory solutions are also offered through the NVIDIA global ecosystem of enterprise technology partners with NVIDIA-Certified Systems. They deliver leading hardware and software technology, combined with data center systems expertise and liquid-cooling innovations, to help enterprises de-risk their AI endeavors and accelerate the return on investment of their AI factory implementations.

These global systems partners are providing full-stack solutions based on NVIDIA reference architectures — integrated with NVIDIA accelerated computing, high-performance networking and AI software — to help customers successfully deploy AI factories and manufacture intelligence at scale.

In the Cloud

For enterprises looking to use a cloud-based solution for their AI factory, NVIDIA DGX Cloud delivers a unified platform on leading clouds to build, customize and deploy AI applications. Every layer of DGX Cloud is optimized and fully managed by NVIDIA, offering the best of NVIDIA AI in the cloud. It features enterprise-grade software and large-scale, contiguous clusters on leading cloud providers, delivering scalable compute resources ideal for even the most demanding AI training workloads.

DGX Cloud also includes a dynamic and scalable serverless inference platform that delivers high throughput for AI tokens across hybrid and multi-cloud environments, significantly reducing infrastructure complexity and operational overhead.

By providing a full-stack platform that integrates hardware, software, ecosystem partners and reference architectures, NVIDIA is helping enterprises build AI factories that are cost effective, scalable and high-performing — equipping them to meet the next industrial revolution.

Learn more about NVIDIA AI factories.

See notice regarding software product information.

Read More