NVIDIA Hopper GPU Architecture Accelerates Dynamic Programming Up to 40x Using New DPX Instructions

The NVIDIA Hopper GPU architecture, unveiled today at GTC, will accelerate dynamic programming — a problem-solving technique used in algorithms for genomics, quantum computing, route optimization and more — by up to 40x with new DPX instructions.

An instruction set built into NVIDIA H100 GPUs, DPX will help developers write code to achieve speedups on dynamic programming algorithms in multiple industries, boosting workflows for disease diagnosis, quantum simulation, graph analytics and routing optimizations.

What Is Dynamic Programming? 

Developed in the 1950s, dynamic programming is a popular technique for solving complex problems by combining two key ideas: recursion and memoization.

Recursion involves breaking a problem down into simpler sub-problems, saving time and computational effort. Memoization stores the answers to these sub-problems — which are reused several times while solving the main problem — so they don't have to be recomputed each time they're needed, increasing efficiency.
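
To make those two ideas concrete, here is a minimal Python sketch (purely illustrative, unrelated to DPX or GPU code) that computes Fibonacci numbers recursively, with and without memoization:

```python
from functools import lru_cache

# Naive recursion: the same sub-problems are solved over and over,
# so the number of calls grows exponentially with n.
def fib_naive(n: int) -> int:
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

# Memoized recursion: each sub-problem's answer is cached the first time
# it is computed, so every value of n is solved exactly once.
@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(90))  # returns instantly; the naive version would not finish
```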

DPX instructions accelerate dynamic programming algorithms by up to 7x on an NVIDIA H100 GPU, compared with NVIDIA Ampere architecture-based GPUs. In a node with four NVIDIA H100 GPUs, that acceleration can be boosted even further.

Use Cases Span Healthcare, Robotics, Quantum Computing, Data Science

Dynamic programming is commonly used in many optimization, data processing and omics algorithms. To date, most developers have run these kinds of algorithms on CPUs or FPGAs — but they can unlock dramatic speedups using DPX instructions on NVIDIA Hopper GPUs.

Omics 

Omics covers a range of biological fields including genomics (focused on DNA), proteomics (focused on proteins) and transcriptomics (focused on RNA). These fields, which inform the critical work of disease research and drug discovery, all rely on algorithmic analyses that can be sped up with DPX instructions.

For example, the Smith-Waterman and Needleman-Wunsch dynamic programming algorithms are used for DNA sequence alignment, protein classification and protein folding. Both use a scoring method to measure how well genetic sequences from different samples align.
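
As a rough illustration of the recurrence these aligners evaluate, below is a minimal, unoptimized Python sketch of Smith-Waterman local-alignment scoring. The match, mismatch and gap scores are arbitrary assumptions, and real aligners (including the GPU-accelerated paths that benefit from DPX) are far more elaborate:

```python
# Toy Smith-Waterman local-alignment scoring; scores are illustrative only.
def smith_waterman_score(a: str, b: str, match=3, mismatch=-3, gap=-2) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP scoring matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: extend a match/mismatch, open a gap in either
            # sequence, or restart a new local alignment (score 0).
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("GATTACA", "GCATGCU"))
```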

Smith-Waterman produces highly accurate results, but takes more compute resources and time than other alignment methods. By using DPX instructions on a node with four NVIDIA H100 GPUs, scientists can speed this process 35x to achieve real-time processing, where the work of base calling and alignment takes place at the same rate as DNA sequencing.

This acceleration will help democratize genomic analysis in hospitals worldwide, bringing scientists closer to providing patients with personalized medicine.

Route Optimization

Finding the optimal route for multiple moving pieces is essential for autonomous robots moving through a dynamic warehouse, or even a sender transferring data to multiple receivers in a computer network.

To tackle this optimization problem, developers rely on Floyd-Warshall, a dynamic programming algorithm used to find the shortest distances between all pairs of destinations in a map or graph. In a server with four NVIDIA H100 GPUs, Floyd-Warshall acceleration is boosted 40x compared to a traditional dual-socket CPU-only server.
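
For reference, the recurrence at the heart of Floyd-Warshall is compact enough to sketch in a few lines of Python (a plain CPU illustration, not the optimized GPU implementation behind the 40x figure):

```python
import math

def floyd_warshall(dist):
    """dist[i][j] is the direct edge weight from i to j (math.inf if none)."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):            # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d                      # shortest distance between every pair of nodes

INF = math.inf
graph = [
    [0,   5,   INF, 10 ],
    [INF, 0,   3,   INF],
    [INF, INF, 0,   1  ],
    [INF, INF, INF, 0  ],
]
print(floyd_warshall(graph))
```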

Paired with the NVIDIA cuOpt AI logistics software, this speedup in routing optimization could be used for real-time applications in factories, autonomous vehicles, or mapping and routing algorithms in abstract graphs.

Quantum Simulation

Countless other dynamic programming algorithms could be accelerated on NVIDIA H100 GPUs with DPX instructions. One promising field is quantum computing, where dynamic programming is used in tensor optimization algorithms for quantum simulation. DPX instructions could help developers accelerate the process of identifying the right tensor contraction order.
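
As a loose analogy for what such a search looks like, NumPy's einsum path optimizer exhaustively compares contraction orders for a small tensor network; the shapes below are arbitrary assumptions chosen for illustration:

```python
import numpy as np

a = np.random.rand(8, 16)
b = np.random.rand(16, 32)
c = np.random.rand(32, 8)

# 'optimal' exhaustively searches pairwise contraction orders and reports
# the FLOP savings versus a naive left-to-right contraction.
path, info = np.einsum_path("ij,jk,kl->il", a, b, c, optimize="optimal")
print(path)
print(info)
```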

SQL Query Optimization

Another potential application is in data science. Data scientists working with the SQL query language often need to perform several “join” operations on a set of tables. Dynamic programming helps find an optimal order for these joins, often saving orders of magnitude in execution time and thus speeding up SQL queries.
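
A toy sketch of that idea, written in Python rather than inside a real query optimizer, enumerates join orders with dynamic programming over subsets of tables (Selinger-style). The table sizes, selectivity and cost model below are made-up assumptions:

```python
from itertools import combinations

tables = {"orders": 1_000_000, "customers": 50_000, "products": 10_000}
selectivity = 0.001  # assumed fraction of row pairs surviving each join

def best_join_order(tables, selectivity):
    names = list(tables)
    # best[subset] = (estimated cost, estimated result rows, join order)
    best = {frozenset([t]): (0.0, float(tables[t]), [t]) for t in names}
    for size in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, size)):
            for last in subset:                      # table joined in last
                rest = subset - {last}
                cost, rows, order = best[rest]
                new_rows = rows * tables[last] * selectivity
                new_cost = cost + rows * tables[last]  # toy cost: pairs examined
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, new_rows, order + [last])
    return best[frozenset(names)]

print(best_join_order(tables, selectivity))  # cheapest estimated join order
```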

Learn more about the NVIDIA Hopper GPU architecture. Register free for GTC, running online through March 24. And watch the replay of NVIDIA founder and CEO Jensen Huang’s keynote address.

H100 Transformer Engine Supercharges AI Training, Delivering Up to 6x Higher Performance Without Losing Accuracy

The largest AI models can require months to train on today’s computing platforms. That’s too slow for businesses.

AI, high performance computing and data analytics are growing in complexity, with some models, such as large language models, reaching trillions of parameters.

The NVIDIA Hopper architecture is built from the ground up to accelerate these next-generation AI workloads with massive compute power and fast memory to handle growing networks and datasets.

Transformer Engine, part of the new Hopper architecture, will significantly speed up AI performance and capabilities, and help train large models within days or hours.

Training AI Models With Transformer Engine

Transformer models are the backbone of language models used widely today, such as BERT and GPT-3. Initially developed for natural language processing use cases, they're increasingly being applied to computer vision, drug discovery and more.

However, model size continues to increase exponentially, now reaching trillions of parameters. This is causing training times to stretch into months due to huge amounts of computation, which is impractical for business needs.

Transformer Engine combines 16-bit floating-point precision, a newly added 8-bit floating-point data format and advanced software algorithms to further speed up AI performance and capabilities.

AI training relies on floating-point numbers, which have fractional components, like 3.14. Introduced with the NVIDIA Ampere architecture, the TensorFloat32 (TF32) floating-point format is now the default 32-bit format in the TensorFlow and PyTorch frameworks.

Most AI floating-point math is done using 16-bit “half” precision (FP16), 32-bit “single” precision (FP32) and, for specialized operations, 64-bit “double” precision (FP64). By reducing the math to just eight bits, Transformer Engine makes it possible to train larger networks faster.
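
For context, these are the kinds of precision knobs developers already reach for in today's frameworks; the PyTorch sketch below is illustrative only, and defaults differ across framework versions:

```python
import torch

# Allow FP32 matmuls and convolutions to run on TF32 Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torch.nn.Linear(1024, 1024).cuda()      # weights stored in FP32
x = torch.randn(32, 1024, device="cuda")

# Mixed precision: the matmul inside the layer executes in FP16 under autocast.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)
print(y.dtype)  # torch.float16
```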

When coupled with other new features in the Hopper architecture — like the NVLink Switch system, which provides a direct high-speed interconnect between nodes — H100-accelerated server clusters will be able to train enormous networks that were nearly impossible to train at the speed necessary for enterprises.

Diving Deeper Into Transformer Engine

Transformer Engine uses software and custom NVIDIA Hopper Tensor Core technology designed to accelerate training for models built from the prevalent AI model building block, the transformer. These Tensor Cores can apply mixed FP8 and FP16 formats to dramatically accelerate AI calculations for transformers. Tensor Core operations in FP8 have twice the throughput of 16-bit operations.

The challenge for models is to intelligently manage the precision to maintain accuracy while gaining the performance of smaller, faster numerical formats. Transformer Engine enables this with custom, NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations and automatically handle re-casting and scaling between these precisions in each layer.

Transformer Engine uses per-layer statistical analysis to determine the optimal precision (FP16 or FP8) for each layer of a model, achieving the best performance while preserving model accuracy.
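
One ingredient of that kind of precision management is keeping each tensor's values inside the narrow format's representable range. The NumPy sketch below illustrates the general idea of amax-based scaling before a cast; it is a conceptual simplification, not Transformer Engine's actual implementation, and it assumes 448 as the FP8 E4M3 maximum magnitude:

```python
import numpy as np

E4M3_MAX = 448.0  # assumed largest representable magnitude in FP8 E4M3

def scale_for_fp8(tensor):
    amax = np.max(np.abs(tensor))               # per-tensor absolute maximum
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    scaled = np.clip(tensor * scale, -E4M3_MAX, E4M3_MAX)
    return scaled, scale                        # keep the scale to undo it later

activations = np.random.randn(1024, 1024).astype(np.float32) * 7.5
scaled, scale = scale_for_fp8(activations)
print(scale, float(np.abs(scaled).max()))
```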

The NVIDIA Hopper architecture also advances Tensor Cores to their fourth generation, tripling the floating-point operations per second compared with the prior generation across TF32, FP64, FP16 and INT8 precisions. Combined with Transformer Engine and fourth-generation NVLink, Hopper Tensor Cores enable an order-of-magnitude speedup for HPC and AI workloads.

Revving Up Transformer Engine

Much of the cutting-edge work in AI revolves around large language models like Megatron 530B. The chart below shows the growth of model size in recent years, a trend that is widely expected to continue. Many researchers are already working on trillion-plus parameter models for natural language understanding and other applications, showing an unrelenting appetite for AI compute power.

Growth in natural language understanding models continues at a vigorous pace. Source: Microsoft.

Meeting the demand of these growing models requires a combination of computational power and a ton of high-speed memory. The NVIDIA H100 Tensor Core GPU delivers on both fronts, with the speedups made possible by Transformer Engine to take AI training to the next level.

When combined, these innovations deliver higher throughput and a 9x reduction in time to train, from seven days to just 20 hours:

The NVIDIA H100 Tensor Core GPU delivers up to 9x more training throughput compared with the previous generation, making it possible to train large models in a reasonable amount of time.

Transformer Engine can also be used for inference without any data format conversions. Previously, INT8 was the go-to precision for optimal inference performance. However, it requires that the trained networks be converted to INT8 as part of the optimization process, something the NVIDIA TensorRT inference optimizer makes easy.

Using models trained with FP8 will allow developers to skip this conversion step altogether and do inference operations using that same precision. And like INT8-formatted networks, deployments using Transformer Engine can run in a much smaller memory footprint.

On Megatron 530B, NVIDIA H100 inference per-GPU throughput is up to 30x higher than NVIDIA A100, with a 1-second response latency, showcasing it as the optimal platform for AI deployments:

Transformer Engine will also increase inference throughput by as much as 30x for low-latency applications.

To learn more about the NVIDIA H100 GPU and the Hopper architecture, watch the GTC 2022 keynote from Jensen Huang. Register for GTC 2022 for free to attend sessions with NVIDIA and industry leaders.

NVIDIA Maxine Reinvents Real-Time Communication With AI

Everyone wants to be heard. And with more people than ever in video calls or live streaming from their home offices, rich audio free from echo hiccups and background noises like barking dogs is key to better-sounding online experiences.

NVIDIA Maxine offers GPU-accelerated, AI-enabled software development kits to help developers build scalable, low-latency audio and video effects pipelines that improve call quality and user experience.

Today, NVIDIA announced at GTC that Maxine is adding acoustic echo cancellation and AI-based upsampling for better sound quality.

Acoustic Echo Cancellation eliminates acoustic echo from the audio stream in real time, preserving speech quality even during double-talk. With AI-based technology, Maxine achieves more effective echo cancellation than that achieved via traditional digital signal processing algorithms.

Audio Super Resolution improves the quality of a low-bandwidth audio signal by using AI to restore the energy lost in higher frequency bands. Maxine Audio Super Resolution supports upsampling audio from 8 kHz (narrowband) to 16 kHz (wideband), from 16 kHz to 48 kHz (ultra-wideband) and from 8 kHz to 48 kHz. Lower sampling rates such as 8 kHz often result in muffled voices, emphasize artifacts such as sibilance and make speech difficult to understand.
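
For comparison, conventional (non-AI) resampling only interpolates the samples it is given and cannot reconstruct high-frequency content that was never captured. The SciPy sketch below shows that baseline on a synthetic tone; it is a point of reference, not how Maxine's learned model works:

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 8_000, 48_000                     # narrowband in, 48 kHz out
t = np.arange(0, 1.0, 1.0 / sr_in)
narrowband = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

# Polyphase resampling to 48 kHz: more samples, but no new spectral content.
wideband = resample_poly(narrowband, up=sr_out, down=sr_in)
print(len(narrowband), len(wideband))             # 8000 -> 48000 samples
```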

Modern film and television studios often use a 48 kHz (or higher) sampling rate for recording audio in order to maintain the fidelity of the original signal and preserve clarity. Audio Super Resolution can help restore the fidelity of old audio recordings derived from magnetic tapes or other low-bandwidth media.

Bridging the Sound Gap 

Most modern telecommunication takes place using wideband or ultra-wideband audio. Since NVIDIA Audio Super Resolution can upsample and restore narrowband audio in real time, the technology can effectively bridge the quality gap between traditional copper-wire phone lines and modern VoIP-based wideband communication systems.

Real-time communication — whether for conference calls, call centers or live streaming of all kinds — is taking a big leap forward with Maxine.

Since its initial release, Maxine has been adopted by many of the world’s leading providers for video communications, content creation and live streaming.

The worldwide market for video conferencing is expected to increase to nearly $13 billion in 2028, up from about $6.3 billion in 2021, according to Fortune Business Insights.

WFH: A Way of Life 

The move to work from home, or WFH, has become an accepted norm across companies, and organizations are adapting to the new expectations.

Analyst firm Gartner estimates that only a quarter of enterprise meetings will be in person in 2024, down from 60 percent before the pandemic.

Virtual collaboration in the U.S. has played an important role as people have taken on hybrid and remote positions in the past two years amid the pandemic.

But as organizations seek to maintain company culture and workplace experience, the stakes have risen for higher-quality media interactivity.

Solving the Cocktail Party Problem    

But sometimes work and home life collide. As a result, meetings are often filled with background noises from kids, construction work outside or emergency vehicle sirens, causing brief interruptions in the flow of conference calls.

Maxine helps solve an age-old audio issue known as the cocktail party problem. With AI, it can filter out unwanted background noises, allowing users to be better heard, whether they’re in a home office or on the road.

The Maxine GPU-accelerated platform provides an end-to-end deep learning pipeline that integrates with customizable state-of-the-art models, enabling high-quality features with a standard microphone and camera.

Sound Like Your Best Self

In addition to being impacted by background noise, audio quality in virtual activities can sometimes sound thin, missing low- and mid-level frequencies, or even be barely audible.

Maxine enables upsampling of audio in real time so that voices sound fuller, deeper and more audible.

Logitech: Better Audio for Headsets and Blue Yeti Microphones

Logitech, a leading maker of peripherals, is implementing Maxine for better interactions with its popular headsets and microphones.

Tapping into AI libraries, Logitech has integrated Maxine directly inside G Hub audio drivers to enhance communications with its devices without the need for additional software. Maxine takes advantage of the powerful Tensor Cores in NVIDIA RTX GPUs so consumers can enjoy real-time processing of their mic signal.

Logitech is now utilizing Maxine’s state-of-the-art denoising in its G Hub software. That has allowed it to remove echoes and background noises — such as fans, as well as keyboard and mouse clicks — that can distract from video conferences or live-streaming sessions.

“NVIDIA Maxine makes it fast and easy for Logitech G gamers to clean up their mic signal and eliminate unwanted background noises in a single click,” said Ujesh Desai, GM of Logitech G. “You can even use G HUB to test your mic signal to make sure you have your Maxine settings dialed in.”

Tencent Cloud Boosts Content Creators

Tencent Cloud is helping content creators with their productions by offering technology from NVIDIA Maxine that makes it quick and easy to add creative backgrounds.

NVIDIA Maxine’s AI Green Screen feature enables users to create a more immersive presence with high-quality foreground and background separation — without the need for a traditional green screen. Once the real background is separated, it can easily be replaced with a virtual background, or blurred to create a depth-of-field effect. Tencent Cloud is offering this new capability as a software-as-a-service package for content creators.

“NVIDIA Maxine’s AI Green Screen technology helps content creators with their productions by enabling more immersive, high-quality experiences without the need for specialized equipment and lighting,” said Vulture Li, director of the product center at Tencent Cloud’s audio and video platform.

Making Virtual Experiences Better

NVIDIA Maxine provides state-of-the-art real-time AI audio, video and augmented reality features that can be built into customizable, end-to-end deep learning pipelines.

The AI-powered SDKs from Maxine help developers create applications that include audio and image denoising, super resolution, gaze correction, 3D body pose estimation and translation features.

Maxine also enables real-time voice-to-text translation for a growing number of languages. At GTC, NVIDIA demonstrated Maxine translating between English, French, German and Spanish.

These effects will allow millions of people to enjoy high-quality and engaging live-streaming video across any device.

Join us at GTC this week to learn more about Maxine.

Getting People Talking: Microsoft Improves AI Quality and Efficiency of Translator Using NVIDIA Triton

When your software can evoke tears of joy, you spread the cheer.

So, Translator, a Microsoft Azure Cognitive Service, is applying some of the world’s largest AI models to help more people communicate.

“There are so many cool stories,” said Vishal Chowdhary, development manager for Translator.

Like the five-day sprint to add Haitian Creole to power apps that helped aid workers after Haiti suffered a 7.0 earthquake in 2010. Or the grandparents who choked up in their first session using the software to speak live with remote grandkids who spoke a language they did not understand.

An Ambitious Goal

“Our vision is to eliminate barriers in all languages and modalities with this same API that’s already being used by thousands of developers,” said Chowdhary.

With some 7,000 languages spoken worldwide, it’s an ambitious goal.

So, the team turned to a powerful, and complex, tool — a mixture of experts (MoE) AI approach.

It’s a state-of-the-art member of the class of transformer models driving rapid advances in natural language processing. And with 5 billion parameters, it’s 80x larger than the biggest model the team has in production for natural-language processing.
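
To show the structure in miniature, here is a conceptual PyTorch sketch of a top-1 gated mixture-of-experts feed-forward layer. It is only meant to convey why MoE models demand so much compute and memory; it is not Microsoft's production architecture, and all dimensions are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-1 gated mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token is routed to its single best expert.
        gate_scores = F.softmax(self.gate(x), dim=-1)
        expert_idx = gate_scores.argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask]) * gate_scores[mask, i].unsqueeze(-1)
        return out

layer = MoELayer(d_model=1024, d_ff=4096, num_experts=8)
tokens = torch.randn(16, 1024)
print(layer(tokens).shape)  # torch.Size([16, 1024])
```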

MoE models are so compute-intensive, it’s hard to find anyone who’s put them into production. In an initial test, CPU-based servers couldn’t meet the team’s requirement to use them to translate a document in one second.

A 27x Speedup

Then the team ran the test on accelerated systems with NVIDIA Triton Inference Server, part of the NVIDIA AI Enterprise 2.0 platform announced this week at GTC.

“Using NVIDIA GPUs and Triton we could do it, and do it efficiently,” said Chowdhary.

In fact, the team was able to achieve up to a 27x speedup over non-optimized GPU runtimes.

“We were able to build one model to perform different language understanding tasks — like summarizing, text generation and translation — instead of having to develop separate models for each task,” said Hanny Hassan Awadalla, a principal researcher at Microsoft who supervised the tests.

How Triton Helped

Microsoft’s models break down a big job like translating a stack of documents into many small tasks of translating hundreds of sentences. Triton’s dynamic batching feature pools these many requests to make best use of a GPU’s muscle.
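
From the client's point of view, each sentence can be sent as its own small request, while the server-side `dynamic_batching` setting in the model's config.pbtxt lets Triton group whatever arrives together into one GPU batch. The Python sketch below is hypothetical: the model name, tensor names and shapes are assumptions, not Microsoft's actual service:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def translate_sentence(token_ids: np.ndarray) -> np.ndarray:
    # Hypothetical model and tensor names, for illustration only.
    inp = httpclient.InferInput("INPUT_IDS", list(token_ids.shape), "INT32")
    inp.set_data_from_numpy(token_ids)
    result = client.infer(model_name="translator_moe", inputs=[inp])
    return result.as_numpy("OUTPUT_IDS")

# Each call carries one sentence; Triton pools concurrent calls into a batch.
sentence = np.random.randint(0, 32000, size=(1, 64), dtype=np.int32)
print(translate_sentence(sentence).shape)
```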

The team praised Triton’s ability to run any model in any mode using CPUs, GPUs or other accelerators.

“It seems very well thought out with all the features I wanted for my scenario, like something I would have developed for myself,” said Chowdhary, whose team has been developing large-scale distributed systems for more than a decade.

Under the hood, two software components were key to Triton’s success. NVIDIA extended FasterTransformer — a software layer that handles inference computations — to support MoE models. CUTLASS, an NVIDIA math library, helped implement the models efficiently.

Proven Prototype in Four Weeks

Though the tests were complex, the team worked with NVIDIA engineers to get an end-to-end prototype with Triton up and running in less than a month.

“That’s a really impressive timeline to make a shippable product — I really appreciate that,” said Awadalla.

And though it was the team’s first experience with Triton, “we used it to ship the MoE models by rearchitecting our runtime environment without a lot of effort, and now I hope it becomes part of our long-term host system,” Chowdhary added.

Taking the Next Steps

The accelerated service will arrive in judicious steps, initially for document translation in a few major languages.

“Eventually, we want our customers to get the goodness of these new models transparently in all our scenarios,” said Chowdhary.

The work is part of a broad initiative at Microsoft. It aims to fuel advances across a wide sweep of its products such as Office and Teams, as well as those of its developers and customers from small one-app companies to Fortune 500 enterprises.

Paving the way, Awadalla’s team published research in September on training MoE models with up to 200 billion parameters on NVIDIA A100 Tensor Core GPUs. Since then, the team has accelerated that work another 8x by using 80GB versions of the A100 GPU on models with more than 300 billion parameters.

“The models will need to get larger and larger to better represent more languages, especially for ones where we don’t have a lot of data,” Awadalla said.
