Splitwise improves GPU usage by splitting LLM inference phases

The recent surge in large language model (LLM) use is causing significant challenges for cloud providers, requiring them to deploy more GPUs at an unprecedented rate. However, the capacity to provision the power needed to run these GPUs is limited, and with demand for computation surpassing supply, it is not uncommon for user queries to be denied. Therefore, any approach to making the existing infrastructure more efficient—enabling it to serve more queries faster under the same power budget—can have very tangible benefits to both cloud providers and users.

One aspect of LLM inference that currently limits efficient use of resources is that it has two distinct phases with different characteristics: the prompt phase and the token-generation phase. During the prompt phase, LLMs process all user input, or prompts, in parallel, efficiently utilizing GPU compute. However, during the token-generation phase, LLMs generate each output token sequentially and are limited by GPU memory bandwidth. Even when employing state-of-the-art batching mechanisms, the discrepancy between these two phases results in low overall hardware utilization, leading to much higher costs when offering LLMs to users. Figure 1 illustrates the differences between these two phases.

Figure 1. An example of the generative LLM inference process and the two phases associated with it. The initial prompt is “Which is better, pizza or burger?” and the model responds with “Pizza is better.” The prompt phase processes all input tokens in parallel to generate the first output token; it is compute intensive and accounts for a smaller share of the end-to-end latency. The token-generation phase produces the remaining tokens serially; it is memory intensive and tends to account for the majority of the end-to-end latency.
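
To make the two phases concrete, here is a minimal, hypothetical sketch of a generation loop using the Hugging Face transformers API. It is not Splitwise code; GPT-2 stands in for a production LLM, and the prompt matches Figure 1.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")                 # stand-in for a production LLM
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

inputs = tok("Which is better, pizza or burger?", return_tensors="pt").to(device)

with torch.no_grad():
    # Prompt (prefill) phase: all input tokens are processed in one parallel,
    # compute-intensive forward pass that also builds the KV-cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

    # Token-generation (decode) phase: one token per forward pass, reusing and
    # growing the KV-cache; each step is bound by GPU memory bandwidth.
    generated = [next_token]
    for _ in range(8):
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tok.decode(torch.cat(generated, dim=-1)[0]))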

Splitting the phases with Splitwise

At Azure Research – Systems, we tackled this by creating Splitwise, a technique designed to optimally utilize available hardware by separating the prompt computation and token-generation phases onto separate machines. This approach is underpinned by the insight that prompt processing and token-generation are distinct in their computational, memory, and power requirements. By separating these two phases, we can enhance hardware utilization during both phases. Our paper, “Splitwise: Efficient Generative LLM Inference Using Phase Splitting,” details our methods for developing and testing this technique, including an exploration of how different types of GPUs perform during each phase.   

To create a sustainable approach for GPU provisioning, we used Splitwise to design GPU clusters with three primary objectives: maximizing throughput, minimizing costs, and reducing power. In addition to separating the two LLM inference phases into two distinct machine pools, we include a third machine pool for mixed batching across the prompt and token phases, sized dynamically based on real-time computational demands. Lastly, we transferred the state context (i.e., KV-cache in the LLM transformer attention layers) from the prompt to the token machines over InfiniBand without any perceivable latency impact to the user. This high-level system architecture is illustrated in Figure 2.
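
As an illustration of the scheduling idea, the sketch below routes a query through a prompt pool, hands off its state, and finishes on a token pool. It is a hypothetical toy, not the Splitwise implementation; the pool names, the load metric, and the helper functions are illustrative stand-ins.

from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    active_queries: int = 0

# Separate pools for the two phases, plus a mixed pool resized with runtime demand.
PROMPT_POOL = [Machine("prompt-0"), Machine("prompt-1")]   # compute-optimized prefill machines
TOKEN_POOL = [Machine("token-0"), Machine("token-1")]      # memory-bandwidth-bound decode machines
MIXED_POOL: list[Machine] = []                             # mixed batching across both phases

def least_loaded(*pools):
    return min((m for pool in pools for m in pool), key=lambda m: m.active_queries)

def run_prompt_phase(machine, prompt):
    machine.active_queries += 1
    return {"prompt": prompt, "kv_cache": f"<kv for {prompt!r}>"}   # stand-in for KV tensors

def transfer_kv_cache(state, src, dst):
    # In Splitwise this transfer happens over InfiniBand with no perceivable latency impact.
    src.active_queries -= 1
    dst.active_queries += 1

def run_token_phase(machine, state):
    machine.active_queries -= 1
    return "Pizza is better."                               # stand-in for autoregressive decoding

def serve(prompt):
    prompt_machine = least_loaded(PROMPT_POOL, MIXED_POOL)
    state = run_prompt_phase(prompt_machine, prompt)
    token_machine = least_loaded(TOKEN_POOL, MIXED_POOL)
    transfer_kv_cache(state, prompt_machine, token_machine)
    return run_token_phase(token_machine, state)

print(serve("Which is better, pizza or burger?"))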

Figure 2. A high-level diagram of the Splitwise architecture. Machines maintained in different pools are dedicated to the two distinct LLM inference phases, and the mixed pool grows and shrinks according to runtime demand. The KV-cache, which encompasses the state of the query after the prompt phase, is transferred from the prompt machines to the token machines over InfiniBand with very low latency.

Tests show Splitwise maximizes throughput while lowering costs

To evaluate its performance, we used Splitwise to design clusters with different types of GPUs, including NVIDIA DGX-A100 and DGX-H100, while optimizing cost, power, and throughput under specific latency service level agreements (SLAs) for each query. Table 1 shows the machine types we used for each cluster design. Our application of Splitwise encompassed two use cases: code and conversation using the Llama-2-70B and BLOOM-176B LLMs.

Table 1. Details for the prompt and token machines we used for each cluster design, evaluated with Splitwise. All values are normalized to a baseline of DGX-A100. DGX-H100 capped is a system with all GPUs power-capped to half the maximum power.

Our findings demonstrate that Splitwise successfully achieves our three goals of maximizing throughput, minimizing costs, and reducing power. Through our evaluation, we observed that the Splitwise cluster design can maximize throughput at the same cost compared with an A100 baseline cluster. Moreover, Splitwise delivers much higher throughput while operating within the same provisioned power constraints as the baseline cluster. Figure 3 shows that compared with Baseline-H100, we can achieve 1.4x higher throughput at 20 percent lower cost. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

Figure 3. Results from baseline and Splitwise clusters optimized for throughput, all with the same power constraints. Splitwise-HH requires the fewest machines, Splitwise-HHcap provides the best throughput, and Splitwise-AA is the cheapest option.

Looking forward

Splitwise marks a leap toward efficient, high-performance LLM deployments. By separating the prompt and token phases, we can unlock new potential in GPU use. Looking forward, we at Microsoft Azure envision tailored machine pools driving maximum throughput, reduced costs, and power efficiency, and we will continue to focus on making LLM inference efficient and sustainable.

Our approach is now part of vLLM and can also be implemented with other frameworks.

Acknowledgements

This work was done in collaboration with our intern, Pratyush Patel from the University of Washington. We also appreciate the help and guidance of Suriya Kalivardhan, Gopi Kumar, and Chetan Bansal.

A New Year of Gaming: GeForce NOW Adds More Than 20 New Titles in January

Celebrate the new year with more cloud gaming. Experience the power and performance of the cloud with more than 20 new games to be added to GeForce NOW in January.

Start with five games available this week, including The Finals from Embark Studios.

And tune in to the NVIDIA Special Address at CES on Monday, Jan. 8, at 8 a.m. PT for the latest on gaming, AI-related news and more.

It’s the Final Countdown

Fight for glory, fame and survival.

Fight for fame on the world’s biggest stage with Embark Studios’ The Finals. The free-to-play, multiplayer, first-person shooter is newly supported in the cloud this week, with RTX ON delivering the most cinematic lighting and visuals for GeForce NOW Ultimate and Priority members.

In The Finals, take part in a deadly TV game show that pits contestants against each other as they battle for a huge reward. Fight alongside teammates in virtual arenas that can be altered, exploited and even destroyed. Manipulate the environment as a weapon itself and use it to take down other players. Drive viewers wild with thrilling combat and flair, using tricks like crashing a wrecking ball into opponents.

Harness the power of the cloud and reach the finals anywhere with the ability to stream across devices. Ultimate members can fight for glory with the advantage of longer gaming sessions, the highest frame rates, ray tracing and ultra-low latency.

In With the New

Flame on! ‘Enshrouded’ launches in the cloud Jan. 24.

In Enshrouded, become Flameborn, the last ember of hope of a dying race. Awaken, survive the terror of a corrupting fog and reclaim the lost beauty of the kingdom. Venture into a vast world, vanquish punishing bosses, build grand halls and forge a path in this co-op survival action role-playing game for up to 16 players, launching in the cloud Jan. 24.

Don’t miss the five newly supported games joining the GeForce NOW library this week:

  • Dishonored, for Hungary, Czech Republic and Poland (Steam)
  • The Finals (Steam)
  • Redmatch 2 (Steam)
  • Scorn (Xbox, available for PC Game Pass)
  • Sniper Elite 5 (Xbox, available for PC Game Pass)

And here’s what’s coming throughout the rest of January:

  • War Hospital (New release on Steam, Jan. 11)
  • Prince of Persia: The Lost Crown (New release on Ubisoft, Jan. 18)
  • Turnip Boy Robs a Bank (New release on Steam and Xbox, available for PC Game Pass, Jan. 18)
  • Stargate: Timekeepers (New release on Steam, Jan. 23)
  • Enshrouded (New release on Steam, Jan. 24)
  • Bang-On Balls: Chronicles (Steam)
  • Firefighting Simulator – The Squad (Steam)
  • Jected – Rivals (Steam)
  • The Legend of Nayuta: Boundless Trails (Steam)
  • RAILGRADE (Steam)
  • Redmatch 2 (Steam)
  • Shadow Tactics: Blades of the Shogun (Steam)
  • Shadow Tactics: Blades of the Shogun – Aiko’s Choice (Steam)
  • Solasta: Crown of the Magister (Steam)
  • Survivalist: Invisible Strain (Steam)
  • Witch It (Steam)
  • Wobbly Life (Steam)

Doubled in December

In addition to the 70 games announced last month, 34 more games joined GeForce NOW in December:

  • Avatar: Frontiers of Pandora (New release on Ubisoft, Dec. 7)
  • Goat Simulator 3 (New release on Xbox, available on PC Game Pass, Dec. 7)
  • LEGO Fortnite (New release on Epic Games Store, Dec. 7)
  • Against the Storm (New release on Xbox, available on PC Game Pass, Dec. 8)
  • Rocket Racing (New release on Epic Games Store, Dec. 8)
  • Fortnite Festival (New release on Epic Games Store, Dec. 9)
  • Stellaris Nexus (New release on Steam, Dec. 12)
  • Tin Hearts (New release on Xbox, available on PC Game Pass, Dec. 12)
  • Amazing Cultivation Simulator (Xbox, available on the Microsoft Store)
  • Blasphemous 2 (Epic Games Store)
  • Century: Age of Ashes (Xbox, available on the Microsoft Store)
  • Chorus (Xbox, available on the Microsoft Store)
  • Dungeons 4 (Xbox, available on PC Game Pass)
  • Edge of Eternity (Xbox, available on the Microsoft Store)
  • Farming Simulator 17 (Xbox, available on the Microsoft Store)
  • Farming Simulator 22 (Xbox, available on PC Game Pass)
  • Flashback 2 (Steam)
  • Forza Horizon 4 (Steam)
  • Forza Horizon 5 (Steam, Xbox and available on PC Game Pass)
  • Hollow Knight (Xbox, available on PC Game Pass)
  • The Front (Steam)
  • Martha Is Dead (Xbox, available on the Microsoft Store)
  • Minecraft Dungeons (Steam, Xbox and available on PC Game Pass)
  • Monster Hunter: World (Steam)
  • Neon Abyss (Xbox, available on PC Game Pass)
  • Ori and the Will of the Wisps (Steam, Xbox and available on PC Game Pass)
  • Ori and the Blind Forest: Definitive Edition (Steam)
  • Raji: An Ancient Epic (Xbox, available on the Microsoft Store)
  • Remnant: From the Ashes (Xbox, available on PC Game Pass)
  • Remnant II (Xbox, available on PC Game Pass)
  • Richman 10 (Xbox, available on the Microsoft Store)
  • Spirittea (Xbox, available on PC Game Pass)
  • Surgeon Simulator 2 (Xbox, available on the Microsoft Store)
  • Sword and Fairy 7 (Xbox, available on PC Game Pass)

Terminator: Dark Fate – Defiance didn’t make it in December due to a change in its release date. Stay tuned to GFN Thursday for updates.

What are you planning to play this weekend? Let us know on X or in the comments below.

Accelerating Generative AI Part III: Diffusion, Fast

This post is the third part of a multi-series blog focused on how to accelerate generative AI models with pure, native PyTorch. We are excited to share a breadth of newly released PyTorch performance features alongside practical examples to see how far we can push PyTorch native performance. In part one, we showed how to accelerate Segment Anything over 8x using only pure, native PyTorch. In part two, we showed how to accelerate Llama-7B by almost 10x using only native PyTorch optimizations. In this blog, we’ll focus on speeding up text-to-image diffusion models by up to 3x.

We will leverage an array of optimizations including:

  • Running with the bfloat16 precision
  • scaled_dot_product_attention (SDPA)
  • torch.compile
  • Combining q,k,v projections for attention computation
  • Dynamic int8 quantization

We will primarily focus on Stable Diffusion XL (SDXL), demonstrating a latency improvement of 3x. These techniques are PyTorch-native, which means you don’t have to rely on any third-party libraries or any C++ code to take advantage of them.

Enabling these optimizations with the 🤗Diffusers library takes just a few lines of code. If you’re already feeling excited and cannot wait to jump to the code, check out the accompanying repository here: https://github.com/huggingface/diffusion-fast.

SDXL Chart

(The discussed techniques are not SDXL-specific and can be used to speed up other text-to-image diffusion systems, as shown later.)

Setup

We will demonstrate the optimizations and their respective speed-up gains using the 🤗Diffusers library. Apart from that, we will make use of the following PyTorch-native libraries and environments:

  • Torch nightly (to benefit from the fastest kernels for efficient attention; 2.3.0.dev20231218+cu121)
  • 🤗 PEFT (version: 0.7.1)
  • torchao (commit SHA: 54bcd5a10d0abbe7b0c045052029257099f83fd9)
  • CUDA 12.1

For an easier reproduction environment, you can also refer to this Dockerfile. The benchmarking numbers presented in this post come from a 400W 80GB A100 GPU (with its clock rate set to its maximum capacity).

Since we use an A100 GPU (Ampere architecture) here, we can specify torch.set_float32_matmul_precision("high") to benefit from the TF32 precision format.
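
In code, that is a single call near the top of the script (shown here for completeness):

import torch

# Allow TensorFloat-32 (TF32) matmul kernels on Ampere-class GPUs such as the A100.
torch.set_float32_matmul_precision("high")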

Run inference using a reduced precision

Running SDXL in Diffusers just takes a few lines of code:

from diffusers import StableDiffusionXLPipeline

## Load the pipeline in full-precision and place its model components on CUDA.
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")

## Use the vanilla attention processors (no efficient SDPA kernels) to set a baseline.
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

But this isn’t very practical, as it takes 7.36 seconds to generate a single image with 30 steps. This is our baseline, which we will try to optimize one step at a time.

SDXL Chart

Here, we’re running the pipeline in full precision. We can immediately cut down the inference time by using a reduced precision such as bfloat16. Moreover, modern GPUs come with dedicated cores that accelerate computation in reduced precision. To run the computations of the pipeline in the bfloat16 precision, we just need to specify the data type while initializing the pipeline:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
	"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

## Use the vanilla attention processors (no efficient SDPA kernels).
pipe.unet.set_default_attn_processor()
pipe.vae.set_default_attn_processor()
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

SDXL Chart

By using a reduced precision, we’re able to cut down the inference latency from 7.36 seconds to 4.63 seconds.

Some notes on the use of bfloat16

  • Using a reduced numerical precision (such as float16, bfloat16) to run inference doesn’t affect the generation quality but significantly improves latency.
  • The benefits of using the bfloat16 numerical precision as compared to float16 are hardware-dependent. Modern generations of GPUs tend to favor bfloat16.
  • Furthermore, in our experiments, we found bfloat16 to be much more resilient when used with quantization in comparison to float16.

(We later ran the experiments in float16 and found that recent versions of torchao do not incur numerical problems from float16.)

Use SDPA for performing attention computations

By default, Diffusers uses scaled_dot_product_attention (SDPA) for performing attention-related computations when using PyTorch 2. SDPA provides faster and more efficient kernels to run intensive attention-related operations. To run the pipeline with SDPA, we simply don’t set any attention processor, like so:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
	"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, num_inference_steps=30).images[0]

SDPA gives a nice boost from 4.63 seconds to 3.31 seconds.

SDXL Chart

Compiling the UNet and VAE

We can ask PyTorch to perform some low-level optimizations (such as operator fusion and launching faster kernels with CUDA graphs) by using torch.compile. For the StableDiffusionXLPipeline, we compile the denoiser (UNet) and the VAE:

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

## Compile the UNet and VAE.
pipe.unet = torch.compile(pipe.unet, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

## First call to `pipe` will be slow, subsequent ones will be faster.
image = pipe(prompt, num_inference_steps=30).images[0]

Using SDPA attention and compiling both the UNet and VAE reduces the latency from 3.31 seconds to 2.54 seconds.

SDXL Chart

Notes on torch.compile

torch.compile offers different backends and modes. As we’re aiming for maximum inference speed, we opt for the inductor backend with the “max-autotune” mode. “max-autotune” uses CUDA graphs and optimizes the compilation graph specifically for latency. Using CUDA graphs greatly reduces the overhead of launching GPU operations by replaying multiple GPU operations through a single CPU operation.

Specifying fullgraph=True ensures that there are no graph breaks in the underlying model, letting torch.compile reach its full potential. In our case, the following compiler flags also needed to be set explicitly:

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

For the full list of compiler flags, refer to this file.

We also change the memory layout of the UNet and the VAE to “channels_last” when compiling them to ensure maximum speed:

pipe.unet.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

In the next section, we’ll show how to improve the latency even further.

Additional optimizations

No graph breaks during torch.compile

Ensuring that the underlying model/method can be fully compiled (torch.compile with fullgraph=True) is crucial for performance. This means having no graph breaks. We did this for the UNet and VAE by changing how we access the returned variables.
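
The original diff is not reproduced here; as a hedged illustration of the general pattern, the toy below shows how keeping everything as tensor operations (rather than pulling values back to Python, for example with .item()) lets torch.compile capture one full graph when fullgraph=True is requested.

import torch

def without_break(x):
    y = torch.relu(x)
    # Branching on a tensor value via torch.where keeps the whole function traceable;
    # calling y.sum().item() inside an `if` would instead force a graph break, and
    # fullgraph=True would raise an error at compile time.
    return torch.where(y.sum() > 0, y + 1, y)

compiled = torch.compile(without_break, fullgraph=True)
print(compiled(torch.randn(4)))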

Getting rid of GPU syncs after compilation

During the iterative reverse diffusion process, we call step() on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside step(), the sigmas variable is indexed. If the sigmas array is placed on the GPU, indexing causes a communication sync between the CPU and GPU. This adds latency, and it becomes more evident once the denoiser has been compiled.

But if the sigmas array always stays on the CPU (refer to this line), this sync doesn’t take place and latency improves. In general, any CPU <-> GPU communication sync should be avoided or kept to a bare minimum, as it can impact inference latency.
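
As a hedged illustration of the kind of sync in question (not the scheduler’s actual code), the snippet below contrasts the two placements of a small per-step lookup table; it assumes a CUDA device is available.

import torch

num_steps = 30

# If sigmas lives on the GPU, turning an indexed element into a Python number each
# step forces a device-to-host copy, i.e. a CPU <-> GPU synchronization point.
sigmas_gpu = torch.linspace(1.0, 0.0, num_steps, device="cuda")
sigma = float(sigmas_gpu[10])   # implicit synchronization

# Keeping the tiny sigmas array on the CPU avoids that sync entirely.
sigmas_cpu = torch.linspace(1.0, 0.0, num_steps)
sigma = float(sigmas_cpu[10])   # no device synchronization involved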

Using combined projections for attention ops

Both the UNet and the VAE used in SDXL make use of Transformer-like blocks. A Transformer block consists of attention blocks and feed-forward blocks.

In an attention block, the input is projected into three sub-spaces using three different projection matrices – Q, K, and V. In the naive implementation, these projections are performed separately on the input. But we can horizontally combine the projection matrices into a single matrix and perform the projection in one shot. This increases the size of the matmuls of the input projections and improves the impact of quantization (to be discussed next).

Enabling this kind of computation in Diffusers just takes a single line of code:

pipe.fuse_qkv_projections()

This will make the attention operations for both the UNet and the VAE take advantage of the combined projections. For the cross-attention layers, we only combine the key and value matrices. To learn more, you can refer to the official documentation here. It’s worth noting that we leverage PyTorch’s scaled_dot_product_attention here internally.
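
Conceptually, the fusion replaces three smaller matmuls with one wider matmul whose output is split back into Q, K, and V. The sketch below illustrates the idea on a plain nn.Linear; it is not the Diffusers implementation.

import torch
import torch.nn as nn

d_model = 64
x = torch.randn(2, 16, d_model)

# Naive: three separate projections, i.e. three smaller matmuls per attention block.
q_proj, k_proj, v_proj = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
q, k, v = q_proj(x), k_proj(x), v_proj(x)

# Fused: one wider projection whose output is chunked into Q, K, V. The single,
# larger matmul keeps the GPU busier and gives int8 quantization more work per kernel.
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
with torch.no_grad():   # reuse the same weights so both paths give identical results
    qkv_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
q2, k2, v2 = qkv_proj(x).chunk(3, dim=-1)

assert torch.allclose(q, q2, atol=1e-5) and torch.allclose(v, v2, atol=1e-5)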

These additional techniques improved the inference latency from 2.54 seconds to 2.52 seconds.

SDXL Chart

Dynamic int8 quantization

We selectively apply dynamic int8 quantization to both the UNet and the VAE. This is because quantization adds conversion overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization); if the matmuls are too small, these techniques may degrade performance.

Through experimentation, we found that certain linear layers in the UNet and the VAE don’t benefit from dynamic int8 quantization. You can check out the full code for filtering those layers here (referred to as dynamic_quant_filter_fn below).

We leverage the ultra-lightweight pure PyTorch library torchao to use its user-friendly APIs for quantization:

from torchao.quantization import apply_dynamic_quant

apply_dynamic_quant(pipe.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipe.vae, dynamic_quant_filter_fn)
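
The exact filter from the post is not reproduced here; as a hedged sketch, a filter of this kind might keep dynamic quantization only on sufficiently large linear layers. The size thresholds and the assumption that the callback receives the candidate module are illustrative.

import torch

def dynamic_quant_filter_fn(module, *args):
    # Hypothetical filter: only quantize nn.Linear layers whose matmuls are large
    # enough for int8 speedups to outweigh the quantize/dequantize overhead.
    return (
        isinstance(module, torch.nn.Linear)
        and module.in_features >= 512
        and module.out_features >= 512
    )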

Since this quantization support is limited to linear layers, we also turn suitable pointwise convolution layers into linear layers to maximize the benefit. We also specify the following compiler flags when using this option:

torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

To prevent any numerical issues stemming from quantization, we run everything in the bfloat16 format.

Applying quantization this way improved the latency from 2.52 seconds to 2.43 seconds.

SDXL Chart

Resources

We welcome you to check out the accompanying diffusion-fast codebase (linked above) to reproduce these numbers and extend the techniques to other text-to-image diffusion systems as well:

Improvements in other pipelines

We applied these techniques to other pipelines to test the generality of our approach. Below are our findings:

SSD-1B

SSD-1B Chart

Stable Diffusion v1-5

Stable Diffusion v1-5 chart

PixArt-alpha/PixArt-XL-2-1024-MS

It’s worth noting that PixArt-Alpha uses a Transformer-based architecture as its denoiser for the reverse diffusion process instead of a UNet.

PixArt-alpha/PixArt-XL-2-1024-MS chart

Note that for Stable Diffusion v1-5 and PixArt-Alpha, we didn’t explore the best shape combination criteria for applying dynamic int8 quantization. It might be possible to get better numbers with a better combination.

Collectively, the methods we presented offer substantial speedup over the baseline without degradation in the generation quality. Furthermore, we believe that these methods should complement other optimization methods popular in the community (such as DeepCache, Stable Fast, etc.).

Conclusion and next steps

In this post, we presented a basket of simple yet effective techniques that can help improve the inference latency of text-to-image Diffusion models in pure PyTorch. In summary:

  • Using a reduced precision to perform our computations
  • Scaled dot-product attention for running the attention blocks efficiently
  • torch.compile with “max-autotune” to optimize for latency
  • Combining the different projections together for computing attention
  • Dynamic int8 quantization

We believe there’s a lot to be explored in terms of how we apply quantization to a text-to-image diffusion system. We didn’t exhaustively explore which layers in the UNet and the VAE tend to benefit from dynamic quantization. There might be opportunities to further speed things up with a better combination of the layers being targeted for quantization.

We kept the text encoders of SDXL untouched other than just running them in bfloat16. Optimizing them might also lead to improvements in latency.

Acknowledgements

Thanks to Ollin Boer Bohan whose VAE was used throughout the benchmarking process as it is numerically more stable under reduced numerical precisions.

Thanks to Hugo Larcher from Hugging Face for helping with infrastructure.

By Jove, It’s No Myth: NVIDIA Triton Speeds Inference on Oracle Cloud

An avid cyclist, Thomas Park knows the value of having lots of gears to maintain a smooth, fast ride.

So, when the software architect designed an AI inference platform to serve predictions for Oracle Cloud Infrastructure’s (OCI) Vision AI service, he picked NVIDIA Triton Inference Server. That’s because it can shift up, down or sideways to handle virtually any AI model, framework, hardware and operating mode — quickly and efficiently.

“The NVIDIA AI inference platform gives our worldwide cloud services customers tremendous flexibility in how they build and run their AI applications,” said Park, a Zurich-based computer engineer and competitive cycler who’s worked for four of the world’s largest cloud services providers.

Specifically, Triton reduced OCI’s total cost of ownership by 10%, increased prediction throughput up to 76% and reduced inference latency up to 51% for OCI Vision and Document Understanding Service models that were migrated to Triton. The services run globally across more than 45 regional data centers, according to an Oracle blog Park and a colleague posted earlier this year.

Computer Vision Accelerates Insights

Customers rely on OCI Vision AI for a wide variety of object detection and image classification jobs. For instance, a U.S.-based transit agency uses it to automatically detect the number of vehicle axles passing by to calculate and bill bridge tolls, sparing busy truckers wait time at toll booths.

OCI AI is also available in Oracle NetSuite, a set of business applications used by more than 37,000 organizations worldwide. It’s used, for example, to automate invoice recognition.

Thanks to Park’s work, Triton is now being adopted across other OCI services, too.

A Triton-Aware Data Service

“We’ve built a Triton-aware AI platform for our customers,” said Tzvi Keisar, a director of product management for OCI’s Data Science service, which handles machine learning for Oracle’s internal and external users.

“If customers want to use Triton, we’ll save them time by automatically doing the configuration work for them in the background, launching a Triton-powered inference endpoint for them,” said Keisar.

His team also plans to make it even easier for its other users to embrace the fast, flexible inference server. Triton is included in NVIDIA AI Enterprise, a platform that provides full security and support businesses need — and it’s available on OCI Marketplace.

A Massive SaaS Platform

OCI’s Data Science service is the machine learning platform for both NetSuite and Oracle Fusion software-as-a-service applications.

“These platforms are massive, with tens of thousands of customers who are also building their work on top of our service,” he said.

It’s a wide swath of mainly enterprise users in manufacturing, retail, transportation and other industries. They’re building and using AI models of nearly every shape and size.

Inference was one of the group’s first services, and Triton came on the team’s radar not long after its launch.

A Best-in-Class Inference Framework

“We saw Triton pick up in popularity as a best-in-class serving framework, so we started experimenting with it,” Keisar said. “We saw really good performance, and it closed a gap in our existing offerings, especially on multi-model inference — it’s the most versatile and advanced inferencing framework out there.”

Launched on OCI in March, Triton has already attracted the attention of many internal teams at Oracle hoping to use it for inference jobs that require serving predictions from multiple AI models running concurrently.

“Triton has a very good track record and performance on multiple models deployed on a single endpoint,” he said.

Accelerating the Future

Looking ahead, Keisar’s team is evaluating NVIDIA TensorRT-LLM software to supercharge inference on the complex large language models (LLMs) that have captured the imagination of many users.

An active blogger, Keisar detailed in his latest article creative quantization techniques for running a Llama 2 LLM with a whopping 70 billion parameters on NVIDIA A10 Tensor Core GPUs.

“Even down to four bits, the quality of model outputs is still quite good,” he said. “I can’t explain all the math, but we found a good balance, and I haven’t seen anyone else do this yet.”

After announcements this fall that Oracle is deploying the latest NVIDIA H100 Tensor Core GPUs, H200 GPUs, L40S GPUs and Grace Hopper Superchips, it’s just the start of many accelerated efforts to come.

Ring in the New Year With 3D Artist Blendeered’s Futuristic, NVIDIA-Themed City

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

A new year means new creative opportunities and new In the NVIDIA Studio beats.

Each week, featured In the NVIDIA Studio artists share their unique artwork and content creation processes, as well as how NVIDIA Studio — a platform comprising fine-tuned hardware and efficient software powered by NVIDIA and GeForce RTX GPUs — elevates their work.

This week’s featured 3D content creator, Pedro Soares, aka Blendeered, created a stunning NVIDIA-themed New Year’s celebration animation.

Plus, tune in to the NVIDIA Special Address at CES on Monday, Jan. 8, at 8 a.m. PT for the latest on content creation, AI-related news and more.

Blendeered’s Beguiling Renders 

Blendeered’s latest animation was inspired by NVIDIA and the power of technological innovation.

“The scene, New Year’s, showcases a futuristic city with all the buildings funneling to the center point,” said Blendeered. “This evokes the feeling of accelerating toward a brighter future, which is what NVIDIA is all about: taking tech to the next level, every day.”

The Portugal-based creator first conceptualized the scene.

“The futuristic city needed to give a sense of speed,” he said. He accomplished this using highlighted arrows, neon-green street lines and light beams on digital screens across various high-rise buildings.

Blendeered then built individual assets in Blender version 3.6 — by far his favorite 3D app, in case his stage name didn’t give it away.

“Blender captivates users with its friendly interface, speed, power, real-time rendering and vibrant community — and the best part is that it’s free!” he shared.

His NVIDIA GeForce RTX 4090 GPU unlocked Blender Cycles’ RTX-accelerated OptiX ray tracing in the viewport for interactive, photorealistic modeling sequences. He also textured and applied color schemes to his 3D assets in Blender.

Next, the artist began lighting the scene using the new Panorama feature in the NVIDIA Canvas app. He tapped OptiX denoising to preview final render results in real time, speeding his workflow.

Available for GeForce RTX GPU owners and free to download, NVIDIA Canvas uses AI to turn brushstrokes into realistic landscape images for quick creation of backgrounds and concept exploration.

NVIDIA Canvas can be used to generate full, spherical HDRi backdrops and brainstorm ideas.

Blendeered generated a full, spherical, high-dynamic-range imaging (HDRi) backdrop for his computer-generated imagery workflows — based on AI rendering — with a few simple sketches. He then exported it as an HDR file and imported it into Blender. YouTuber Timo Helmers demonstrates this type of workflow in the video tutorial below.

“NVIDIA Canvas is amazing software that allowed me to make an HDRi backdrop that fit my scene perfectly,” said Blendeered.

From there, he completed the animation process before exporting the files to Blackmagic Design’s DaVinci Resolve version 18.

Completing animation work in Blender.

DaVinci Resolve is a key app for GPU acceleration and AI-powered workflows. All of the AI effects in DaVinci Resolve version 18.6 run twice as fast on NVIDIA RTX GPUs with acceleration using the NVIDIA TensorRT software development kit.

Blendeered’s post-production work included GPU-accelerated color grading, video editing and color scopes. And NVDEC, the GPU’s hardware engine for video decoding, enabled faster, smoother playback and scrubbing of high-resolution video files.

Post-production work in DaVinci Resolve.

For the final export, the dual eighth-generation NVENC encoders built into the artist’s GeForce RTX 4090 GPU generated video files twice as fast. For Blendeered, NVIDIA GPUs are the clear choice for content creation because they provide “power, efficiency and reliability.”

When asked to give advice for aspiring artists, Blendeered encouraged beginners to “embrace consistent practice, learn from failures, seek feedback and stay true to the inner artistic voice.”

3D artist Pedro Soares, aka Blendeered.

Check out Blendeered’s portfolio on Instagram.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 
