Announcing recipients of the AFMR Minority Serving Institutions grant

Today, as part of the Accelerate Foundation Models Research (AFMR) initiative, Microsoft is delighted to announce the 10 inaugural grant recipients through the AFMR Minority Serving Institutions grant program.

This pilot focuses on supporting historically black colleges and universities (HBCUs) and Hispanic-serving institutions (HSIs), providing them with access to the state-of-the-art tools necessary to conduct meaningful and impactful research on AI. In addition to a grant award, recipients are provided with credits they can use to access leading-edge models hosted by Microsoft Azure through Azure AI Studio.

AFMR is a collaborative effort with the academic research community and part of the Microsoft pledge to support the President’s Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. This program is dedicated to exploring foundation models with the aim of achieving three main objectives: aligning AI with human values and goals to enhance safety, responsibility, and transparency; improving human interactions through research at the intersection of technology and society, fostering trust and creativity; and accelerating scientific discovery in natural sciences through innovative knowledge and data approaches.

Operating as a global network and resource platform, AFMR fosters interdisciplinary collaboration among researchers from various fields, addressing significant technical and societal challenges.

AFMR Minority Serving Institutions grant recipients:

Creativity and Design

  • Cesar Torres, AI-Enhanced Bricolage: Augmenting Creative Decision Making in Creative Practices, The University of Texas at Arlington
  • Nikita Soni, Exploring Interaction Design Space for Child-AI Visual Storytelling Creativity Support Tools, University of Illinois, Chicago

Cognition and Societal Benefits

  • Muhammed Idris, Advancing Culturally Congruent Cancer Communication with Foundation Models, Morehouse School of Medicine
  • Hajar Homayouni, Federated Privacy-Preserving Multimodal Generator for Synthetic Medical Data Generation, San Diego State University
  • Junzhou Huang, Developing Foundation Models for Survival Prediction from Pathological Image and Biomedical Text, The University of Texas at Arlington
  • Pedram Rooshenas, LLM-Powered Teaching Assistant for Computer Science Courses, University of Illinois, Chicago

Benchmarks, Evaluation and Measurement

  • Kinnis Gosha, Evaluation of Hybrid AI Systems for Workforce Performance Evaluation, Morehouse College

Model Advancement

  • Davi Valerio de Queiroz Rodrigues, Advancing Foundation Models Towards Physical AI: Bridging the Gap Between Natural Language and Wireless Sensing, The University of Texas at El Paso

Multimodal and Crossmodal Learning

  • Amr Magdy, Visual Knowledge Distillation on the Edge: An Application on Enhancing Self-Inference Cameras for Near-Real-Time Operations, University of California, Riverside

Responsible AI

  • Xiang (Susie) Zhao, Accelerating Environmental Justice Analysis using Foundational Models for Intelligent Disaster Recovery and City Planning, Alabama A&M University

By driving deeper collaboration across disciplines, institutions, and sectors, Microsoft aims to unlock the full potential of AI across a greater breadth of research pursuits, application domains, and societal contexts.

Read More

Behold the ‘Magic Valley’: Brandon Tieh’s Stunning Scene Showcases Peak Creativity, Powered by RTX and AI

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

This week’s featured In the NVIDIA Studio 3D artist Brandon Tieh puts his artistic talents on full display with his whimsical scene Magic Valley.

An array of colors — from bright crimson to hushed blues and lush greens — help set the mood of the vivid scene, which took inspiration from Tieh’s love for video games, anime and manga.

His diverse pieces — along with the works of fellow Studio artists like Christian Dimitrov, Vera Dementchouk and Eddie Mendoza — take audiences to fantastical getaways in the latest Studio Standouts video.

To fuel creative work, the new GeForce RTX 4080 SUPER is available starting tomorrow in a limited Founders Edition design and as custom boards from partners, starting at $999. It’s equipped with more cores than the GeForce RTX 4080 and includes the world’s fastest GDDR6X video memory at 23 Gbps. In 3D apps like Blender, it can run up to 70% faster than previous generations. In the video editing app Blackmagic Design’s DaVinci Resolve, AI effects run over 30% faster than on the GeForce RTX 3080 Ti.

Get creative and AI superpowers with the GeForce RTX 4080 SUPER.

The RTX 4080 SUPER also brings great frame rates and stunning 4K resolution in fully ray-traced games, including Alan Wake 2, Cyberpunk 2077: Phantom Liberty and Portal with RTX. Discover what RTX 40 SUPER Series graphics cards and systems are available.

Stepping Into Magical Worlds

Tieh’s scene began as a sketch of what he envisioned to be a giant door in a grassy field.

“A very broad and abstract thought — but that’s sort of the point of fantasy,” he explained.

Tieh specializes in building impressive worlds.

To bring it to life, he began by gathering assets such as rocks and grass from Quixel Megascans, the Unreal Engine Marketplace and ArtStation Marketplace.

The door required extra customization, so he modeled one from scratch, first sculpting it in ZBrush before importing it into Adobe Substance 3D Painter for a quick texture pass. Tieh’s GeForce RTX graphics card used RTX-accelerated light and ambient occlusion to bake assets in mere seconds.

Lighting will heavily influence the final look, so only a basic texture pass was needed in Substance 3D Painter.

Next, Tieh tackled modeling the obelisks and pillars in Blender, where RTX-accelerated OptiX ray tracing in the viewport ensured highly interactive, photorealistic rendering.

Modeling and UV layouts of pillars and obelisks.

He then unwrapped his 3D assets onto a 2D plane, a key process called UV unwrapping, so that textures could be applied to the models’ surfaces to enhance realism.

Tieh’s customized door in Unreal Engine.

With the textured assets in place, Tieh next built the scene in Unreal Engine. His technique involves focusing on the big shapes by looking at the scene in a smaller thumbnail view, then flipping the canvas to refresh his perspective — very similar to the approach concept artists use. He adjusted lighting by deploying the same technique.

Adjusting atmospheric light in the “Magic Valley” scene.

“I’ve used NVIDIA GPUs all my life — they’re super reliable and high performing without any issues.”  — Brandon Tieh

Unreal Engine users can tap NVIDIA DLSS Super Resolution to increase the interactivity of the viewport by using AI to upscale frames rendered at lower resolution, and enhance image quality using DLSS Ray Reconstruction.

Fog is another major component of the scene. “Fog is generally darker and more opaque in the background and becomes lighter and more translucent as it approaches the foreground,” said Tieh. He primarily used fog cards in Unreal Engine’s free Blueprints Visual Scripting system to add a paint-like effect.

The majority of lighting was artificial, meaning Tieh had to use a significant number of individually placed light sources, but “it looks very believable if executed well,” he explained.

The menu on the right lists the roughly 100 light actor sources Tieh used.

From there, Tieh exported final renders with ease and speed thanks to his RTX GPU.

 

“There’s no secret tricks or fancy engine features,” said Tieh. “It’s all about sticking to the basics and fundamentals of art, as well as trusting your own artistic eye.”

3D artist Brandon Tieh.

Check out Tieh’s impressive portfolio on ArtStation.

Follow NVIDIA Studio on Instagram, X and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

New Library Updates in PyTorch 2.2

Summary

We are bringing a number of improvements to the current PyTorch libraries, alongside the PyTorch 2.2 release. These updates demonstrate our focus on developing common and extensible APIs across all domains to make it easier for our community to build ecosystem projects on PyTorch.

Latest Stable Library Versions (Full List)*

  • TorchArrow 0.1.0
  • TorchAudio 2.2.0
  • TorchData 0.7.1
  • TorchRec 0.6.0
  • TorchServe 0.9.0
  • TorchText 0.17.0
  • TorchVision 0.17
  • TorchX 0.7.0
  • PyTorch on XLA Devices 2.1

*To see prior versions or (unstable) nightlies, click on versions in the top left menu above ‘Search Docs’.

TorchRL

Feature: TorchRL’s Offline RL Data Hub

TorchRL now provides one of the largest dataset hubs for offline RL and imitation learning, and it all comes under a single data format (TED, for TorchRL Episode Data format). This makes it possible to easily swap between different sources within a single training loop. It is also now possible to easily combine datasets from different sources through the ReplayBufferEnsemble class. The data processing is fully customizable. Sources include simulated tasks (Minari, D4RL, VD4RL), robotic datasets (Roboset, OpenX Embodied dataset) and gaming (GenDGRL/ProcGen, Atari/DQN). Check these out in the documentation.
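
As a rough illustration, the sketch below streams one of the hub’s D4RL datasets as a replay buffer. The class name follows the TorchRL documentation, but treat the exact dataset identifier and argument names as assumptions and check the docs for your installed version.

from torchrl.data.datasets.d4rl import D4RLExperienceReplay

# Download (on first use) and wrap the dataset as a replay buffer in the TED format.
data = D4RLExperienceReplay("hopper-medium-v2", batch_size=256)

batch = data.sample()   # a TensorDict of 256 transitions
print(batch)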

Aside from these changes, replay buffers can now be dumped to disk using the .dumps() method, which serializes them through the TensorDict API; this is faster, safer and more efficient than using torch.save.

Finally, replay buffers can now be read and written from separate processes on the same machine without any extra code needed from the user!

TorchRL2Gym environment API

To facilitate TorchRL’s integration in existing codebases while keeping all the features of TorchRL’s environment API (execution on device, batched operations, transforms and so on), we provide a TorchRL-to-gym API that allows users to register any environment they want in gym or gymnasium. This can be used in turn to make TorchRL a universal lib-to-gym converter that works across stateful (e.g., dm_control) and stateless (Brax, Jumanji) environments. The feature is thoroughly detailed in the doc. The info_dict reading API has also been improved.

Environment speedups

We added the option of executing environments on a different device than the one used to deliver data in ParallelEnv. We also sped up the GymLikeEnv class to a level that now makes it competitive with gym itself.

Scaling objectives

The most popular objectives for RLHF and training at scale (PPO and A2C) are now compatible with FSDP and DDP models!

TensorDict

Feature: MemoryMappedTensor to replace MemmapTensor

We provide a much more efficient mmap backend for TensorDict: MemoryMappedTensor, which directly subclasses torch.Tensor. It comes with a set of construction utilities such as from_tensor, empty and more. MemoryMappedTensor is now much safer and faster than its predecessor, MemmapTensor, and the library remains fully compatible with the previous class to facilitate the transition.
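
A minimal sketch of the construction utilities mentioned above; the import path and method names follow that description, so treat the exact signatures as assumptions.

import torch
from tensordict import MemoryMappedTensor

t = torch.randn(1024, 1024)
mmap_t = MemoryMappedTensor.from_tensor(t)   # copies the data into a memory-mapped file

# The result is a torch.Tensor subclass, so it can be used in ordinary tensor code.
print(isinstance(mmap_t, torch.Tensor), mmap_t.shape)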

We also introduce a new set of multithreaded serialization methods that make tensordict serialization highly competitive with torch.save, with serialization and deserialization speeds for LLMs more than 3x faster than with torch.save.

Feature: Non-tensor data within TensorDict

It is now possible to carry non-tensor data through the NonTensorData tensorclass. This makes it possible to build tensordicts with metadata. The memmap API is fully compatible with these values, allowing users to seamlessly serialize and deserialize such objects. To store non-tensor data in a tensordict, simply assign it using the __setitem__ method.
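
A minimal sketch of that assignment path; the key name and value are illustrative only.

import torch
from tensordict import TensorDict

td = TensorDict({"obs": torch.zeros(3)}, batch_size=[])
td["metadata"] = "collected-on-robot-7"   # non-tensor value stored via __setitem__
print(td["metadata"])                     # retrieve the stored value back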

Efficiency improvements

The runtime of several methods has been improved, such as unbind, split, map and even TensorDict instantiation. Check our benchmarks!

TorchRec/fbgemm_gpu

VBE

TorchRec now natively supports VBE (variable batched embeddings) within the EmbeddingBagCollection module. This allows variable batch size per feature, unlocking sparse input data deduplication, which can greatly speed up embedding lookup and all-to-all time. To enable, simply initialize KeyedJaggedTensor with stride_per_key_per_rank and inverse_indices fields, which specify batch size per feature and inverse indices to reindex the embedding output respectively.

In addition to the TorchRec library changes, fbgemm_gpu has added the support for variable batch size per feature in TBE. VBE is enabled on split TBE training for both weighted and unweighted cases. To use VBE, please make sure to use the latest fbgemm_gpu version.

Embedding offloading

This technique refers to using CUDA UVM to cache ‘hot’ embeddings (i.e. store embedding tables on host memory with cache on HBM memory), and prefetching the cache. Embedding offloading allows running a larger model with fewer GPUs, while maintaining competitive performance. Use the prefetching pipeline (PrefetchTrainPipelineSparseDist) and pass in per-table cache load factor and the prefetch_pipeline flag through constraints in the planner to use this feature.

Fbgemm_gpu has introduced UVM cache pipeline prefetching in v0.5.0 for TBE performance speedup. This allows cache-insert to be executed in parallel with TBE forward/backward. To enable this feature, please be sure to use the latest fbgemm_gpu version.

Trec.shard/shard_modules

These APIs replace embedding submodules with their sharded variants. The shard API applies to an individual embedding module, while the shard_modules API replaces all embedding modules and won’t touch other non-embedding submodules.

Embedding sharding follows similar behavior to the prior TorchRec DistributedModuleParallel behavior, except the ShardedModules have been made composable, meaning the modules are backed by TableBatchedEmbeddingSlices which are views into the underlying TBE (including .grad). This means that fused parameters are now returned with named_parameters(), including in DistributedModuleParallel.

TorchVision

The V2 transforms are now stable!

The torchvision.transforms.v2 namespace was still in BETA stage until now. It is now stable! Whether you’re new to Torchvision transforms, or you’re already experienced with them, we encourage you to start with Getting started with transforms v2 in order to learn more about what can be done with the new v2 transforms.
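
As a quick illustration, here is a minimal v2 pipeline (assuming TorchVision 0.17 or later); the specific transform choices are arbitrary.

import torch
from torchvision.transforms import v2

transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),   # convert to float and rescale to [0, 1]
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)  # stand-in for a decoded image
out = transforms(img)
print(out.shape, out.dtype)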

Browse our main docs for general information and performance tips. The available transforms and functionals are listed in the API reference. Additional information and tutorials can also be found in our example gallery, e.g. Transforms v2: End-to-end object detection/segmentation example or How to write your own v2 transforms.

Towards torch.compile() support

We are progressively adding support for torch.compile() to torchvision interfaces, reducing graph breaks and allowing dynamic shape.

The torchvision ops (nms, [ps_]roi_align, [ps_]roi_pool and deform_conv2d) are now compatible with torch.compile and dynamic shapes.
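
For example, a compiled call to nms might look like the following sketch; the boxes and scores are dummy values.

import torch
from torchvision.ops import nms

compiled_nms = torch.compile(nms, dynamic=True)  # allow varying numbers of boxes

boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                      [1.0, 1.0, 11.0, 11.0],
                      [20.0, 20.0, 30.0, 30.0]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = compiled_nms(boxes, scores, 0.5)  # indices of boxes kept at IoU threshold 0.5
print(keep)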

On the transforms side, the majority of low-level kernels (like resize_image() or crop_image()) should compile properly without graph breaks and with dynamic shapes. We are still addressing the remaining edge-cases, moving up towards full functional support and classes, and you should expect more progress on that front with the next release.

Read More

PyTorch 2.2: FlashAttention-v2 integration, AOTInductor

We are excited to announce the release of PyTorch® 2.2 (release note)! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.

Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

  • scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
  • PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-python server-side.
  • torch.distributed supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
  • PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
  • A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
  • Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Stable: (none)
Beta: FlashAttention-2 Integration, AOTInductor, TORCH_LOGS, device_mesh, Optimizer compilation
Performance Improvements: Inductor optimizations, aarch64 optimizations

*To see a full list of public feature submissions click here.

Beta Features

[Beta] FlashAttention-2 support in torch.nn.functional.scaled_dot_product_attention

torch.nn.functional.scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups (compared to the previous version) and reaching ~50-73% of theoretical maximum FLOPs/s on A100 GPUs.
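
A minimal sketch of calling SDPA; the shapes and dtypes are chosen so the FlashAttention-2 backend can be selected on a supported GPU.

import torch
import torch.nn.functional as F

# (batch, num_heads, sequence_length, head_dim), half precision on GPU
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])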

More information is available on FlashAttention-2 in this paper.

For a tutorial on how to use SDPA please see this tutorial.

[Beta] AOTInductor: ahead-of-time compilation and deployment for torch.export-ed programs

AOTInductor is an extension of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts can be deployed in non-Python environments, which are frequently employed for inference on the server-side. Note that AOTInductor supports the same backends as Inductor, including CUDA, ROCm, and CPU.

For more information please see the AOTInductor tutorial.

[Beta] Fine-grained configurable logging via TORCH_LOGS

PyTorch now ships a standardized, configurable logging mechanism that can be used to analyze the status of various subsystems such as compilation and distributed operations.

Logs can be enabled via the TORCH_LOGS environment variable. For example, to set the log level of TorchDynamo to logging.ERROR and the log level of TorchInductor to logging.DEBUG, pass TORCH_LOGS="-dynamo,+inductor" to PyTorch.
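
For instance, launching a script as TORCH_LOGS="-dynamo,+inductor" python your_script.py (the filename is a placeholder) would emit Inductor debug output when the compiled function in this minimal sketch is first called.

import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

# Inductor's DEBUG-level logs appear during this first call, which triggers compilation.
f(torch.randn(8))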

For more information, please see the logging documentation and tutorial.

[Beta] torch.distributed.device_mesh

PyTorch 2.2 introduces a new abstraction for representing the ProcessGroups involved in distributed parallelisms called torch.distributed.device_mesh. This abstraction allows users to represent inter-node and intra-node process groups via an N-dimensional array where, for example, one dimension can represent data parallelism in FSDP while another could represent tensor parallelism within FSDP.
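
A minimal sketch, assuming eight GPUs and a launch via torchrun --nproc-per-node 8.

from torch.distributed.device_mesh import init_device_mesh

# Build a 2D mesh: 2 groups of 4 ranks each, e.g., data parallel x tensor parallel.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
print(mesh_2d)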

For more information, see the device_mesh tutorial.

[Beta] Improvements to torch.compile-ing Optimizers

A number of improvements have been made to torch.compile-ing Optimizers including less overhead and support for cuda graphs.
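
A minimal sketch of compiling an optimizer step; the model and hyperparameters are placeholders, a CUDA device is assumed, and mode="reduce-overhead" opts into CUDA graph capture.

import torch

model = torch.nn.Linear(1024, 1024, device="cuda")
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

@torch.compile(mode="reduce-overhead")
def opt_step():
    opt.step()

for _ in range(3):
    loss = model(torch.randn(64, 1024, device="cuda")).sum()
    loss.backward()
    opt_step()
    opt.zero_grad()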

More technical details of the improvements are available on dev-discuss, and a recipe for torch.compile-ing optimizers is available here.

Performance Improvements

Inductor Performance Optimizations

A number of performance optimizations have been added to TorchInductor including horizontal fusion support for torch.concat, improved convolution layout optimizations, and improved scaled_dot_product_attention pattern matching.

For a complete list of inductor optimizations, please see the Release Notes.

aarch64 Performance Optimizations

PyTorch 2.2 includes a number of performance enhancements for aarch64 including support for mkldnn weight pre-packing, improved ideep primitive caching, and improved inference speed via fixed format kernel improvements to OneDNN.

For a complete list of aarch64 optimizations, please see the Release Notes.

Read More

Benchmark and optimize endpoint deployment in Amazon SageMaker JumpStart 

When deploying a large language model (LLM), machine learning (ML) practitioners typically care about two measurements for model serving performance: latency, defined by the time it takes to generate a single token, and throughput, defined by the number of tokens generated per second. Although a single request to the deployed endpoint would exhibit a throughput approximately equal to the inverse of model latency, this is not necessarily the case when multiple concurrent requests are simultaneously sent to the endpoint. Due to model serving techniques, such as client-side continuous batching of concurrent requests, latency and throughput have a complex relationship that varies significantly based on model architecture, serving configurations, instance type hardware, number of concurrent requests, and variations in input payloads such as number of input tokens and output tokens.

This post explores these relationships via a comprehensive benchmarking of LLMs available in Amazon SageMaker JumpStart, including Llama 2, Falcon, and Mistral variants. With SageMaker JumpStart, ML practitioners can choose from a broad selection of publicly available foundation models to deploy to dedicated Amazon SageMaker instances within a network-isolated environment. We provide theoretical principles on how accelerator specifications impact LLM benchmarking. We also demonstrate the impact of deploying multiple instances behind a single endpoint. Finally, we provide practical recommendations for tailoring the SageMaker JumpStart deployment process to align with your requirements on latency, throughput, cost, and constraints on available instance types. All the benchmarking results as well as recommendations are based on a versatile notebook that you can adapt to your use case.

Deployed endpoint benchmarking

The following figure shows the lowest latency (left) and highest throughput (right) values for deployment configurations across a variety of model types and instance types. Importantly, each of these model deployments uses the default configuration provided by SageMaker JumpStart for the given model ID and instance type.

These latency and throughput values correspond to payloads with 256 input tokens and 256 output tokens. The lowest latency configuration limits model serving to a single concurrent request, and the highest throughput configuration maximizes the possible number of concurrent requests. As we can see in our benchmarking, increasing concurrent requests monotonically increases throughput with diminishing improvement for large concurrent requests. Additionally, models are fully sharded on the supported instance. For example, because the ml.g5.48xlarge instance has 8 GPUs, all SageMaker JumpStart models using this instance are sharded using tensor parallelism on all eight available accelerators.

We can note a few takeaways from this figure. First, not all models are supported on all instances; some smaller models, such as Falcon 7B, don’t support model sharding, whereas larger models have higher compute resource requirements. Second, as sharding increases, performance typically improves, but it may not improve for small models. This is because small models such as 7B and 13B incur a substantial communication overhead when sharded across too many accelerators. We discuss this in more depth later. Finally, ml.p4d.24xlarge instances tend to have significantly better throughput due to memory bandwidth improvements of A100 over A10G GPUs. As we discuss later, the decision to use a particular instance type depends on your deployment requirements, including latency, throughput, and cost constraints.

How can you obtain these lowest latency and highest throughput configuration values? Let’s start by plotting latency vs. throughput for a Llama 2 7B endpoint on an ml.g5.12xlarge instance for a payload with 256 input tokens and 256 output tokens, as seen in the following curve. A similar curve exists for every deployed LLM endpoint.

As concurrency increases, throughput and latency also monotonically increase. Therefore, the lowest latency point occurs at a concurrent request value of 1, and you can cost-effectively increase system throughput by increasing concurrent requests. There exists a distinct “knee” in this curve, where it’s obvious that the throughput gains associated with additional concurrency don’t outweigh the associated increase in latency. The exact location of this knee is use case-specific; some practitioners may define the knee at the point where a pre-specified latency requirement is exceeded (for example, 100 ms/token), whereas others may use load test benchmarks and queueing theory methods like the half-latency rule, and others may use theoretical accelerator specifications.

We also note that the maximum number of concurrent requests is limited. In the preceding figure, the line trace ends with 192 concurrent requests. The source of this limitation is the SageMaker invocation timeout limit, where SageMaker endpoints time out an invocation response after 60 seconds. This setting is account-specific and not configurable for an individual endpoint. For LLMs, generating a large number of output tokens can take seconds or even minutes. Therefore, large input or output payloads can cause the invocation requests to fail. Furthermore, if the number of concurrent requests is very large, then many requests will experience large queue times, pushing them past this 60-second timeout limit. For the purpose of this study, we use the timeout limit to define the maximum throughput possible for a model deployment. Importantly, although a SageMaker endpoint may handle a large number of concurrent requests without observing an invocation response timeout, you may want to define maximum concurrent requests with respect to the knee in the latency-throughput curve. This is likely the point at which you start to consider horizontal scaling, where a single endpoint provisions multiple instances with model replicas and load balances incoming requests between the replicas, to support more concurrent requests.

Taking this one step further, the following table contains benchmarking results for different configurations for the Llama 2 7B model, including different number of input and output tokens, instance types, and number of concurrent requests. Note that the preceding figure only plots a single row of this table.

. Throughput (tokens/sec) Latency (ms/token)
Concurrent Requests 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512
Number of total tokens: 512,    Number of output tokens: 256
ml.g5.2xlarge 30 54 115 208 343 475 486 33 33 35 39 48 97 159
ml.g5.12xlarge 59 117 223 406 616 866 1098 1214 17 17 18 20 27 38 60 112
ml.g5.48xlarge 56 108 202 366 522 660 707 804 18 18 19 22 32 50 101 171
ml.p4d.24xlarge 49 85 178 353 654 1079 1544 2312 2905 2944 21 23 22 23 26 31 44 58 92 165
Number of total tokens: 4096,    Number of output tokens: 256
ml.g5.2xlarge 20 36 48 49 48 57 104 170
ml.g5.12xlarge 33 58 90 123 142 31 34 48 73 132
ml.g5.48xlarge 31 48 66 82 31 43 68 120
ml.p4d.24xlarge 39 73 124 202 278 290 26 27 33 43 66 107

We observe some additional patterns in this data. When increasing context size, latency increases and throughput decreases. For instance, on ml.g5.2xlarge with a concurrency of 1, throughput is 30 tokens/sec when the number of total tokens is 512, vs. 20 tokens/sec if the number of total tokens is 4,096. This is because it takes more time to process the larger input. We can also see that increasing GPU capability and sharding impacts the maximum throughput and maximum supported concurrent requests. The table shows that Llama 2 7B has notably different maximum throughput values for different instance types, and these maximum throughput values occur at different values of concurrent requests. These characteristics would drive an ML practitioner to justify the cost of one instance over another. For example, given a low latency requirement, the practitioner might select an ml.g5.12xlarge instance (4 A10G GPUs) over an ml.g5.2xlarge instance (1 A10G GPU). If given a high throughput requirement, the use of an ml.p4d.24xlarge instance (8 A100 GPUs) with full sharding would only be justified under high concurrency. Note, however, that it’s often beneficial to instead load multiple inference components of a 7B model on a single ml.p4d.24xlarge instance; such multi-model support is discussed later in this post.

The preceding observations were made for the Llama 2 7B model. However, similar patterns remain true for other models as well. A primary takeaway is that latency and throughput performance numbers are dependent on payload, instance type, and number of concurrent requests, so you will need to find the ideal configuration for your specific application. To generate the preceding numbers for your use case, you can run the linked notebook, where you can configure this load test analysis for your model, instance type, and payload.

Making sense of accelerator specifications

Selecting suitable hardware for LLM inference relies heavily on specific use cases, user experience goals, and the chosen LLM. This section attempts to create an understanding of the knee in the latency-throughput curve with respect to high-level principles based on accelerator specifications. These principles alone don’t suffice to make a decision: real benchmarks are necessary. The term device is used here to encompass all ML hardware accelerators. We assert the knee in the latency-throughput curve is driven by one of two factors:

  • The accelerator has exhausted memory to cache KV matrices, so subsequent requests are queued
  • The accelerator still has spare memory for the KV cache, but is using a large enough batch size that processing time is driven by compute operation latency rather than memory bandwidth

We typically prefer to be limited by the second factor because this implies the accelerator resources are saturated. Basically, you are maximizing the resources you paid for. Let’s explore this assertion in greater detail.

KV caching and device memory

Standard transformer attention mechanisms compute attention for each new token against all previous tokens. Most modern ML servers cache attention keys and values in device memory (DRAM) to avoid re-computation at every step. This is called the KV cache, and it grows with batch size and sequence length. It defines how many user requests can be served in parallel, and it will determine the knee in the latency-throughput curve if the compute-bound regime in the second scenario mentioned earlier has not yet been reached, given the available DRAM. The following formula is a rough approximation for the maximum KV cache size.

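As a rough sketch, reconstructed from the symbol definitions and the worked example that follows, the formula estimates the KV cache memory required on each of N accelerators for a batch of B sequences at full context, where p denotes bytes per parameter (2 for FP16):

\text{KV cache size (bytes)} \approx \frac{2 \cdot B \cdot n_{\text{context}} \cdot n_{\text{layers}} \cdot n_{\text{kv\_attention\_heads}} \cdot d_{\text{attention\_head}} \cdot p}{N}
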
In this formula, B is batch size and N is the number of accelerators. For example, the Llama 2 7B model in FP16 (2 bytes/parameter) served on an A10G GPU (24 GB DRAM) consumes approximately 14 GB, leaving 10 GB for the KV cache. Plugging in the model’s full context length (n_context = 4096) and its remaining parameters (n_layers=32, n_kv_attention_heads=32, and d_attention_head=128), this expression shows we are limited to serving a batch size of four users in parallel due to DRAM constraints. If you observe the corresponding benchmarks in the previous table, this is a good approximation for the observed knee in this latency-throughput curve. Methods such as grouped query attention (GQA) can reduce the KV cache size, in GQA’s case by the same factor it reduces the number of KV heads.

Arithmetic intensity and device memory bandwidth

The growth in the computational power of ML accelerators has outpaced their memory bandwidth, meaning they can perform many more computations on each byte of data in the amount of time it takes to access that byte.

The arithmetic intensity, or the ratio of compute operations to memory accesses, for an operation determines if it is limited by memory bandwidth or compute capacity on the selected hardware. For example, an A10G GPU (g5 instance type family) with 70 TFLOPS FP16 and 600 GB/sec bandwidth can compute approximately 116 ops/byte. An A100 GPU (p4d instance type family) can compute approximately 208 ops/byte. If the arithmetic intensity for a transformer model is under that value, it is memory-bound; if it is above, it is compute-bound. The attention mechanism for Llama 2 7B requires 62 ops/byte for batch size 1 (for an explanation, see A guide to LLM inference and performance), which means it is memory-bound. When the attention mechanism is memory-bound, expensive FLOPS are left unutilized.

There are two ways to better utilize the accelerator and increase arithmetic intensity: reduce the required memory accesses for the operation (this is what FlashAttention focuses on) or increase the batch size. However, we might not be able to increase our batch size enough to reach a compute-bound regime if our DRAM is too small to hold the corresponding KV cache. A crude approximation of the critical batch size B* that separates compute-bound from memory-bound regimes for standard GPT decoder inference is described by the following expression, where A_mb is the accelerator memory bandwidth, A_f is accelerator FLOPS, and N is the number of accelerators. This critical batch size can be derived by finding where memory access time equals computation time. Refer to this blog post to understand Equation 2 and its assumptions in greater detail.

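As a rough sketch of this expression, obtained by setting the time to load FP16 weights equal to the compute time; the per-accelerator factors of N cancel because aggregate bandwidth and FLOPS both scale with the number of accelerators:

B^{*} \approx \frac{N \cdot A_{f}}{N \cdot A_{mb}} = \frac{A_{f}}{A_{mb}}
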
This is the same ops/byte ratio we previously calculated for A10G, so the critical batch size on this GPU is 116. One way to approach this theoretical critical batch size is to increase model sharding and split the cache across N accelerators. This effectively increases the KV cache capacity as well as the memory-bound batch size.

Another benefit of model sharding is splitting model parameter and data loading work across N accelerators. This type of sharding is a type of model parallelism also referred to as tensor parallelism. Naively, there is N times the memory bandwidth and compute power in aggregate. Assuming no overhead of any kind (communication, software, and so on), this would decrease decoding latency per token by N if we are memory-bound, because token decoding latency in this regime is bound by the time it takes to load the model weights and cache. In real life, however, increasing the degree of sharding results in increased communication between devices to share intermediate activations at every model layer. This communication speed is limited by the device interconnect bandwidth. It’s difficult to estimate its impact precisely (for details, see Model parallelism), but this can eventually stop yielding benefits or deteriorate performance — this is especially true for smaller models, because smaller data transfers lead to lower transfer rates.

To compare ML accelerators based on their specs, we recommend the following. First, calculate the approximate critical batch size for each accelerator type according to the second equation and the KV cache size for the critical batch size according to the first equation. You can then use the available DRAM on the accelerator to calculate the minimum number of accelerators required to fit the KV cache and model parameters. If deciding between multiple accelerators, prioritize accelerators in order of lowest cost per GB/sec of memory bandwidth. Finally, benchmark these configurations and verify what is the best cost/token for your upper bound of desired latency.

Select an endpoint deployment configuration

Many LLMs distributed by SageMaker JumpStart use the text-generation-inference (TGI) SageMaker container for model serving. The following parameters can be adjusted either to shape model serving behavior, which impacts the latency-throughput curve, or to guard the endpoint against requests that would overload it. These are the primary parameters you can use to configure your endpoint deployment for your use case. Unless otherwise specified, we use default text generation payload parameters and TGI environment variables.

Model serving configurations:

  • MAX_BATCH_PREFILL_TOKENS: Limits the number of tokens in the prefill operation. This operation generates the KV cache for a new input prompt sequence. It is memory intensive and compute bound, so this value caps the number of tokens allowed in a single prefill operation. Decoding steps for other queries pause while prefill is occurring. SageMaker JumpStart default: 4096 (TGI default) or the model-specific maximum supported context length (SageMaker JumpStart provided), whichever is greater.
  • MAX_BATCH_TOTAL_TOKENS: Controls the maximum number of tokens to include within a batch during decoding, or a single forward pass through the model. Ideally, this is set to maximize the usage of all available hardware. SageMaker JumpStart default: not specified (TGI default); TGI will set this value with respect to remaining CUDA memory during model warm-up.
  • SM_NUM_GPUS: The number of shards to use, that is, the number of GPUs used to run the model using tensor parallelism. SageMaker JumpStart default: instance dependent (SageMaker JumpStart provided). For each supported instance for a given model, SageMaker JumpStart provides the best setting for tensor parallelism.

Configurations to guard your endpoint (set these for your use case):

  • MAX_TOTAL_TOKENS: Caps the memory budget of a single client request by limiting the number of tokens in the input sequence plus the number of tokens in the output sequence (the max_new_tokens payload parameter). SageMaker JumpStart default: the model-specific maximum supported context length, for example, 4096 for Llama 2.
  • MAX_INPUT_LENGTH: Identifies the maximum allowed number of tokens in the input sequence for a single client request. Things to consider when increasing this value include: longer input sequences require more memory, which affects continuous batching, and many models have a supported context length that should not be exceeded. SageMaker JumpStart default: the model-specific maximum supported context length, for example, 4095 for Llama 2.
  • MAX_CONCURRENT_REQUESTS: The maximum number of concurrent requests allowed by the deployed endpoint. New requests beyond this limit will immediately raise a model overloaded error to prevent poor latency for the current processing requests. SageMaker JumpStart default: 128 (TGI default). This setting allows you to obtain high throughput for a variety of use cases, but you should pin it as appropriate to mitigate SageMaker invocation timeout errors.

The TGI server uses continuous batching, which dynamically batches concurrent requests together to share a single model inference forward pass. There are two types of forward passes: prefill and decode. Each new request must run a single prefill forward pass to populate the KV cache for the input sequence tokens. After the KV cache is populated, a decode forward pass performs a single next-token prediction for all batched requests, which is iteratively repeated to produce the output sequence. As new requests are sent to the server, the next decode step must wait so the prefill step can run for the new requests. This must occur before those new requests are included in subsequent continuously batched decode steps. Due to hardware constraints, the continuous batching used for decoding may not include all requests. At this point, requests enter a processing queue and inference latency starts to significantly increase with only minor throughput gain.

It’s possible to separate LLM latency benchmarking analyses into prefill latency, decode latency, and queue latency. The time consumed by each of these components is fundamentally different in nature: prefill is a one-time computation, decoding occurs one time for each token in the output sequence, and queueing involves server batching processes. When multiple concurrent requests are being processed, it becomes difficult to disentangle the latencies from each of these components because the latency experienced by any given client request involves queue latencies driven by the need to prefill new concurrent requests as well as queue latencies driven by the inclusion of the request in batch decoding processes. For this reason, this post focuses on end-to-end processing latency. The knee in the latency-throughput curve occurs at the point of saturation where queue latencies start to significantly increase. This phenomenon occurs for any model inference server and is driven by accelerator specifications.

Common requirements during deployment include satisfying a minimum required throughput, maximum allowed latency, maximum cost per hour, and maximum cost to generate 1 million tokens. You should condition these requirements on payloads that represent end-user requests. A design to meet these requirements should consider many factors, including the specific model architecture, size of the model, instance types, and instance count (horizontal scaling). In the following sections, we focus on deploying endpoints to minimize latency, maximize throughput, and minimize cost. This analysis considers 512 total tokens and 256 output tokens.

Minimize latency

Latency is an important requirement in many real-time use cases. In the following table, we look at minimum latency for each model and each instance type. You can achieve minimum latency by setting MAX_CONCURRENT_REQUESTS = 1.

Minimum Latency (ms/token)
Model ID ml.g5.2xlarge ml.g5.12xlarge ml.g5.48xlarge ml.p4d.24xlarge ml.p4de.24xlarge
Llama 2 7B 33 17 18 20
Llama 2 7B Chat 33 17 18 20
Llama 2 13B 22 23 23
Llama 2 13B Chat 23 23 23
Llama 2 70B 57 43
Llama 2 70B Chat 57 45
Mistral 7B 35
Mistral 7B Instruct 35
Mixtral 8x7B 33 27
Falcon 7B 33
Falcon 7B Instruct 33
Falcon 40B 53 33 27
Falcon 40B Instruct 53 33 28
Falcon 180B 42
Falcon 180B Chat 42

To achieve minimum latency for a model, you can use the following code while substituting your desired model ID and instance type:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.12xlarge",
    env={
        "MAX_CONCURRENT_REQUESTS": "1",
        "MAX_INPUT_TOKENS": "256",
        "MAX_TOTAL_TOKENS": "512",
    },
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

Note that the latency numbers change depending on the number of input and output tokens. However, the deployment process remains the same except the environment variables MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS. Here, these environment variables are set to help guarantee endpoint latency requirements because larger input sequences may violate the latency requirement. Note that SageMaker JumpStart already provides the other optimal environment variables when selecting instance type; for instance, using ml.g5.12xlarge will set SM_NUM_GPUS to 4 in the model environment.

Maximize throughput

In this section, we maximize the number of generated tokens per second. This is typically achieved at the maximum valid concurrent requests for the model and the instance type. In the following table, we report the throughput achieved at the largest concurrent request value achieved before encountering a SageMaker invocation timeout for any request.

Maximum Throughput (tokens/sec), Concurrent Requests
Model ID ml.g5.2xlarge ml.g5.12xlarge ml.g5.48xlarge ml.p4d.24xlarge ml.p4de.24xlarge
Llama 2 7B 486 (64) 1214 (128) 804 (128) 2945 (512)
Llama 2 7B Chat 493 (64) 1207 (128) 932 (128) 3012 (512)
Llama 2 13B 787 (128) 496 (64) 3245 (512)
Llama 2 13B Chat 782 (128) 505 (64) 3310 (512)
Llama 2 70B 124 (16) 1585 (256)
Llama 2 70B Chat 114 (16) 1546 (256)
Mistral 7B 947 (64)
Mistral 7B Instruct 986 (128)
Mixtral 8x7B 701 (128) 3196 (512)
Falcon 7B 1340 (128)
Falcon 7B Instruct 1313 (128)
Falcon 40B 244 (32) 382 (64) 2699 (512)
Falcon 40B Instruct 245 (32) 415 (64) 2675 (512)
Falcon 180B 1100 (128)
Falcon 180B Chat 1081 (128)

To achieve maximum throughput for a model, you can use the following code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.12xlarge",
    env={
        "MAX_CONCURRENT_REQUESTS": "128",  # For your application, identify it from the benchmarking table with the maximum feasible concurrent requests.
        "MAX_INPUT_TOKENS": "256",
        "MAX_TOTAL_TOKENS": "512",
    },
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

Note that the maximum number of concurrent requests depends on the model type, instance type, maximum number of input tokens, and maximum number of output tokens. Therefore, you should set these parameters before setting MAX_CONCURRENT_REQUESTS.

Also note that a user interested in minimizing latency is often at odds with a user interested in maximizing throughput. The former is interested in real-time responses, whereas the latter is interested in batch processing such that the endpoint queue is always saturated, thereby minimizing processing downtime. Users who want to maximize throughput conditioned on latency requirements are often interested in operating at the knee in the latency-throughput curve.

Minimize cost

The first option to minimize cost involves minimizing cost per hour. With this, you can deploy a selected model on the SageMaker instance with the lowest cost per hour. For real-time pricing of SageMaker instances, refer to Amazon SageMaker pricing. In general, the default instance type for SageMaker JumpStart LLMs is the lowest-cost deployment option.

The second option to minimize cost involves minimizing the cost to generate 1 million tokens. This is a simple transformation of the table we discussed earlier to maximize throughput: first compute the time in hours it takes to generate 1 million tokens (1e6 / throughput / 3600), then multiply that time by the price per hour of the specified SageMaker instance.
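
As a quick sketch of that arithmetic: the throughput value below is taken from the benchmarking tables earlier in this post, while the hourly price is a hypothetical placeholder, so substitute the current rate for your instance and Region.

throughput_tokens_per_sec = 1214   # e.g., Llama 2 7B on ml.g5.12xlarge at 128 concurrent requests
price_per_hour_usd = 7.09          # hypothetical hourly instance price; check Amazon SageMaker pricing

hours_per_million_tokens = 1e6 / throughput_tokens_per_sec / 3600
cost_per_million_tokens = hours_per_million_tokens * price_per_hour_usd
print(f"${cost_per_million_tokens:.2f} per 1 million tokens")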

Note that instances with the lowest cost per hour aren’t the same as instances with the lowest cost to generate 1 million tokens. For instance, if the invocation requests are sporadic, an instance with the lowest cost per hour might be optimal, whereas in the throttling scenarios, the lowest cost to generate a million tokens might be more appropriate.

Tensor parallel vs. multi-model trade-off

In all previous analyses, we considered deploying a single model replica with a tensor parallel degree equal to the number of GPUs on the deployment instance type. This is the default SageMaker JumpStart behavior. However, as previously noted, sharding a model can improve model latency and throughput only up to a certain limit, beyond which inter-device communication requirements dominate computation time. This implies that it’s often beneficial to deploy multiple models with a lower tensor parallel degree on a single instance rather than a single model with a higher tensor parallel degree.

Here, we deploy Llama 2 7B and 13B endpoints on ml.p4d.24xlarge instances with tensor parallel (TP) degrees of 1, 2, 4, and 8. For clarity in model behavior, each of these endpoints loads only a single model.

. Throughput (tokens/sec) Latency (ms/token)
Concurrent Requests 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512
TP Degree Llama 2 13B
1 38 74 147 278 443 612 683 722 26 27 27 29 37 45 87 174
2 49 92 183 351 604 985 1435 1686 1726 21 22 22 22 25 32 46 91 159
4 46 94 181 343 655 1073 1796 2408 2764 2819 23 21 21 24 25 30 37 57 111 172
8 44 86 158 311 552 1015 1654 2450 3087 3180 22 24 26 26 29 36 42 57 95 152
. Llama 2 7B
1 62 121 237 439 778 1122 1569 1773 1775 16 16 17 18 22 28 43 88 151
2 62 122 239 458 780 1328 1773 2440 2730 2811 16 16 17 18 21 25 38 56 103 182
4 60 106 211 420 781 1230 2206 3040 3489 3752 17 19 20 18 22 27 31 45 82 132
8 49 97 179 333 612 1081 1652 2292 2963 3004 22 20 24 26 27 33 41 65 108 167

Our previous analyses already showed significant throughput advantages on ml.p4d.24xlarge instances, which often translates to better performance in terms of cost to generate 1 million tokens over the g5 instance family under high concurrent request load conditions. This analysis clearly demonstrates that you should consider the trade-off between model sharding and model replication within a single instance; that is, a fully sharded model is not typically the best use of ml.p4d.24xlarge compute resources for 7B and 13B model families. In fact, for the 7B model family, you obtain the best throughput for a single model replica with a tensor parallel degree of 4 instead of 8.

From here, you can extrapolate that the highest throughput configuration for the 7B model involves a tensor parallel degree of 1 with eight model replicas, and the highest throughput configuration for the 13B model is likely a tensor parallel degree of 2 with four model replicas. To learn more about how to accomplish this, refer to Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker, which demonstrates the use of inference component-based endpoints. Due to load balancing techniques, server routing, and sharing of CPU resources, you might not fully achieve throughput improvements exactly equal to the number of replicas times the throughput for a single replica.

Horizontal scaling

As observed earlier, each endpoint deployment has a limitation on the number of concurrent requests depending on the number of input and output tokens as well as the instance type. If this doesn’t meet your throughput or concurrent request requirement, you can scale out by placing more than one instance behind the deployed endpoint. SageMaker automatically performs load balancing of queries between instances. For example, the following code deploys an endpoint supported by three instances:

model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    instance_type="ml.g5.2xlarge",
)
predictor = model.deploy(
    accept_eula=False,  # Change EULA acceptance to True
    initial_instance_count = 3,
)

The following table shows the throughput gain as a factor of number of instances for the Llama 2 7B model.

. . Throughput (tokens/sec) Latency (ms/token)
. Concurrent Requests 1 2 4 8 16 32 64 128 1 2 4 8 16 32 64 128
Instance Count Instance Type Number of total tokens: 512, Number of output tokens: 256
1 ml.g5.2xlarge 30 60 115 210 351 484 492 32 33 34 37 45 93 160
2 ml.g5.2xlarge 30 60 115 221 400 642 922 949 32 33 34 37 42 53 94 167
3 ml.g5.2xlarge 30 60 118 228 421 731 1170 1400 32 33 34 36 39 47 57 110

Notably, the knee in the latency-throughput curve shifts to the right because higher instance counts can handle larger numbers of concurrent requests within the multi-instance endpoint. For this table, the concurrent request value is for the entire endpoint, not the number of concurrent requests that each individual instance receives.

You can also use autoscaling, a feature to monitor your workloads and dynamically adjust the capacity to maintain steady and predictable performance at the possible lowest cost. This is beyond the scope of this post. To learn more about autoscaling, refer to Configuring autoscaling inference endpoints in Amazon SageMaker.

Invoke endpoint with concurrent requests

Let’s suppose you have a large batch of queries that you would like to use to generate responses from a deployed model under high throughput conditions. For example, in the following code block, we compile a list of 1,000 payloads, with each payload requesting the generation of 100 tokens. In all, we are requesting the generation of 100,000 tokens.

payload = {
    "inputs": "I believe the meaning of life is to ",
    "parameters": {"max_new_tokens": 100, "details": True},
}
total_requests = 1000
payloads = [payload,] * total_requests

When sending a large number of requests to the SageMaker runtime API, you may experience throttling errors. To mitigate this, you can create a custom SageMaker runtime client that increases the number of retry attempts. You can provide the resulting SageMaker session object to either the JumpStartModel constructor or sagemaker.predictor.retrieve_default if you would like to attach a new predictor to an already deployed endpoint. In the following code, we use this session object when deploying a Llama 2 model with default SageMaker JumpStart configurations:

import boto3
from botocore.config import Config
from sagemaker.session import Session
from sagemaker.jumpstart.model import JumpStartModel

sagemaker_session = Session(
    sagemaker_runtime_client=boto3.client(
        "sagemaker-runtime",
        config=Config(connect_timeout=10, retries={"mode": "standard", "total_max_attempts": 20}),
    )
)
model = JumpStartModel(
    model_id="meta-textgeneration-llama-2-7b",
    model_version="3.*",
    sagemaker_session=sagemaker_session
)
predictor = model.deploy(accept_eula=False)  # Change EULA acceptance to True

This deployed endpoint has MAX_CONCURRENT_REQUESTS = 128 by default. In the following block, we use the concurrent futures library to iterate over invoking the endpoint for all payloads with 128 worker threads. At most, the endpoint will process 128 concurrent requests, and whenever a request returns a response, the executor will immediately send a new request to the endpoint.

import time
from concurrent import futures

concurrent_requests = 128

time_start = time.time()
with futures.ThreadPoolExecutor(max_workers=concurrent_requests) as executor:
    responses = list(executor.map(predictor.predict, payloads))

total_tokens = sum([response[0]["details"]["generated_tokens"] for response in responses])
token_throughput = total_tokens / (time.time() - time_start)

This results in generating 100,000 total tokens with a throughput of 1255 tokens/sec on a single ml.g5.2xlarge instance. This takes approximately 80 seconds to process.

Note that this throughput value is notably different than the maximum throughput for Llama 2 7B on ml.g5.2xlarge in the previous tables of this post (486 tokens/sec at 64 concurrent requests). This is because the input payload uses 8 tokens instead of 256, the output token count is 100 instead of 256, and the smaller token counts allow for 128 concurrent requests. This is a final reminder that all latency and throughput numbers are payload dependent! Changing payload token counts will affect batching processes during model serving, which will in turn affect the emergent prefill, decode, and queue times for your application.

Conclusion

In this post, we presented benchmarking of SageMaker JumpStart LLMs, including Llama 2, Mistral, and Falcon. We also presented a guide to optimize latency, throughput, and cost for your endpoint deployment configuration. You can get started by running the associated notebook to benchmark your use case.


About the Authors

 Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He received his PhD from the University of Illinois at Urbana-Champaign and was a postdoctoral researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He received his PhD from the University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

João Moura is a Senior AI/ML Specialist Solutions Architect at AWS. João helps AWS customers – from small startups to large enterprises – train and deploy large models efficiently, and more broadly build ML platforms on AWS.

Boston Children’s Researchers, in Joint Effort, Deploy AI Across Their Hip Clinic to Support Patients, Doctors

Hip disorders, comprising some of the world’s most common joint diseases, are especially prevalent among adolescents and young adults, causing stiffness, pain or a limp. But they can be hard to diagnose using 2D medical imaging alone.

Helping to treat these disorders, Boston Children’s Hospital’s (BCH’s) Adolescent and Young Adult Hip Preservation Program is the first in the U.S. to deploy a fully automated AI tool across its clinic.

Called VirtualHip, the tool can create a 3D model of a hip from routine medical images, assess anatomy and movement-related issues, and provide clinicians with diagnostic and treatment guidance. It was built at an orthopedic research lab at BCH, Harvard Medical School’s primary pediatric hospital, using the NVIDIA DGX platform.

A team of 10 researchers, including engineers, computer scientists, orthopedic surgeons, radiologists and software developers, is working on the project, which is funded in part by an NVIDIA Academic Hardware Grant.

“Using AI, clinicians can get more value out of the clinical data they routinely collect,” said Dr. Ata Kiapour, the lab’s principal investigator, director of the musculoskeletal informatics group at BCH and assistant professor of orthopedic surgery at Harvard Medical School. “This tool can augment their performance to make better choices in diagnosis and treatment, and free their time to focus on giving patients the best care.”

Getting Straight to the Joint

Often, clinicians must determine a treatment plan — with levels of intervention ranging from physical therapy to total hip replacement — based on just a 2D image, such as an X-ray, CT scan or MRI. Automatically creating 3D models from these images, and using them to conduct comprehensive joint assessments, can help improve the accuracy of diagnosis to inform treatment planning.

“I started a postdoc with an orthopedic surgeon at BCH in 2013, when I saw how an engineer or scientist could help with patient treatment,” said Dr. Kiapour, who’s also trained as a biomedical engineer. “Over the years, I saw that hospitals have a ton of data, but efficient data processing for clinical use was a huge, unmet need.”

VirtualHip, fully integrated with BCH’s hip clinic and radiology database, helps to fill this gap.

Clinicians can log in to the software tool through a web-based portal, view and interact with 3D simulations of 2D medical images, submit analysis requests and see results within an hour — 4x quicker on average than receiving a radiology report after imaging.

The tool, which produces 3D models with a margin of error less than one millimeter, can assess morphological abnormalities and identify issues related to hip motion, such as pain from hip bones rubbing against each other.

VirtualHip was developed using a database with tens of millions of clinical notes and imaging data from patients seen at BCH over the past two decades. Using natural language processing models and computer vision algorithms, Dr. Kiapour’s team processed this data, training the tool to analyze normal versus pathologic hip development and identify factors influencing injury risk and treatment outcomes.

This will enable VirtualHip to offer patient-specific risk assessment and treatment planning by comparing current patients to previously treated ones with similar demographic backgrounds.

“The hardware support that we got from NVIDIA made all of this possible,” Dr. Kiapour said. “DGX enabled advanced computations on more than 20 years’ worth of historical data for our fine-tuned clinical AI model.”

Clearer Explanations, Better Outcomes for Patients

In addition to assisting doctors in diagnosis and treatment planning, VirtualHip helps patients better understand their condition.

“When patients look at an X-ray, it doesn’t look like a real hip — but with a 3D model that can be rotated, the doctor can show exactly where the joints are impinging or unstable,” Dr. Kiapour said. “This lets the patient better understand their disorder, which usually makes them more compliant with whatever the doctor’s prescribing.”

This type of visualization is especially helpful for children and younger adults, Kiapour added.

The VirtualHip project is under continuous development, including work toward a patient-facing platform — using large language models and generative AI — for personalized analyses and treatment recommendations.

The BCH researchers plan to commercialize the product for use in other hospitals.


Co-ML: Collaborative Machine Learning Model Building for Developing Dataset Design Practices

Machine learning (ML) models are fundamentally shaped by data, and building inclusive ML systems requires significant considerations around how to design representative datasets. Yet, few novice-oriented ML modeling tools are designed to foster hands-on learning of dataset design practices, including how to design for data diversity and inspect for data quality.
To this end, we outline a set of four data design practices (DDPs) for designing inclusive ML models and share how we designed a tablet-based application called Co-ML to foster the learning of DDPs through a collaborative ML model…Apple Machine Learning Research

Investigating Salient Representations and Label Variance Modeling in Dimensional Speech Emotion Analysis

Representations from models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden units BERT (HuBERT) have helped to achieve state-of-the-art performance in dimensional speech emotion recognition. Both HuBERT and BERT generate fairly high-dimensional representations, and neither model was trained with the emotion recognition task in mind. Such high-dimensional representations result in speech emotion models with a large parameter size, increasing both memory and computational costs. In this work, we investigate the selection of representations…Apple Machine Learning Research

Acoustic Model Fusion for End-to-end Speech Recognition

Recent advances in deep learning and automatic speech recognition (ASR) have enabled the end-to-end (E2E) ASR system and boosted its accuracy to a new level. The E2E systems implicitly model all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler system architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven to be beneficial. However, the application of LM fusion presents certain drawbacks, such as its inability to address the domain…Apple Machine Learning Research