How FP8 boosts LLM training by 18% on Amazon SageMaker P5 instances

How FP8 boosts LLM training by 18% on Amazon SageMaker P5 instances

Large language models (LLMs) are AI systems trained on vast amounts of text data, enabling them to understand, generate, and reason with natural language in highly capable and flexible ways. LLM training has seen remarkable advances in recent years, with organizations pushing the boundaries of what’s possible in terms of model size, performance, and efficiency. In this post, we explore how FP8 optimization can significantly speed up large model training on Amazon SageMaker P5 instances.

LLM training using SageMaker P5

In 2023, SageMaker announced P5 instances, which support up to eight of the latest NVIDIA H100 Tensor Core GPUs. Equipped with high-bandwidth networking technologies like EFA, P5 instances provide a powerful platform for distributed training, enabling large models to be trained in parallel across multiple nodes. With the use of Amazon SageMaker Model Training, organizations have been able to achieve higher training speeds and efficiency by turning to P5 instances. This showcases the transformative potential of training different scales of models faster and more efficiently using SageMaker Training.

LLM training using FP8

P5 instances, which are NVIDIA H100 GPUs underneath, also come with capabilities of training models using FP8 precision. The FP8 data type has emerged as a game changer in LLM training. By reducing the precision of the model’s weights and activations, FP8 allows for more efficient memory usage and faster computation, without significantly impacting model quality. The throughput for running matrix operations like multipliers and convolutions on 32-bit float tensors is much lower than using 8-bit float tensors. FP8 precision reduces the data footprint and computational requirements, making it ideal for large-scale models where memory and speed are critical. This enables researchers to train larger models with the same hardware resources, or to train models faster while maintaining comparable performance. To make the models compatible for FP8, NVIDIA released the Transformer Engine (TE) library, which provides support for some layers like Linear, LayerNorm, and DotProductAttention. To enable FP8 training, models need to use the TE API to incorporate these layers when casted to FP8. For example, the following Python code shows how FP8-compatible layers can be integrated:

try:
    import transformer_engine.pytorch as te
    using_te = True
except ImportError as ie:
    using_te = False
......
linear_type: nn.Module = te.Linear if using_te else nn.Linear
......
    in_proj = linear_type(dim, 3 * n_heads * head_dim, bias=False, device='cuda' if using_te)
    out_proj = linear_type(n_heads * head_dim, dim, bias=False, device='cuda' if using_te)
......

Results

We ran some tests using 1B-parameter and 7B-parameter LLMs by running training with and without FP8. The test is run on 24 billion tokens for one epoch, thereby providing a comparison for throughput (in tokens per second per GPU) and model performance (in loss numbers). For 1B-parameter models, we computed results to compare performance with FP8 using a different number of instances for distributed training. The following table summarizes our results:

Number of P5 Nodes Without FP8 With FP8 % Faster by Using FP8 % Loss Higher with FP8 than Without FP8
Tokens/sec/GPU % Decrease Loss After 1 Epoch Tokens/sec/GPU % Decrease Loss After 1 Epoch
1 40200 6.205 40800 6.395 1.49 3.06
2 38500 4.2288 6.211 41600 -3.4825 6.338 8.05 2.04
4 39500 1.7412 6.244 42000 -4.4776 6.402 6.32 2.53
8 38200 4.9751 6.156 41800 -3.98 6.365 9.42 3.39
16 35500 11.6915 6.024 39500 1.7412 6.223 11.26 3.3
32 33500 16.6667 6.112 38000 5.4726 6.264 13.43 2.48

The following graph that shows the throughput performance of 1B-parameter model in terms of tokens/second/gpu over different numbers of P5 instances:

For 7B-parameter models, we computed results to compare performance with FP8 using different number of instances for distributed training. The following table summarizes our results:

Number of P5 Nodes Without FP8 With FP8 % Faster by Using FP8 % Loss Higher with FP8 than Without FP8
Tokens/sec/GPU % Decrease Loss After 1 Epoch Tokens/sec/GPU % Decrease Loss After 1 Epoch
1 9350 6.595 11000 6.602 15 0.11
2 9400 -0.5347 6.688 10750 2.2935 6.695 12.56 0.1
4 9300 0.5347 6.642 10600 3.6697 6.634 12.26 -0.12
8 9250 1.0695 6.612 10400 4.9541 6.652 11.06 0.6
16 8700 6.9518 6.594 10100 8.7155 6.644 13.86 0.76
32 7900 15.508 6.523 9700 11.8182 6.649 18.56 1.93

The following graph that shows the throughput performance of 7B-parameter model in terms of tokens/second/gpu over different numbers of P5 instances:

The preceding tables show how, when using FP8, the training of 1B models is faster by 13% and training of 7B models is faster by 18%. As model training speed increases with FP8, there is generally a trade-off with a slower decrease in loss. However, the impact on model performance after one epoch remains minimal, with only about a 3% higher loss for 1B models and 2% higher loss for 7B models using FP8 as compared to training without using FP8. The following graph illustrates the loss performance.

As discussed in Scalable multi-node training with TensorFlow, due to inter-node communication, a small decline in the overall throughput is observed as the number of nodes increases.

The impact on LLM training and beyond

The use of FP8 precision combined with SageMaker P5 instances has significant implications for the field of LLM training. By demonstrating the feasibility and effectiveness of this approach, it opens the door for other researchers and organizations to adopt similar techniques, accelerating progress in large model training. Moreover, the benefits of FP8 and advanced hardware extend beyond LLM training. These advancements can also accelerate research in fields like computer vision and reinforcement learning by enabling the training of larger, more complex models with less time and fewer resources, ultimately saving time and cost. In terms of inference, models with FP8 activations have shown to improve two-fold over BF16 models.

Conclusion

The adoption of FP8 precision and SageMaker P5 instances marks a significant milestone in the evolution of LLM training. By pushing the boundaries of model size, training speed, and efficiency, these advancements have opened up new possibilities for research and innovation in large models. As the AI community builds on these technological strides, we can expect even more breakthroughs in the future. Ongoing research is exploring further improvements through techniques such as PyTorch 2.0 Fully Sharded Data Parallel (FSDP) and TorchCompile. Coupling these advancements with FP8 training could lead to even faster and more efficient LLM training. For those interested in the potential impact of FP8, experiments with 1B or 7B models, such as GPT-Neo or Meta Llama 2, on SageMaker P5 instances could offer valuable insights into the performance differences compared to FP16 or FP32.


About the Authors

Romil Shah is a Sr. Data Scientist at AWS Professional Services. Romil has more than 8 years of industry experience in computer vision, machine learning, generative AI, and IoT edge devices. He works with customers, helping in training, optimizing and deploying foundation models for edge devices and on the cloud.

Mike Garrison is a Global Solutions Architect based in Ypsilanti, Michigan. Utilizing his twenty years of experience, he helps accelerate tech transformation of automotive companies. In his free time, he enjoys playing video games and travel.

Read More

The Need for Speed: NVIDIA Accelerates Majority of World’s Supercomputers to Drive Advancements in Science and Technology

The Need for Speed: NVIDIA Accelerates Majority of World’s Supercomputers to Drive Advancements in Science and Technology

Starting with the release of CUDA in 2006, NVIDIA has driven advancements in AI and accelerated computing — and the most recent TOP500 list of the world’s most powerful supercomputers highlights the culmination of the company’s achievements in the field.

This year, 384 systems on the TOP500 list are powered by NVIDIA technologies. Among the 53 new to the list, 87% — 46 systems — are accelerated. Of those accelerated systems, 85% use NVIDIA Hopper GPUs, driving advancements in areas like climate forecasting, drug discovery and quantum simulation.

Accelerated computing is much more than floating point operations per second (FLOPS). It requires full-stack, application-specific optimization. At SC24 this week, NVIDIA announced the release of cuPyNumeric, an NVIDIA CUDA-X library that enables over 5 million developers to seamlessly scale to powerful computing clusters without modifying their Python code.

At the conference, NVIDIA also revealed significant updates to the NVIDIA CUDA-Q development platform, which empowers quantum researchers to simulate quantum devices at a scale previously thought computationally impossible.

A New Era of Scientific Discovery With Mixed Precision and AI

Mixed-precision floating-point operations and AI have become the tools of choice for researchers grappling with the complexities of modern science. They offer greater speed, efficiency and adaptability than traditional methods, without compromising accuracy.

This shift isn’t just theoretical — it’s already happening. At SC24, two Gordon Bell finalist projects revealed how using AI and mixed precision helped advance genomics and protein design.

In his paper titled “Using Mixed Precision for Genomics,” David Keyes, a professor at King Abdullah University of Science and Technology, used 0.8 exaflops of mixed precision to explore relationships between genomes and their generalized genotypes, and then to the prevalence of diseases to which they are subject.

Similarly, Arvind Ramanathan, a computational biologist from the Argonne National Laboratory, harnessed 3 exaflops of AI performance on the NVIDIA Grace Hopper-powered Alps system to speed up protein design.

To further advance AI-driven drug discovery and the development of lifesaving therapies, researchers can use NVIDIA BioNeMo, powerful tools designed specifically for pharmaceutical applications. Now in open source, the BioNeMo Framework can accelerate AI model creation, customization and deployment for drug discovery and molecular design.

Across the TOP500, the widespread use of AI and mixed-precision floating-point operations reflects a global shift in computing priorities. A total of 249 exaflops of AI performance are now available to TOP500 systems, supercharging innovations and discoveries across industries.

TOP500 total AI, FP32 and FP64 FLOPs by year.

NVIDIA-accelerated TOP500 systems excel across key metrics like AI and mix-precision system performance. With over 190 exaflops of AI performance and 17 exaflops of single-precision (FP32), NVIDIA’s accelerated computing platform is the new engine of scientific computing. NVIDIA also delivers 4 exaflops of double-precision (FP64) performance for certain scientific calculations that still require it.

Accelerated Computing Is Sustainable Computing

As the demand for computational capacity grows, so does the need for sustainability.

In the Green500 list of the world’s most energy-efficient supercomputers, systems with NVIDIA accelerated computing rank among eight of the top 10. The JEDI system at EuroHPC/FZJ, for example, achieves a staggering 72.7 gigaflops per watt, setting a benchmark for what’s possible when performance and sustainability align.

For climate forecasting, NVIDIA announced at SC24 two new NVIDIA NIM microservices for NVIDIA Earth-2, a digital twin platform for simulating and visualizing weather and climate conditions. The CorrDiff NIM and FourCastNet NIM microservices can accelerate climate change modeling and simulation results by up to 500x.

In a world increasingly conscious of its environmental footprint, NVIDIA’s innovations in accelerated computing balance high performance with energy efficiency to help realize a brighter, more sustainable future.

Watch the replay of NVIDIA’s special address at SC24 and learn more about the company’s news in the SC24 online press kit.

See notice regarding software product information.

Read More

Do LLMs Internally “Know” When They Follow Instructions?

This paper was accepted at the Foundation Model Interventions (MINT) Workshop at NeurIPS 2024.
Instruction-following is crucial for building AI agents with large language models (LLMs), as these models must adhere strictly to user-provided guidelines. However, LLMs often fail to follow even simple instructions. To improve instruction-following behavior and prevent undesirable outputs, we need a deeper understanding of how LLMs’ internal states relate to these outcomes. Our analysis of LLM internal states reveal a dimension in the input embedding space linked to successful…Apple Machine Learning Research

Transformation-Invariant Learning and Theoretical Guarantees for OOD Generalization

Learning with identical train and test distributions has been extensively investigated both practically and theoretically. Much remains to be understood, however, in statistical learning under distribution shifts. This paper focuses on a distribution shift setting where train and test distributions can be related by classes of (data) transformation maps. We initiate a theoretical study for this framework, investigating learning scenarios where the target class of transformations is either known or unknown. We establish learning rules and algorithmic reductions to Empirical Risk Minimization…Apple Machine Learning Research

Faster Algorithms for User-Level Private Stochastic Convex Optimization

We study private stochastic convex optimization (SCO) under user-level differential privacy (DP) constraints. In this setting, there are nnn users, each possessing mmm data items, and we need to protect the privacy of each user’s entire collection of data items. Existing algorithms for user-level DP SCO are impractical in many large-scale machine learning scenarios because: (i) they make restrictive assumptions on the smoothness parameter of the loss function and require the number of users to grow polynomially with the dimension of the parameter space; or (ii) they are prohibitively slow…Apple Machine Learning Research

Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions

We study the problem of differentially private stochastic convex optimization (DP-SCO) with heavy-tailed gradients, where we assume a kthk^{text{th}}kth-moment bound on the Lipschitz constants of sample functions, rather than a uniform bound. We propose a new reduction-based approach that enables us to obtain the first optimal rates (up to logarithmic factors) in the heavy-tailed setting, achieving error G2⋅1n+Gk⋅(dnε)1−1kG_2 cdot frac 1 {sqrt n} + G_k cdot (frac{sqrt d}{nvarepsilon})^{1 – frac 1 k}G2​⋅n​1​+Gk​⋅(nεd​​)1−k1​ under (ε,δ)(varepsilon, delta)(ε,δ)-approximate…Apple Machine Learning Research

Private Online Learning via Lazy Algorithms

We study the problem of private online learning, specifically, online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that transforms lazy online learning algorithms into private algorithms. We apply our transformation for differentially private OPE and OCO using existing lazy algorithms for these problems. Our final algorithms obtain regret which significantly improves the regret in the high privacy regime ε≪1varepsilon ll 1ε≪1, obtaining Tlog⁡d+T1/3log⁡(d)/ε2/3sqrt{T log d} + T^{1/3} log(d)/varepsilon^{2/3}Tlogd​+T1/3log(d)/ε2/3 for…Apple Machine Learning Research

Do LLMs Estimate Uncertainty Well in Instruction-Following?

This paper was accepted at the Safe Generative AI Workshop (SGAIW) at NeurIPS 2024.
Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of uncertainty…Apple Machine Learning Research