Fine-tune GPT-J using an Amazon SageMaker Hugging Face estimator and the model parallel library

GPT-J is an open-source 6-billion-parameter model released by Eleuther AI. The model is trained on the Pile and can perform various language processing tasks. It can support a wide variety of use cases, including text classification, token classification, text generation, question answering, entity extraction, summarization, sentiment analysis, and many more. GPT-J is a transformer model trained using Ben Wang’s Mesh Transformer JAX.

In this post, we present a guide and best practices on training large language models (LLMs) using the Amazon SageMaker distributed model parallel library to reduce training time and cost. You will learn how to train a 6-billion-parameter GPT-J model on SageMaker with ease. Finally, we share the main features of SageMaker distributed model parallelism that help with speeding up training time.

Transformer neural networks

A transformer neural network is a popular deep learning architecture for solving sequence-to-sequence tasks. It uses attention as the learning mechanism to achieve close to human-level performance. Other useful properties of the architecture compared to previous generations of natural language processing (NLP) models include the ability to distribute, scale, and pre-train. Transformer-based models can be applied across different use cases when dealing with text data, such as search, chatbots, and many more. Transformers use the concept of pre-training to gain intelligence from large datasets. Pre-trained transformers can be used as is or fine-tuned on your datasets, which can be much smaller and specific to your business.

Hugging Face on SageMaker

Hugging Face is a company developing some of the most popular open-source libraries, providing state-of-the-art NLP technology based on transformer architectures. The Hugging Face transformers, tokenizers, and datasets libraries provide APIs and tools to download pre-trained models in multiple languages and run predictions with them. SageMaker enables you to train, fine-tune, and run inference using Hugging Face models directly from the Hugging Face Model Hub using the Hugging Face estimator in the SageMaker SDK. The integration makes it easier to customize Hugging Face models for domain-specific use cases. Behind the scenes, the SageMaker SDK uses AWS Deep Learning Containers (DLCs), which are a set of prebuilt Docker images for training and serving models offered by SageMaker. The DLCs are developed through a collaboration between AWS and Hugging Face. The integration also connects the Hugging Face transformers SDK with the SageMaker distributed training libraries, enabling you to scale your training jobs on a cluster of GPUs.
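
For example, a training job with the Hugging Face estimator can be launched in a few lines of code. The following is a minimal sketch only: the entry point, framework versions, role, and Amazon S3 paths are illustrative assumptions to be replaced with your own values.

from sagemaker.huggingface import HuggingFace

# Minimal sketch of a Hugging Face estimator; all values are illustrative
huggingface_estimator = HuggingFace(
    entry_point="train.py",                  # assumed name of your training script
    source_dir="./scripts",                  # directory containing the script
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    role="<your-sagemaker-execution-role>",
    transformers_version="4.17",             # choose a supported DLC version combination
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={"epochs": 1, "model_name_or_path": "EleutherAI/gpt-j-6B"},
)

# Start the training job, pointing the train channel at your dataset in Amazon S3
huggingface_estimator.fit({"train": "s3://<your-bucket>/train"})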

Overview of the SageMaker distributed model parallel library

Model parallelism is a distributed training strategy that partitions the deep learning model over numerous devices, within or across instances. Deep learning (DL) models with more layers and parameters perform better in complex tasks like computer vision and NLP. However, the maximum model size that can be stored in the memory of a single GPU is limited. GPU memory constraints can be bottlenecks while training DL models in the following ways:

  • They limit the size of the model that can be trained because a model’s memory footprint scales proportionately to the number of parameters
  • They reduce GPU utilization and training efficiency by limiting the per-GPU batch size during training

SageMaker includes the distributed model parallel library to help distribute and train DL models effectively across many compute nodes, overcoming the restrictions associated with training a model on a single GPU. Furthermore, the library lets you achieve optimal distributed training using EFA-supported devices, which improve inter-node communication performance with low latency, high throughput, and OS bypass.

Because large models such as GPT-J, with billions of parameters, have a GPU memory footprint that exceeds the memory of a single GPU, it becomes essential to partition them across multiple GPUs. The SageMaker model parallel (SMP) library enables automatic partitioning of models across multiple GPUs. With SageMaker model parallelism, SageMaker runs an initial profiling job on your behalf to analyze the compute and memory requirements of the model. This information is then used to decide how the model is partitioned across GPUs, in order to optimize an objective such as maximizing speed or minimizing memory footprint.

It also supports optional pipelined run scheduling to maximize the overall utilization of available GPUs. The propagation of activations during the forward pass and of gradients during the backward pass requires sequential computation, which limits GPU utilization. SageMaker overcomes this sequential computation constraint with a pipeline run schedule that splits mini-batches into micro-batches to be processed in parallel on different GPUs. SageMaker model parallelism supports two modes of pipeline runs (a configuration sketch follows the list):

  • Simple pipeline – This mode finishes the forward pass for each micro-batch before starting the backward pass.
  • Interleaved pipeline – In this mode, the backward run of the micro-batches is prioritized whenever possible. This allows for quicker release of the memory used for activations, thereby using memory more efficiently.
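
The pipeline mode and the number of micro-batches are passed to the library through the modelparallel section of the estimator’s distribution configuration. The following sketch uses illustrative values only; the appropriate degrees and micro-batch count depend on your model and cluster.

# Illustrative modelparallel parameters controlling pipeline execution
smp_parameters = {
    "pipeline_parallel_degree": 2,   # number of model partitions
    "microbatches": 4,               # each mini-batch is split into 4 micro-batches
    "pipeline": "interleaved",       # or "simple"
    "optimize": "speed",             # partitioning objective: "speed" or "memory"
}

distribution = {
    "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
    "mpi": {"enabled": True, "processes_per_host": 8},
}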

Tensor parallelism

Individual layers, or nn.Modules, are divided across devices using tensor parallelism so they can run concurrently. The simplest example of how the library divides a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2) is shown in the following figure. Each model replica’s layers are bisected and distributed between two GPUs. The degree of data parallelism is eight in this example because the model parallel configuration additionally includes "pipeline_parallel_degree": 1 and "ddp": True. The library manages communication among the replicas of the tensor-distributed model.
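
As a sketch, this configuration could be expressed through the modelparallel parameters as follows; the degrees are illustrative.

# Two-way tensor parallelism with no pipeline parallelism; the remaining
# GPUs form the data parallel group. Values are illustrative.
tensor_parallel_parameters = {
    "tensor_parallel_degree": 2,
    "pipeline_parallel_degree": 1,
    "ddp": True,   # use distributed data parallelism across the model replicas
}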

The benefit of this feature is that you may choose which layers or which subset of layers you want to apply tensor parallelism to. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set up a combination of pipeline and tensor parallelism, see Extended Features of the SageMaker Model Parallel Library for PyTorch.

SageMaker sharded data parallelism

Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group.

When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint of the model by sharding the training state over multiple GPUs. This yields two benefits: you can fit larger models, which would otherwise run out of memory with standard data parallelism, or you can increase the batch size using the freed-up GPU memory.

The standard data parallelism technique replicates the training state across the GPUs in the data parallel group and performs gradient aggregation based on the AllReduce operation. Sharded data parallelism, in effect, introduces a trade-off between communication overhead and GPU memory efficiency. It increases the communication cost, but the memory footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data parallelism degree, so larger models can fit in a GPU cluster.
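
Sharded data parallelism is also enabled through the modelparallel parameters. The following is a sketch; the degree is a placeholder and must evenly divide the number of GPUs in the data parallel group.

# Illustrative setting: shard the training state across 8 GPUs
sharded_data_parallel_parameters = {
    "sharded_data_parallel_degree": 8,
}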

SageMaker implements sharded data parallelism through the MiCS implementation. For more information, see Near-linear scaling of gigantic-model training on AWS.

Refer to Sharded Data Parallelism for further details on how to apply sharded data parallelism to your training jobs.

Use the SageMaker model parallel library

The SageMaker model parallel library comes with the SageMaker Python SDK. You need to install the SageMaker Python SDK to use the library; it’s already installed on SageMaker notebook kernels. To make your PyTorch training script utilize the capabilities of the SMP library, you need to make the following changes:

  1. Start by importing and initializing the smp library using the smp.init() call.
  2. Once it’s initialized, wrap your model with the smp.DistributedModel wrapper and use the returned DistributedModel object instead of the user-defined model.
  3. For your optimizer state, use the smp.DistributedOptimizer wrapper around your model optimizer, enabling smp to save and load the optimizer state. Abstract the forward and backward pass logic into a separate function and add the smp.step decorator to it. Essentially, the forward pass and back-propagation need to run inside the function with the smp.step decorator placed over it. This allows smp to split the tensor inputs to the function into the number of micro-batches specified when launching the training job.
  4. Next, move the input tensors to the GPU used by the current process using the torch.cuda.set_device API followed by the .to() API call.
  5. Finally, for back-propagation, replace torch.Tensor.backward and torch.autograd.backward with DistributedModel.backward.

See the following code:

@smp.step
def train_step(model, data, target):
    # The forward pass runs on the micro-batches that smp creates from the inputs
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    # Back-propagation must go through DistributedModel.backward instead of loss.backward()
    model.backward(loss)

    return output, loss

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(model_config)

model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)
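
The following sketch shows how these pieces fit together in the training loop; the data loader and optimizer names are assumptions for illustration.

import torch

# Pin the process to its assigned GPU before moving tensors to it
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

for data, target in train_dataloader:        # assumed PyTorch DataLoader
    optimizer.zero_grad()
    data, target = data.to(device), target.to(device)
    # train_step is the @smp.step-decorated function above; smp splits its
    # tensor inputs into micro-batches and returns per-micro-batch outputs
    _, loss_mb = train_step(model, data, target)
    loss = loss_mb.reduce_mean()             # average the loss over the micro-batches
    optimizer.step()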

The SageMaker model parallel library’s tensor parallelism offers out-of-the-box support for the following Hugging Face Transformer models:

  • GPT-2, BERT, and RoBERTa (available in the SMP library v1.7.0 and later)
  • GPT-J (available in the SMP library v1.8.0 and later)
  • GPT-Neo (available in the SMP library v1.10.0 and later)

Best practices for performance tuning with the SMP library

When training large models, consider the following practices so that your model fits in GPU memory with a reasonable batch size (a configuration sketch follows the list):

  • It’s recommended to use instances with higher GPU memory and high bandwidth interconnect for performance, such as p4d and p4de instances.
  • Optimizer state sharding can be enabled in most cases, and will be helpful when you have more than one copy of the model (data parallelism enabled). You can turn on optimizer state sharding by setting "shard_optimizer_state": True in the modelparallel configuration.
  • Use activation checkpointing, a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass of selected modules in the model.
  • Use activation offloading, an additional feature that can further reduce memory usage. To use activation offloading, set "offload_activations": True in the modelparallel configuration. Use it when activation checkpointing and pipeline parallelism are turned on and the number of micro-batches is greater than one.
  • Enable tensor parallelism and increase parallelism degrees where the degree is a power of 2. Typically for performance reasons, tensor parallelism is restricted to within a node.
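
As a sketch under these recommendations, the memory-saving flags go into the modelparallel parameters, and activation checkpointing is applied in the training script; the module reference below is a placeholder for your model’s transformer layer stack.

# Illustrative memory-saving flags in the modelparallel parameters
memory_saving_parameters = {
    "shard_optimizer_state": True,   # shard optimizer state across the data parallel group
    "offload_activations": True,     # move checkpointed activations to CPU memory
    "microbatches": 4,               # offloading needs more than one micro-batch
}

# In the training script, mark a submodule for activation checkpointing;
# transformer_layers is a placeholder reference to your model's layer stack
smp.set_activation_checkpointing(transformer_layers)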

We have run many experiments to optimize training and tuning GPT-J on SageMaker with the SMP library. We managed to reduce the GPT-J training time for an epoch on SageMaker from 58 minutes to less than 10 minutes, six times faster per epoch. Initialization plus model and dataset download from Amazon Simple Storage Service (Amazon S3) took less than a minute, tracing and auto-partitioning with GPU as the tracing device took less than a minute, and training an epoch took 8 minutes using tensor parallelism on one ml.p4d.24xlarge instance, FP16 precision, and a SageMaker Hugging Face estimator.

To reduce training time when training GPT-J on SageMaker, we recommend the following best practices (a combined configuration sketch follows the list):

  • Store your pretrained model on Amazon S3
  • Use FP16 precision
  • Use GPU as a tracing device
  • Use auto-partitioning, activation checkpointing, and optimizer state sharding:
    • auto_partition: True
    • shard_optimizer_state: True
  • Use tensor parallelism
  • Use a SageMaker training instance with multiple GPUs such as ml.p3.16xlarge, ml.p3dn.24xlarge, ml.g5.48xlarge, ml.p4d.24xlarge, or ml.p4de.24xlarge.
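
Putting these recommendations together, a GPT-J training job could be configured roughly as follows. This is a sketch rather than the exact configuration from our experiments; the degrees, versions, script name, and paths are placeholders.

from sagemaker.huggingface import HuggingFace

# Illustrative distribution configuration combining the recommendations above
smp_parameters = {
    "auto_partition": True,
    "shard_optimizer_state": True,
    "tensor_parallel_degree": 8,      # kept within a single ml.p4d.24xlarge node
    "pipeline_parallel_degree": 1,
    "ddp": True,
    "fp16": True,                     # FP16 training (supported in recent SMP versions)
}

distribution = {
    "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
    "mpi": {"enabled": True, "processes_per_host": 8},
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",           # placeholder; see the example notebooks for a complete script
    instance_type="ml.p4d.24xlarge",
    instance_count=1,
    role="<your-sagemaker-execution-role>",
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    distribution=distribution,
    hyperparameters={"model_name_or_path": "s3://<your-bucket>/gpt-j-6B"},  # pretrained model stored on Amazon S3
)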

GPT-J model training and tuning on SageMaker with the SMP library

A working step-by-step code sample is available on the Amazon SageMaker Examples public repository. Navigate to the training/distributed_training/pytorch/model_parallel/gpt-j folder and open the train_gptj_smp_tensor_parallel_notebook.ipynb Jupyter notebook for the tensor parallelism example and train_gptj_smp_notebook.ipynb for the pipeline parallelism example. You can find a code walkthrough in our Generative AI on Amazon SageMaker workshop.

This notebook walks you through how to use the tensor parallelism features provided by the SageMaker model parallelism library. You’ll learn how to run FP16 training of the GPT-J model with tensor parallelism and pipeline parallelism on the GLUE sst2 dataset.

Summary

The SageMaker model parallel library offers several functionalities that reduce cost and speed up training LLMs on SageMaker. You can also learn from and run sample code for BERT, GPT-2, and GPT-J in the Amazon SageMaker Examples public repository. To learn more about AWS best practices for training LLMs using the SMP library, refer to the following resources:

To learn how one of our customers achieved low-latency GPT-J inference on SageMaker, refer to How Mantium achieves low-latency GPT-J inference with DeepSpeed on Amazon SageMaker.

If you’re looking to accelerate time-to-market of your LLMs and reduce your costs, SageMaker can help. Let us know what you build!


About the Authors

Zmnako Awrahman, PhD, is a Practice Manager, ML SME, and Machine Learning Technical Field Community (TFC) member at Global Competency Center, Amazon Web Services. He helps customers leverage the power of the cloud to extract value from their data with data analytics and machine learning.

Roop Bains is a Senior Machine Learning Solutions Architect at AWS. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. He helps customers train, optimize, and deploy deep learning models.

Anastasia Pachni Tsitiridou is a Solutions Architect at AWS. Anastasia lives in Amsterdam and supports software businesses across the Benelux region in their cloud journey. Prior to joining AWS, she studied electrical and computer engineering with a specialization in computer vision. What she enjoys most nowadays is working with very large language models.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.

Wioletta Stobieniecka is a Data Scientist at AWS Professional Services. Throughout her professional career, she has delivered multiple analytics-driven projects for different industries such as banking, insurance, telco, and the public sector. Her knowledge of advanced statistical methods and machine learning is well combined with business acumen. She brings recent AI advancements to create value for customers.

Rahul Huilgol is a Senior Software Development Engineer in Distributed Deep Learning at Amazon Web Services.

Meet the Maker: Software Engineer Ramps Up NVIDIA Jetson to Build Self-Driving Skate Park

Kirk Kaiser

Kirk Kaiser grew up a fan of the video game Paperboy, where players act as cyclists delivering newspapers while encountering various obstacles, like ramps that appear in the middle of the street.

This was the inspiration behind the software developer’s latest project using the NVIDIA Jetson platform for edge AI and robotics — a self-driving skate ramp.

“I wanted the absurdity and fun of Paperboy to be a part of my life,” said Kaiser, an avid skateboarder based in Naples, Fla. “I was boarding one day with my dog Benji running beside me and I was like, ‘What if I had ramps that came with me?’”

He’s now building just that — technology that could lead to a portable, autonomous skate park.

So far, he’s developed an electric platform that can elevate a ramp and make it level with the ground. It’s steerable using a PS4 controller linked via Bluetooth to an NVIDIA Jetson Nano Developer Kit.

Now, he’s collecting data to train AI models that’ll enable the platform to recognize streets and obstacles — and eventually become fully autonomous — with the help of the new NVIDIA Jetson Orin Nano Developer Kit.

It’s a project for when he isn’t engrossed in his work as the head of developer relations at Gitpod, a startup that provides cloud development environments for software makers.

About the Maker

Kaiser learned software engineering at a young age and received a scholarship to a prestigious high school specialized in tech. There, he honed his programming skills before taking time in his early adulthood to see and experience the world in completely different ways.

At 18 years old, he packed a bag and lived for a year in a wildlife refuge in Costa Rica, where he worked on a permaculture farm, growing food and collecting rainwater to drink. Relocating to Vermont, Kaiser then spent a year farming with a Zen Buddhist before hiking 1,000 miles of the Appalachian Trail, passing through four states.

Upon leaving the trail, Kaiser launched a travel website, got his first software job at a cosmetics company, and worked in R&D for a lighting company before rekindling his passion for software engineering as a way to provide for his family — including his now four-year-old son.

His Inspiration

Before all of this, skateboarding was Kaiser’s greatest love. “I just wanted to skateboard as a kid,” he said. “I wanted to maximize the amount of time I could spend skateboarding.”

He built his own skate parks growing up, which made him familiar with the mechanics of building a wooden ramp — knowledge that came in handy when building the foundation of his latest Jetson-powered project.

And to inspire others to embark on inventive projects with technology, Kaiser authored Make Art With Python, a step-by-step introduction to programming for creative people.

He was spurred to write the book while talking to high school students at a biohacker bootcamp in New York.

“What the high schoolers said blew my mind — they basically thought that software engineering was for overachievers,” he said. “So I wanted to write a book that would convince younger people that programming is fundamentally a platform for creating worlds, and it can be for anyone, which is a really exciting thing.”

His Favorite Jetson Projects

Kaiser kicked off his self-driving skate park project 18 months ago, intending to start with a ramp about the size of a golf cart. The electrical components needed to steer it were prohibitively expensive, however, and getting such a large platform to break along two axes of rotation was incredibly challenging, he said.

Rescaling the project to the size of a skateboard itself, Kaiser bought a welder and a metal brake, learned how to use both tools for the first time, and built a platform that can raise and lower, as well as accept any kind of ramp.

It’s fully steerable along both axes thanks to the edge capabilities of NVIDIA Jetson. And the developer’s now training the platform’s self-driving features using Robot Operating System repositories available through the NVIDIA Isaac platform for accelerated, AI-powered robotics.

“In the machine learning space, NVIDIA is really the only show in town,” he said. “The Jetson platform is the industry standard for edge AI, and its compatibility with other development platforms and the onboard GPU are huge pluses.”

Kaiser dives deeper into the technical aspects of his skate ramp project on his blog.

The developer’s other favorite projects using the NVIDIA Jetson platform include training an AI model for turning lights off and on using a dab and T-pose, as well as creating an AI-powered camera for bird-watching.

“The acceleration of smaller-scale robotics is becoming more accessible to everyone,” Kaiser said, “which is really exciting because I think robotics is so damn cool.”

Go along for the ride by keeping up with Kaiser’s work, and learn more about the NVIDIA Jetson platform.
