Distributed Mask RCNN training with Amazon SageMakerCV

Computer vision algorithms are at the core of many deep learning applications. Self-driving cars, security systems, healthcare, logistics, and image processing all incorporate various aspects of computer vision. But despite their ubiquity, training computer vision algorithms, like Mask or Cascade RCNN, is hard. These models employ complex architectures, train on large datasets, and require computer clusters, often requiring dozens of GPUs.

Last year at AWS re:Invent we announced record-breaking Mask RCNN training times of 6:45 minutes on PyTorch and 6:12 minutes on TensorFlow, which we achieved through a series of algorithmic, system, and infrastructure improvements. Our model made heavy use of half precision computation, state-of-the-art optimizers and loss functions, the AWS Elastic Fabric Adapter, and a new parameter server distribution approach.

Now, we’re making these optimizations available in Amazon SageMaker in our new SageMakerCV package. SageMakerCV takes all the high performance tools we developed last year and combines them with the convenience features of SageMaker, such as interactive development in SageMaker Studio, Spot training, and streaming data directly from Amazon Simple Storage Service (Amazon S3).

The challenge of training object detection and instance segmentation

Object detection models, like Mask RCNN, have complex architectures. They typically involve a pretrained backbone, such as a ResNet model, a region proposal network, classifiers, and regression heads. Essentially, these models work like a collection of neural networks working on slightly different, but related, tasks. On top of that, developers often need to modify these models for their own use case. For example, along with the classifier, we might want a model that can identify human poses, as part of an autonomous vehicle project, in order to predict movement and behavior. This involves adding an additional network to the model, alongside the classifier and regression heads.

Mask RCNN architecture

The following diagram illustrates the Mask RCNN architecture.

For more information on Mask RCNN, see the following blog posts:

Modifying models like this is a time-consuming process. The updated model might train slower, or not converge as well as the previous model. SageMakerCV solves these issues by simplifying both the model modification and optimization process. The modification process is streamlined by modularizing the models, and using the interactive development environment in Studio. At the same time, we can apply all the optimizations we developed for our record training time to the new model.

GPU and algorithmic improvements

Several pieces of Mask RCNN are difficult to optimize for GPUs. For example, as part of the region proposal step, we want to reduce the number of regions using non-max suppression (NMS), the process of removing overlapping boxes. Many implementations of Mask RCNN run NMS on the CPU, which means moving a lot of data off the GPU in the middle of training. Other parts of the model, such as anchor generation and assignment, and ROI align, encounter similar problems.

As part of our Mask RCNN optimizations in 2020, we worked with NVIDIA to develop efficient CUDA implementations of NMS, ROI align, and anchor tools, all of which are built into SageMakerCV. This means data stays on the GPU and models train faster. Options for mixed and half precision training means larger batch sizes, shorter step times, and higher GPU utilization.

SageMakerCV also includes the same improved optimizers and loss functions we used in our record Mask RCNN training. NovoGrad means you can now train a model on batch sizes as large as 512. GIoU loss boosts both box and mask performance by around 5%. Combined, these improvements make it possible to train Mask RCNN to state-of-the-art levels of performance in under 7 minutes.

The following table summarizes the benchmark training times for Mask RCNN trained to MLPerf convergence levels using SageMakerCV on P4d.24xlarge instances SageMaker instances. Total time refers to the entire elapsed time, including SageMaker instance setup, Docker and data download, training, and evaluation.

Framework	Nodes	Total Time	Training Time	Box MaP	Seg MaP
PyTorch	1	1:33:04	1:25:59	37.8	34.1
PyTorch	2	0:57:05	0:50:21	38.0	34.4
PyTorch	4	0:36:27	0:29:40	37.9	34.3
TensorFlow	1	2:23:52	2:18:24	37.7	34.3
TensorFlow	2	1:09:02	1:03:29	37.8	34.5
TensorFlow	4	0:48:55	0:42:33	38.0	34.8

Interactive development

Our goal with SageMakerCV was not only to provide fast training models to our users, but also to make developing new models easier. To that end, we provide a series of template object detection models in a highly modularized format, with a simple registry structure for adding new pieces. We also provide tools to modify and test models directly in Studio, so you can quickly go from prototyping a model to launching a distributed training cluster.

For example, say you want to add a custom keypoint head to Mask RCNN in TensorFlow. You first build your new head using the TensorFlow 2 Keras API, and add the SageMakerCV registry decorator at the top. The registry is a set of dictionaries organized into sections of the model. For example, the HEADS section triggers when the build_detector function is called, and the KeypointHead value from the configuration file tells the build to include the new ROI head. See the following code:

import tensorflow as tf
from sagemakercv.builder import HEADS

@HEADS.register("KeypointHead")
class KeypointHead(tf.keras.Model):
    def __init__(self, cfg):
        ...

Then you can call your new head by adding it to a YAML configuration file:

MODEL:
    RCNN:
        ROI_HEAD: "KeypointHead"

You provide this new configuration when building a model:

from configs.default_config import _C as cfg
from sagemakercv.detection import build_detector

cfg.merge_from_file('keypoint_config.yaml')

model = build_detector(cfg)

We know that building a new model is never as straightforward as we’re describing here, so we provide example notebooks of how to prototype models in Studio. This allows developers to quickly iterate on and debug their ideas.

Distributed training

SageMakerCV uses the distributed training capabilities of SageMaker right out of the box. You can go from prototyping a model on a single GPU to launching training on dozens of GPUs with just a few lines of code. SageMakerCV automatically supports SageMaker Distributed Data Parallel, which uses EFA to provide unmatched multi-node scaling efficiency. We also provide support for DDP in PyTorch, and Horovod in TensorFlow. By default, SageMakerCV automatically selects the optimal distributed training strategy for the cluster configuration you select. All you have to do is set your instance type and number of nodes, and SageMakerCV takes care of the rest.

Distributed training also typically involves huge amounts of data, often in the order of many terabytes. Getting all that data onto the training instances can take time, providing it will even fit. To fix this problem, SageMakerCV provides built-in support for streaming data directly from Amazon S3 with our recently released S3 plugin, reducing startup times and training costs.

Get started

We provide detailed tutorial notebooks that walk you through the entire process, from getting the COCO dataset, to building a model in Studio, to launching a distributed cluster. What follows is a brief overview.

Follow the instructions in Onboard to Amazon SageMaker Studio Using Quick Start. On your Studio instance, open a system terminal and clone the SageMakerCV repo.

git clone https://github.com/aws-samples/amazon-sagemaker-cv

Create a new Studio notebook with the PyTorch DLC, and install SageMakerCV in editable mode:

cd amazon-sagemaker-cv/pytorch
pip install -e .

In your notebook, create a new training configuration:

from configs import cfg

cfg.SOLVER.OPTIMIZER="NovoGrad" 
cfg.SOLVER.BASE_LR=0.042
cfg.SOLVER.LR_SCHEDULE="COSINE"
cfg.SOLVER.IMS_PER_BATCH=384 
cfg.SOLVER.WEIGHT_DECAY=0.001 
cfg.SOLVER.MAX_ITER=5000
cfg.OPT_LEVEL="O1"

Set your data sources by using either channels, or an S3 location to stream data during training:

S3_DATA_LOCATION = 's3://my-bucket/coco/'
CHANNELS_DIR='/opt/ml/input/data/' # on node, set by SageMaker

channels = {'validation': os.path.join(S3_DATA_LOCATION, 'val2017'),
            'weights': S3_WEIGHTS_LOCATION,
            'annotations': os.path.join(S3_DATA_LOCATION, 'annotations')}
            
cfg.INPUT.VAL_INPUT_DIR = os.path.join(CHANNELS_DIR, 'validation') 
cfg.INPUT.TRAIN_ANNO_DIR = os.path.join(CHANNELS_DIR, 'annotations', 'instances_train2017.json')
cfg.INPUT.VAL_ANNO_DIR = os.path.join(CHANNELS_DIR, 'annotations', 'instances_val2017.json')
cfg.MODEL.WEIGHT=os.path.join(CHANNELS_DIR, 'weights', R50_WEIGHTS) 
cfg.INPUT.TRAIN_INPUT_DIR = os.path.join(S3_DATA_LOCATION, "train2017") 
cfg.OUTPUT_DIR = '/opt/ml/checkpoints' # SageMaker output dir

# Save the new configuration file
dist_config_file = f"configs/dist-training-config.yaml"
with open(dist_config_file, 'w') as outfile:
    with redirect_stdout(outfile): print(cfg.dump())
    
hyperparameters = {"config": dist_config_file}

Finally, we can launch a distributed training job. For example, we can say we want four ml.p4d.24xlarge instances, and train a model to state-of-the-art convergence in about 45 minutes:

estimator = PyTorch(
                entry_point='train.py', 
                source_dir='.', 
                py_version='py3',
                framework_version='1.8.1',
                role=get_execution_role(),
                instance_count=4,
                instance_type='ml.p4d.24xlarge',
                distribution={ "smdistributed": { "dataparallel": { "enabled": True } } } ,
                output_path='s3://my-bucket/output/',
                checkpoint_s3_uri='s3://my-bucket/checkpoints/',
                model_dir='s3://my-bucket/model/',
                hyperparameters=hyperparameters,
                volume_size=500,
)

estimator.fit(channels)

Clean up

After training your model, be sure to check that all your training instances are complete or stopped by using the SageMaker console and choosing Training Jobs in the navigation pane.

Also, make sure to stop all Studio instances by choosing the Studio session monitor (square inside a circle icon) at the left of the page in Studio. Choose the power icon next to any running instances to shut them down. Your files are saved on your Studio EBS.

Conclusion

SageMakerCV started life as our project to break training records for computer vision models. In the process, we developed new tools and techniques to boost both training speed and accuracy. Now, we’ve combined those advances with SageMaker’s unified machine learning development experience. By combining the latest algorithmic advances, GPU hardware, EFA, and the ability to stream huge datasets from Amazon S3, SageMakerCV is the ideal place to develop the most advanced computer vision models. We look forward to seeing what new models and applications the machine learning community develops, and welcome any and all contributions. To get started, see our comprehensive tutorial notebooks in PyTorch and TensorFlow on GitHub.

About the Authors

Ben Snyder is an applied scientist with AWS Deep Learning. His research interests include computer vision models, reinforcement learning, and distributed optimization. Outside of work, he enjoys cycling and backcountry camping.

Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for AWS Machine Learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.

Sami Kama is a software engineer in AWS Deep Learning with expertise in performance optimization, HPC/HTC, Deep learning frameworks and distributed computing. Sami aims to reduce the environmental impact of Deep Learning by increasing the computation efficiency. He enjoys spending time with his kids, catching up with science and technology and occasional video games.

Vedere AI