SageMaker Distribution is now available on Amazon SageMaker Studio

SageMaker Distribution is a pre-built Docker image containing many popular packages for machine learning (ML), data science, and data visualization. This includes deep learning frameworks like PyTorch, TensorFlow, and Keras; popular Python packages like NumPy, scikit-learn, and pandas; and IDEs like JupyterLab. In addition to this, SageMaker Distribution supports conda, micromamba, and pip as Python package managers.

In May 2023, we launched SageMaker Distribution as an open-source project at JupyterCon. This launch helped you use SageMaker Distribution to run experiments on your local environments. We are now natively providing that image in Amazon SageMaker Studio so that you gain the high performance, compute, and security benefits of running your experiments on Amazon SageMaker.

Compared to the earlier open-source launch, you have the following additional capabilities:

  • The open-source image is now available as a first-party image in SageMaker Studio. You can now simply choose the open-source SageMaker Distribution from the list when choosing an image and kernel for your notebooks, without having to create a custom image.
  • The SageMaker Python SDK package is now built-in with the image.

In this post, we show the features and advantages of using the SageMaker Distribution image.

Use SageMaker Distribution in SageMaker Studio

If you have access to an existing Studio domain, you can launch SageMaker Studio. To create a Studio domain, follow the directions in Onboard to Amazon SageMaker Domain.

  1. In the SageMaker Studio UI, choose File from the menu bar, choose New, and choose Notebook.
  2. When prompted for the image and instance, choose the SageMaker Distribution v0 CPU or SageMaker Distribution v0 GPU image.
  3. Choose your Kernel, then choose Select.

You can now start running your commands without needing to install common ML packages and frameworks! You can also run notebooks running on supported frameworks such as PyTorch and TensorFlow from the SageMaker examples repository, without having to switch the active kernels.

Run code remotely using SageMaker Distribution

In the public beta announcement, we discussed graduating notebooks from local compute environments to SageMaker Studio, and also operationalizing the notebook using notebook jobs.

Additionally, you can directly run your local notebook code as a SageMaker training job by simply adding a @remote decorator to your function.

Let’s try an example. Add the following code to your Studio notebook running on the SageMaker Distribution image:

from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
def divide(x, y):
    return x / y

divide(2, 3.0)

When you run the cell, the function will run as a remote SageMaker training job on an ml.m5.xlarge notebook, and the SDK automatically picks up the SageMaker Distribution image as the training image in Amazon Elastic Container Registry (Amazon ECR). For deep learning workloads, you can also run your script on multiple parallel instances.

Reproduce Conda environments from SageMaker Distribution elsewhere

SageMaker Distribution is available as a public Docker image. However, for data scientists more familiar with Conda environments than Docker, the GitHub repository also provides the environment files for each image build so you can build Conda environments for both CPU and GPU versions.

The build artifacts for each version are stored under the sagemaker-distribution/build_artifacts directory. To create the same environment as any of the available SageMaker Distribution versions, run the following commands, replacing the --file parameter with the right environment files:

conda create --name conda-sagemaker-distribution 
  --file sagemaker-distribution/build_artifacts/v0/v0.2/v0.2.1/cpu.env.out
# activate the environment
conda activate conda-sagemaker-distribution

Customize the open-source SageMaker Distribution image

The open-source SageMaker Distribution image has the most commonly used packages for data science and ML. However, data scientists might require access to additional packages, and enterprise customers might have proprietary packages that provide additional capabilities for their users. In such cases, there are multiple options to have a runtime environment with all required packages. In order of increasing complexity, they are listed as follows:

  • You can install packages directly on the notebook. We recommend Conda and micromamba, but pip also works.
  • Data scientists familiar with Conda for package management can reproduce the Conda environment from SageMaker Distribution elsewhere and install and manage additional packages in that environment going forward.
  • If administrators want a repeatable and controlled runtime environment for their users, they can extend SageMaker Distribution’s Docker images and maintain their own image. See Bring your own SageMaker image for detailed instructions to create and use a custom image in Studio.

Clean up

If you experimented with SageMaker Studio, shut down all Studio apps to avoid paying for unused compute usage. See Shut down and Update Studio Apps for instructions.

Conclusion

Today, we announced the launch of the open-source SageMaker Distribution image within SageMaker Studio. We showed you how to use the image in SageMaker Studio as one of the available first-party images, how to operationalize your scripts using the SageMaker Python SDK @remote decorator, how to reproduce the Conda environments from SageMaker Distribution outside Studio, and how to customize the image. We encourage you to try out SageMaker Distribution and share your feedback through GitHub!

Additional References


About the authors

Durga Sury is an ML Solutions Architect in the Amazon SageMaker Service SA team. She is passionate about making machine learning accessible to everyone. In her 4 years at AWS, she has helped set up AI/ML platforms for enterprise customers. When she isn’t working, she loves motorcycle rides, mystery novels, and hiking with her 5-year-old husky.

Ketan Vijayvargiya is a Senior Software Development Engineer in Amazon Web Services (AWS). His focus areas are machine learning, distributed systems and open source. Outside work, he likes to spend his time self-hosting and enjoying nature.

Read More