With help from the Alexa Fund, the company is making it easier to virtually reconstruct reality.Read More
Build an email spam detector using Amazon SageMaker
Spam emails, also known as junk mail, are sent to a large number of users at once and often contain scams, phishing content, or cryptic messages. Spam emails are sometimes sent manually by a human, but most often they are sent using a bot. Examples of spam emails include fake ads, chain emails, and impersonation attempts. There is a risk that a particularly well-disguised spam email may land in your inbox, which can be dangerous if clicked on. It’s important to take extra precautions to protect your device and sensitive information.
As technology is improving, the detection of spam emails becomes a challenging task due to its changing nature. Spam is quite different from other types of security threats. It may at first appear like an annoying message and not a threat, but it has an immediate effect. Also spammers often adapt new techniques. Organizations who provide email services want to minimize spam as much as possible to avoid any damage to their end customers.
In this post, we show how straightforward it is to build an email spam detector using Amazon SageMaker. The built-in BlazingText algorithm offers optimized implementations of Word2vec and text classification algorithms. Word2vec is useful for various natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is essential for applications like web searches, information retrieval, ranking, and document classification.
Solution overview
This post demonstrates how you can set up email spam detector and filter spam emails using SageMaker. Let’s see how a spam detector typically works, as shown in the following diagram.
Emails are sent through a spam detector. An email is sent to the spam folder if the spam detector detects it as spam. Otherwise, it’s sent to the customer’s inbox.
We walk you through the following steps to set up our spam detector model:
- Download the sample dataset from the GitHub repo.
- Load the data in an Amazon SageMaker Studio notebook.
- Prepare the data for the model.
- Train, deploy, and test the model.
Prerequisites
Before diving into this use case, complete the following prerequisites:
- Set up an AWS account.
- Set up a SageMaker domain.
- Create an Amazon Simple Storage Service (Amazon S3) bucket. For instructions, see Create your first S3 bucket.
Download the dataset
Download the email_dataset.csv from GitHub and upload the file to the S3 bucket.
The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.
Load the data in SageMaker Studio
To perform the data load, complete the following steps:
- Download the
spam_detector.ipynb
file from GitHub and upload the file in SageMaker Studio. - In your Studio notebook, open the
spam_detector.ipynb
notebook. - If you are prompted to choose a Kernel, choose the Python 3 (Data Science 3.0) kernel and choose Select. If not, verify that the right kernel has been automatically selected.
- Import the required Python library and set the roles and the S3 buckets. Specify the S3 bucket and prefix where you uploaded email_dataset.csv.
- Run the data load step in the notebook.
- Check if the dataset is balanced or not based on the Category labels.
We can see our dataset is balanced.
Prepare the data
The BlazingText algorithm expects the data in the following format:
Here’s an example:
Check Training and Validation Data Format for the BlazingText Algorithm.
You now run the data preparation step in the notebook.
- First, you need to convert the Category column to an integer. The following cell replaces the SPAM value with 1 and the HAM value with 0.
- The next cell adds the prefix
__label__
to each Category value and tokenizes the Message column.
- The next step is to split the dataset into train and validation datasets and upload the files to the S3 bucket.
Train the model
To train the model, complete the following steps in the notebook:
- Set up the BlazingText estimator and create an estimator instance passing the container image.
- Set the learning mode hyperparameter to supervised.
BlazingText has both unsupervised and supervised learning modes. Our use case is text classification, which is supervised learning.
- Create the train and validation data channels.
- Start training the model.
- Get the accuracy of the train and validation dataset.
Deploy the model
In this step, we deploy the trained model as an endpoint. Choose your preferred instance
Test the model
Let’s provide an example of three email messages that we want to get predictions for:
- Click on below link, provide your details and win this award
- Best summer deal here
- See you in the office on Friday.
Tokenize the email message and specify the payload to use when calling the REST API.
Now we can predict the email classification for each email. Call the predict method of the text classifier, passing the tokenized sentence instances (payload) into the data argument.
Clean up
Finally , you can delete the endpoint to avoid any unexpected cost.
Also, delete the data file from S3 bucket.
Conclusion
In this post, we walked you through the steps to create an email spam detector using the SageMaker BlazingText algorithm. With the BlazingText algorithm, you can scale to large datasets. BlazingText is used for textual analysis and text classification problems, and has both unsupervised and supervised learning modes. You can use the algorithm for use cases like customer sentiment analysis and text classification.
To learn more about the BlazingText algorithm, check out BlazingText algorithm.
About the Author
Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.
Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart
Today, we are excited to announce that Llama 2 foundation models developed by Meta are available for customers through Amazon SageMaker JumpStart. The Llama 2 family of large language models (LLMs) is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Fine-tuned LLMs, called Llama-2-chat, are optimized for dialogue use cases. You can easily try out these models and use them with SageMaker JumpStart, which is a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML.
In this post, we walk through how to use Llama 2 models via SageMaker JumpStart.
What is Llama 2
Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. Llama 2 is intended for commercial and research use in English. It comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pre-trained and fine-tuned variations. According to Meta, the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 was pre-trained on 2 trillion tokens of data from publicly available sources. The tuned models are intended for assistant-like chat, whereas pre-trained models can be adapted for a variety of natural language generation tasks. Regardless of which version of the model a developer uses, the responsible use guide from Meta can assist in guiding additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.
What is SageMaker JumpStart
With SageMaker JumpStart, ML practitioners can choose from a broad selection of open source foundation models. ML practitioners can deploy foundation models to dedicated Amazon SageMaker instances from a network isolated environment and customize models using SageMaker for model training and deployment.
You can now discover and deploy Llama 2 with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Llama 2 models are available today in Amazon SageMaker Studio, initially in us-east 1
and us-west 2
regions.
Discover models
You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.
SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.
Once you’re on the SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.
From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find two flagship Llama 2 models in the Foundation Models: Text Generation carousel. If you don’t see Llama 2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.
You can also find other four model variants by choosing Explore all Text Generation Models or searching for llama
in the search box.
You can choose the model card to view details about the model such as license, data used to train, and how to use. You can also find two buttons, Deploy and Open Notebook, which help you use the model.
When you choose either button, a pop-up will show the end-user license agreement and acceptable use policy for you to acknowledge.
Upon acknowledging, you will proceed to the next step to use the model.
Deploy a model
When you choose Deploy and acknowledge the terms, model deployment will start. Alternatively, you can deploy through the example notebook that shows up by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.
To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id
. You can deploy any of the selected models on SageMaker with the following code:
This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:
Fine-tuned chat models (Llama-2-7b-chat, Llama-2-13b-chat, Llama-2-70b-chat) accept a history of chat between the user and the chat assistant, and generate the subsequent chat. The pre-trained models (Llama-2-7b, Llama-2-13b, Llama-2-70b) requires a string prompt and perform text completion on the provided prompt. See the following code:
Note that by default, accept_eula
is set to false. You need to set accept_eula=true
to invoke the endpoint successfully. By doing so, you accept the user license agreement and acceptable use policy as mentioned earlier. You can also download the license agreement.
Custom_attributes
used to pass EULA are key/value pairs. The key and value are separated by =
and pairs are separated by ;
. If the user passes the same key more than once, the last value is kept and passed to the script handler (i.e., in this case, used for conditional logic). For example, if accept_eula=false; accept_eula=true
is passed to the server, then accept_eula=true
is kept and passed to the script handler.
Inference parameters control the text generation process at the endpoint. The maximum new tokens control refers to the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary, and each token may not be an English language word. Temperature controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.
The following table lists all the Llama models available in SageMaker JumpStart along with the model_ids
, default instance types, and the maximum number of total tokens (sum of number of input tokens and number of generated tokens) supported for each of these models.
Model Name | Model ID | Max Total Tokens | Default Instance Type |
Llama-2-7b | meta-textgeneration-llama-2-7b | 4096 | ml.g5.2xlarge |
Llama-2-7b-chat | meta-textgeneration-llama-2-7b-f | 4096 | ml.g5.2xlarge |
Llama-2-13b | meta-textgeneration-llama-2-13b | 4096 | ml.g5.12xlarge |
Llama-2-13b-chat | meta-textgeneration-llama-2-13b-f | 4096 | ml.g5.12xlarge |
Llama-2-70b | meta-textgeneration-llama-2-70b | 4096 | ml.g5.48xlarge |
Llama-2-70b-chat | meta-textgeneration-llama-2-70b-f | 4096 | ml.g5.48xlarge |
Note that SageMaker endpoints have a timeout limit of 60s. Thus, even though the model may be able to generate 4096 tokens, if text generation takes more than 60s, request will fail. For 7B, 13B, and 70B models, we recommend to set max_new_tokens
no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens less than 4K.
Inference and example prompts for Llama-2-70b
You can use Llama models for text completion for any piece of text. Through text generation, you can perform a variety of tasks, such as answering questions, language translation, sentiment analysis, and many more. Input payload to the endpoint looks like the following code:
The following are some sample example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
.
In the next example, we show how to use Llama models with few-shot in-context learning, where we provide training samples available to the model. Note that we only make inference on the deployed model and during this process, model weights don’t change.
Inference and example prompts for Llama-2-70b-chat
With Llama-2-Chat models, which are optimized for dialogue use cases, the input to the chat model endpoints is the previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas that define the chat assistant’s behavior. The input payload to the endpoint looks like the following code:
The following are some sample example prompts and the text generated by the model. All outputs are generated with the inference parameters {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6}
.
In the following example, the user has had a conversation with the assistant about tourist sites in Paris. Next, the user is inquiring about the first option recommended by the chat assistant.
In the following examples, we set the system’s configuration:
Clean up
After you’re done running the notebook, make sure to delete all resources so that all the resources that you created in the process are deleted and your billing is stopped:
Conclusion
In this post, we showed you how to get started with Llama 2 models in SageMaker Studio. With this, you have access to six Llama 2 foundation models that contain billions of parameters. Because foundation models are pre-trained, they can also help lower training and infrastructure costs and enable customization for your use case. To get started with SageMaker JumpStart, visit the following resources:
- SageMaker JumpStart documentation
- SageMaker JumpStart Foundation Models documentation
- SageMaker JumpStart product detail page
- SageMaker JumpStart model catalog
About the authors
June Won is a product manager with SageMaker JumpStart. He focuses on making foundation models easily discoverable and usable to help customers build generative AI applications. His experience at Amazon also includes mobile shopping application and last mile delivery.
Dr. Vivek Madan is an Applied Scientist with the Amazon SageMaker JumpStart team. He got his PhD from University of Illinois at Urbana-Champaign and was a Post Doctoral Researcher at Georgia Tech. He is an active researcher in machine learning and algorithm design and has published papers in EMNLP, ICLR, COLT, FOCS, and SODA conferences.
Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.
Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.
Sundar Ranganathan is the Global Head of GenAI/Frameworks GTM Specialists at AWS. He focuses on developing GTM strategy for large language models, GenAI, and large-scale ML workloads across AWS services like Amazon EC2, EKS, EFA, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.
Configure cross-account access of Amazon Redshift clusters in Amazon SageMaker Studio using VPC peering
With cloud computing, as compute power and data became more available, machine learning (ML) is now making an impact across every industry and is a core part of every business and industry.
Amazon SageMaker Studio is the first fully integrated ML development environment (IDE) with a web-based visual interface. You can perform all ML development steps and have complete access, control, and visibility into each step required to build, train, and deploy models.
Amazon Redshift is a fully managed, fast, secure, and scalable cloud data warehouse. Organizations often want to use SageMaker Studio to get predictions from data stored in a data warehouse such as Amazon Redshift.
As described in the AWS Well-Architected Framework, separating workloads across accounts enables your organization to set common guardrails while isolating environments. This can be particularly useful for certain security requirements, as well as to simplify cost controls and monitoring between projects and teams. Organizations with a multi-account architecture typically have Amazon Redshift and SageMaker Studio in two separate AWS accounts. Also, Amazon Redshift and SageMaker Studio are typically configured in VPCs with private subnets to improve security and reduce the risk of unauthorized access as a best practice.
Amazon Redshift natively supports cross-account data sharing when RA3 node types are used. If you’re using any other Amazon Redshift node types, such as DS2 or DC2, you can use VPC peering to establish a cross-account connection between Amazon Redshift and SageMaker Studio.
In this post, we walk through step-by-step instructions to establish a cross-account connection to any Amazon Redshift node type (RA3, DC2, DS2) by connecting the Amazon Redshift cluster located in one AWS account to SageMaker Studio in another AWS account in the same Region using VPC peering.
Solution overview
We start with two AWS accounts: a producer account with the Amazon Redshift data warehouse, and a consumer account for Amazon SageMaker ML use cases that has SageMaker Studio set up. The following is a high-level overview of the workflow:
- Set up SageMaker Studio with
VPCOnly
mode in the consumer account. This prevents SageMaker from providing internet access to your studio notebooks. All SageMaker Studio traffic is through the specified VPC and subnets. - Update your SageMaker Studio domain to turn on
SourceIdentity
to propagate the user profile name. - Create an AWS Identity and Access Management (IAM) role in the Amazon Redshift producer account that the SageMaker Studio IAM role will assume to access Amazon Redshift.
- Update the SageMaker IAM execution role in the SageMaker Studio consumer account that SageMaker Studio will use to assume the role in the producer Amazon Redshift account.
- Set up a peering connection between VPCs in the Amazon Redshift producer account and SageMaker Studio consumer account.
- Query Amazon Redshift in SageMaker Studio in the consumer account.
The following diagram illustrates our solution architecture.
Prerequisites
The steps in this post assume that Amazon Redshift is launched in a private subnet in the Amazon Redshift producer account. Launching Amazon Redshift in a private subnet provides an additional layer of security and isolation compared to launching it in a public subnet because the private subnet is not directly accessible from the internet and more secure from external attacks.
To download public libraries, you must create a VPC and a private and public subnet in the SageMaker consumer account. Then launch a NAT gateway in the public subnet and add an internet gateway for SageMaker Studio in the private subnet to access the internet. For instructions on how to establish a connection to a private subnet, refer to How do I set up a NAT gateway for a private subnet in Amazon VPC?
Set up SageMaker Studio with VPCOnly mode in the consumer account
To create SageMaker Studio with VPCOnly
mode, complete the following steps:
- On the SageMaker console, choose Studio in the navigation pane.
- Launch SageMaker Studio, choose Standard setup, and choose Configure.
If you’re already using AWS IAM Identity Center (successor to AWS Single Sign-On) for accessing your AWS accounts, you can use it for authentication. Otherwise, you can use IAM for authentication and use your existing federated roles.
- In the General settings section, select Create a new role.
- In the Create an IAM role section, optionally specify your Amazon Simple Storage Service (Amazon S3) buckets by selecting Any, Specific, or None, then choose Create role.
This creates a SageMaker execution role, such as AmazonSageMaker-ExecutionRole-00000000
.
- Under Network and Storage Section, choose your VPC, subnet (private subnet), and security group that you created as a prerequisite.
- Select VPC Only, then choose Next.
Update your SageMaker Studio domain to turn on SourceIdentity to propagate the user profile name
SageMaker Studio is integrated with AWS CloudTrail to enable administrators to monitor and audit user activity and API calls from SageMaker Studio notebooks. You can configure SageMaker Studio to record the user identity (specifically, the user profile name) to monitor and audit user activity and API calls from SageMaker Studio notebooks in CloudTrail events.
To log specific user activity among several user profiles, we recommended that you turn on SourceIdentity
to propagate the SageMaker Studio domain with the user profile name. This allows you to persist the user information into the session so you can attribute actions to a specific user. This attribute is also persisted over when you chain roles, so you can get fine-grained visibility into their actions in the producer account. As of the time this post was written, you can only configure this using the AWS Command Line Interface (AWS CLI) or any command line tool.
To update this configuration, all apps in the domain must be in the Stopped or Deleted state.
Use the following code to enable the propagation of the user profile name as the SourceIdentity
:
This requires that you add sts:SetSourceIdentity
in the trust relationship for your execution role.
Create an IAM role in the Amazon Redshift producer account that SageMaker Studio must assume to access Amazon Redshift
To create a role that SageMaker will assume to access Amazon Redshift, complete the following steps:
- Open the IAM console in the Amazon Redshift producer account.
- Choose Roles in the navigation pane, then choose Create role.
- On the Select trusted entity page, select Custom trust policy.
- Enter the following custom trust policy into the editor and provide your SageMaker consumer account ID and the SageMaker execution role that you created:
- Choose Next.
- On the Add required permissions page, choose Create policy.
- Add the following sample policy and make necessary edits based on your configuration.
- Save the policy by adding a name, such as
RedshiftROAPIUserAccess
.
The SourceIdentity
attribute is used to tie the identity of the original SageMaker Studio user to the Amazon Redshift database user. The actions by the user in the producer account can then be monitored using CloudTrail and Amazon Redshift database audit logs.
- On the Name, review, and create page, enter a role name, review the settings, and choose Create role.
Update the IAM role in the SageMaker consumer account that SageMaker Studio assumes in the Amazon Redshift producer account
To update the SageMaker execution role for it to assume the role that we just created, complete the following steps:
- Open the IAM console in the SageMaker consumer account.
- Choose Roles in the navigation pane, then choose the SageMaker execution role that we created (
AmazonSageMaker-ExecutionRole-*
). - In the Permissions policy section, on the Add permissions menu, choose Create inline policy.
- In the editor, on the JSON tab, enter the following policy, where <StudioRedshiftRoleARN> is the ARN of the role you created in the Amazon Redshift producer account:
You can get the ARN of the role created in the Amazon Redshift producer account on the IAM console, as shown in the following screenshot.
- Choose Review policy.
- For Name, enter a name for your policy.
- Choose Create policy.
Your permission policies should look similar to the following screenshot.
Set up a peering connection between the VPCs in the Amazon Redshift producer account and SageMaker Studio consumer account
To establish communication between the SageMaker Studio VPC and Amazon Redshift VPC, the two VPCs need to be peered using VPC peering. Complete the following steps to establish a connection:
- In either the Amazon Redshift or SageMaker account, open the Amazon VPC console.
- In the navigation pane, choose Peering connections, then choose Create peering connection.
- For Name, enter a name for your connection.
- Under Select a local VPC to peer with, choose a local VPC.
- Under Select another VPC to peer with, specify another VPC in the same Region and another account.
- Choose Create peering connection.
- Review the VPC peering connection and choose Accept request to activate.
After the VPC peering connection is successfully established, you create routes on both the SageMaker and Amazon Redshift VPCs to complete connectivity between them.
- In the SageMaker account, open the Amazon VPC console.
- Choose Route tables in the navigation pane, then choose the VPC that is associated with SageMaker and edit the routes.
- Add CIDR for the destination Amazon Redshift VPC and the target as the peering connection.
- Additionally, add a NAT gateway.
- Choose Save changes.
- In the Amazon Redshift account, open the Amazon VPC console.
- Choose Route tables in the navigation pane, then choose the VPC that is associated with Amazon Redshift and edit the routes.
- Add CIDR for the destination SageMaker VPC and the target as the peering connection.
- Additionally, add an internet gateway.
- Choose Save changes.
You can connect to SageMaker Studio from your VPC through an interface endpoint in your VPC instead of connecting over the internet. When you use a VPC interface endpoint, communication between your VPC and the SageMaker API or runtime is conducted entirely and securely within the AWS network.
- To create a VPC endpoint, in the SageMaker account, open the VPC console.
- Choose Endpoints in the navigation pane, then choose Create endpoint.
- Specify the SageMaker VPC, the respective subnets and appropriate security groups to allow inbound and outbound NFS traffic for your SageMaker notebooks domain, and choose Create VPC endpoint.
Query Amazon Redshift in SageMaker Studio in the consumer account
After all the networking has been successfully established, follow the steps in this section to connect to the Amazon Redshift cluster in the SageMaker Studio consumer account using the AWS SDK for pandas library:
- In SageMaker Studio, create a new notebook.
- If the AWS SDK for pandas package is not installed you can install it using the following:
This installation is not persistent and will be lost if the KernelGateway App is deleted. Custom packages can be added as part of a Lifecycle Configuration.
- Enter the following code in the first cell and run the code. Replace
RoleArn
andregion_name
values based on your account settings:
- Enter the following code in a new cell and run the code to get the current SageMaker user profile name:
- Enter the following code in a new cell and run the code:
To successfully query Amazon Redshift, your database administrator needs to assign the newly created user with the required read permissions within the Amazon Redshift cluster in the producer account.
- Enter the following code in a new cell, update the query to match your Amazon Redshift table, and run the cell. This should return the records successfully for further data processing and analysis.
You can now start building your data transformations and analysis based on your business requirements.
Clean up
To clean up any resources to avoid incurring recurring costs, delete the SageMaker VPC endpoints, Amazon Redshift cluster, and SageMaker Studio apps, users, and domain. Also delete any S3 buckets and objects you created.
Conclusion
In this post, we showed how to establish a cross-account connection between private Amazon Redshift and SageMaker Studio VPCs in different accounts using VPC peering and access Amazon Redshift data in SageMaker Studio using IAM role chaining, while also logging the user identity when the user accessed Amazon Redshift from SageMaker Studio. With this solution, you eliminate the need to manually move data between accounts to access data. We also walked through how to access the Amazon Redshift cluster using the AWS SDK for pandas library in SageMaker Studio and prepare the data for your ML use cases.
To learn more about Amazon Redshift and SageMaker, refer to the Amazon Redshift Database Developer Guide and Amazon SageMaker Documentation.
About the Authors
Supriya Puragundla is a Senior Solutions Architect at AWS. She helps key customer accounts on their AI and ML journey. She is passionate about data-driven AI and the area of depth in machine learning.
Marc Karp is a Machine Learning Architect with the Amazon SageMaker team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.
Effectively solve distributed training convergence issues with Amazon SageMaker Hyperband Automatic Model Tuning
Recent years have shown amazing growth in deep learning neural networks (DNNs). This growth can be seen in more accurate models and even opening new possibilities with generative AI: large language models (LLMs) that synthesize natural language, text-to-image generators, and more. These increased capabilities of DNNs come with the cost of having massive models that require significant computational resources in order to be trained. Distributed training addresses this problem with two techniques: data parallelism and model parallelism. Data parallelism is used to scale the training process over multiple nodes and workers, and model parallelism splits a model and fits them over the designated infrastructure. Amazon SageMaker distributed training jobs enable you with one click (or one API call) to set up a distributed compute cluster, train a model, save the result to Amazon Simple Storage Service (Amazon S3), and shut down the cluster when complete. Furthermore, SageMaker has continuously innovated in the distributed training space by launching features like heterogeneous clusters and distributed training libraries for data parallelism and model parallelism.
Efficient training on a distributed environment requires adjusting hyperparameters. A common example of good practice when training on multiple GPUs is to multiply batch (or mini-batch) size by the GPU number in order to keep the same batch size per GPU. However, adjusting hyperparameters often impacts model convergence. Therefore, distributed training needs to balance three factors: distribution, hyperparameters, and model accuracy.
In this post, we explore the effect of distributed training on convergence and how to use Amazon SageMaker Automatic Model Tuning to fine-tune model hyperparameters for distributed training using data parallelism.
The source code mentioned in this post can be found on the GitHub repository (an m5.xlarge instance is recommended).
Scale out training from a single to distributed environment
Data parallelism is a way to scale the training process to multiple compute resources and achieve faster training time. With data parallelism, data is partitioned among the compute nodes, and each node computes the gradients based on their partition and updates the model. These updates can be done using one or multiple parameter servers in an asynchronous, one-to-many, or all-to-all fashion. Another way can be to use an AllReduce algorithm. For example, in the ring-allreduce algorithm, each node communicates with only two of its neighboring nodes, thereby reducing the overall data transfers. To learn more about parameter servers and ring-allreduce, see Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker. With regards to data partitioning, if there are n compute nodes, then each node should get a subset of the data, approximately 1/n in size.
To demonstrate the effect of scaling out training on model convergence, we run two simple experiments:
- Train an image classification model using a fully connected-layer DNN with ReLU activation functions using MXNet and Gluon frameworks. For training data, we used the MNIST dataset of handwritten digits. We used the source provided in the SageMaker example repository.
- Train a binary classification model using the SageMaker built-in XGBoost algorithm. We used the direct marketing dataset to predict bank customers who are likely to respond with a specific offer. The source code and steps to reproduce the experiment can be found on the GitHub repo.
Each model training ran twice: on a single instance and distributed over multiple instances. For the DNN distributed training, in order to fully utilize the distributed processors, we multiplied the mini-batch size by the number of instances (four). The following table summarizes the setup and results.
Problem type | Image classification | Binary classification | ||
Model | DNN | XGBoost | ||
Instance | ml.c4.xlarge | ml.m5.2xlarge | ||
Data set |
(Labeled images) |
Direct Marketing (tabular, numeric and vectorized categories) |
||
Validation metric | Accuracy | AUC | ||
Epocs/Rounds | 20 | 150 | ||
Number of Instances | 1 | 4 | 1 | 3 |
Distribution type | N/A | Parameter server | N/A | AllReduce |
Training time (minutes) | 8 | 3 | 3 | 1 |
Final Validation score | 0.97 | 0.11 | 0.78 | 0.63 |
For both models, the training time was reduced almost linearly by the distribution factor. However, model convergence suffered a significant drop. This behavior is consistent for the two different models, the different compute instances, the different distribution methods, and different data types. So, why did distributing the training process affect model accuracy?
There are a number of theories that try to explain this effect:
- When tensor updates are big in size, traffic between workers and the parameter server can get congested. Therefore, asynchronous parameter servers will suffer significantly worse convergence due to delays in weights updates [1].
- Increasing batch size can lead to over-fitting and poor generalization, thereby reducing the validation accuracy [2].
- When asynchronously updating model parameters, some DNNs might not be using the most recent updated model weights; therefore, they will be calculating gradients based on weights that are a few iterations behind. This leads to weight staleness [3] and can be caused by a number of reasons.
- Some hyperparameters are model or optimizer specific. For example, the XGBoost official documentation says that the
exact
value for thetree_mode
hyperparameter doesn’t support distributed training because XGBoost employs row splitting data distribution whereas theexact
tree method works on a sorted column format. - Some researchers proposed that configuring a larger mini-batch may lead to gradients with less stochasticity. This can happen when the loss function contains local minima and saddle points and no change is made to step size, to optimization getting stuck in such local minima or saddle point [4].
Optimize for distributed training
Hyperparameter optimization (HPO) is the process of searching and selecting a set of hyperparameters that are optimal for a learning algorithm. SageMaker Automatic Model Tuning (AMT) provides HPO as a managed service by running multiple training jobs on the provided dataset. SageMaker AMT searches the ranges of hyperparameters that you specify and returns the best values, as measured by a metric that you choose. You can use SageMaker AMT with the built-in algorithms or use your custom algorithms and containers.
However, optimizing for distributed training differs from common HPO because instead of launching a single instance per training job, each job actually launches a cluster of instances. This means a greater impact on cost (especially if you consider costly GPU-accelerated instances, which are typical for DNN). In addition to AMT limits, you could possibly hit SageMaker account limits for concurrent number of training instances. Finally, launching clusters can introduce operational overhead due to longer starting time. SageMaker AMT has specific features to address these issues. Hyperband with early stopping ensures that well-performing hyperparameters configurations are fine-tuned and those that underperform are automatically stopped. This enables efficient use of training time and reduces unnecessary costs. Also, SageMaker AMT fully supports the use of Amazon EC2 Spot Instances, which can optimize the cost of training up to 90% over on-demand instances. With regards to long start times, SageMaker AMT automatically reuses training instances within each tuning job, thereby reducing the average startup time of each training job by 20 times. Additionally, you should follow AMT best practices, such as choosing the relevant hyperparameters, their appropriate ranges and scales, and the best number of concurrent training jobs, and setting a random seed to reproduce results.
In the next section, we see these features in action as we configure, run, and analyze an AMT job using the XGBoost example we discussed earlier.
Configure, run, and analyze a tuning job
As mentioned earlier, the source code can be found on the GitHub repo. In Steps 1–5, we download and prepare the data, create the xgb3
estimator (the distributed XGBoost estimator is set to use three instances), run the training jobs, and observe the results. In this section, we describe how to set up the tuning job for that estimator, assuming you already went through Steps 1–5.
A tuning job computes optimal hyperparameters for the training jobs it launches by using a metric to evaluate performance. You can configure your own metric, which SageMaker will parse based on regex you configure and emit to stdout
, or use the metrics of SageMaker built-in algorithms. In this example, we use the built-in XGBoost objective metric, so we don’t need to configure a regex. To optimize for model convergence, we optimize based on the validation AUC metric:
We tune seven hyperparameters:
- num_round – Number of rounds for boosting during the training.
- eta – Step size shrinkage used in updates to prevent overfitting.
- alpha – L1 regularization term on weights.
- min_child_weight – Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than
min_child_weight
, the building process gives up further partitioning. - max_depth – Maximum depth of a tree.
- colsample_bylevel – Subsample ratio of columns for each split, in each level. This subsampling takes place once for every new depth level reached in a tree.
- colsample_bytree – Subsample ratio of columns when constructing each tree. For every tree constructed, the subsampling occurs once.
To learn more about XGBoost hyperparameters, see XGBoost Hyperparameters. The following code shows the seven hyperparameters and their ranges:
Next, we provide the configuration for the Hyperband strategy and the tuner object configuration using the SageMaker SDK. HyperbandStrategyConfig
can use two parameters: max_resource
(optional) for the maximum number of iterations to be used for a training job to achieve the objective, and min_resource
– the minimum number of iterations to be used by a training job before stopping the training. We use HyperbandStrategyConfig
to configure StrategyConfig
, which is later used by the tuning job definition. See the following code:
Now we create a HyperparameterTuner
object, to which we pass the following information:
- The XGBoost estimator, set to run with three instances
- The objective metric name and definition
- Our hyperparameter ranges
- Tuning resource configurations such as number of training jobs to run in total and how many training jobs can be run in parallel
- Hyperband settings (the strategy and configuration we configured in the last step)
- Early stopping (
early_stopping_type
) set toOff
Why do we set early stopping to Off? Training jobs can be stopped early when they are unlikely to improve the objective metric of the hyperparameter tuning job. This can help reduce compute time and avoid overfitting your model. However, Hyperband uses an advanced built-in mechanism to apply early stopping. Therefore, the parameter early_stopping_type
must be set to Off
when using the Hyperband internal early stopping feature. See the following code:
Finally, we start the automatic model tuning job by calling the fit method. If you want to launch the job in an asynchronous fashion, set wait
to False
. See the following code:
You can follow the job progress and summary on the SageMaker console. In the navigation pane, under Training, choose Hyperparameter tuning jobs, then choose the relevant tuning job. The following screenshot shows the tuning job with details on the training jobs’ status and performance.
When the tuning job is complete, we can review the results. In the notebook example, we show how to extract results using the SageMaker SDK. First, we examine how the tuning job increased model convergence. You can attach the HyperparameterTuner
object using the job name and call the describe method. The method returns a dictionary containing tuning job metadata and results.
In the following code, we retrieve the value of the best-performing training job, as measured by our objective metric (validation AUC):
The result is 0.78 in AUC on the validation set. That’s a significant improvement over the initial 0.63!
Next, let’s see how fast our training job ran. For that, we use the HyperparameterTuningJobAnalytics method in the SDK to fetch results about the tuning job, and read into a Pandas data frame for analysis and visualization:
Let’s see the average time a training job took with Hyperband strategy:
The average time took approximately 1 minute. This is consistent with the Hyperband strategy mechanism that stops underperforming training jobs early. In terms of cost, the tuning job charged us for a total of 30 minutes of training time. Without Hyperband early stopping, the total billable training duration was expected to be 90 minutes (30 jobs * 1 minutes per job * 3 instances per job). That is three times better in cost savings! Finally, we see that the tuning job ran 30 training jobs and took a total of 12 minutes. That is almost 50% less of the expected time (30 jobs/4 jobs in parallel * 3 minutes per job).
Conclusion
In this post, we described some observed convergence issues when training models with distributed environments. We saw that SageMaker AMT using Hyperband addressed the main concerns that optimizing data parallel distributed training introduced: convergence (which improved by more than 10%), operational efficiency (the tuning job took 50% less time than a sequential, non-optimized job would have taken) and cost-efficiency (30 vs. the 90 billable minutes of training job time). The following table summarizes our results:
Improvement Metric | No Tuning/Naive Model Tuning Implementation | SageMaker Hyperband Automatic Model Tuning | Measured Improvement |
Model Quality (Measured by validation AUC) |
0.63 | 0.78 | 15% |
Cost (Measured by billable training minutes) |
90 | 30 | 66% |
Operational efficiency (Measured by total running time) |
24 | 12 | 50% |
In order to fine-tune with regards to scaling (cluster size), you can repeat the tuning job with multiple cluster configurations and compare the results to find the optimal hyperparameters that satisfy speed and model accuracy.
We included the steps to achieve this in the last section of the notebook.
References
[1] Lian, Xiangru, et al. “Asynchronous decentralized parallel stochastic gradient descent.” International Conference on Machine Learning. PMLR, 2018. [2] Keskar, Nitish Shirish, et al. “On large-batch training for deep learning: Generalization gap and sharp minima.” arXiv preprint arXiv:1609.04836 (2016). [3] Dai, Wei, et al. “Toward understanding the impact of staleness in distributed machine learning.” arXiv preprint arXiv:1810.03264 (2018). [4] Dauphin, Yann N., et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.” Advances in neural information processing systems 27 (2014).About the Author
Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, hiking, and complaining about data preparation.
Bringing code analysis tools to Jupyter notebooks
Based on a survey of thousands of machine learning practitioners, a new CodeGuru extension addresses common problems, such as code cell execution order, incorrect API calls, and security.Read More
Pronunciation detection for Alexa’s new English-learning experience
Data augmentation, novel loss functions, and weakly supervised training enable a state-of-the art model for recognizing mispronunciations.Read More
Access private repos using the @remote decorator for Amazon SageMaker training workloads
As more and more customers are looking to put machine learning (ML) workloads in production, there is a large push in organizations to shorten the development lifecycle of ML code. Many organizations prefer writing their ML code in a production-ready style in the form of Python methods and classes as opposed to an exploratory style (writing code without using methods or classes) because this helps them ship production-ready code faster.
With Amazon SageMaker, you can use the @remote decorator to run a SageMaker training job simply by annotating your Python code with an @remote decorator. The SageMaker Python SDK will automatically translate your existing workspace environment and any associated data processing code and datasets into a SageMaker training job that runs on the SageMaker training platform.
Running a Python function locally often requires several dependencies, which may not come with the local Python runtime environment. You can install them via package and dependency management tools like pip or conda.
However, organizations operating in regulated industries like banking, insurance, and healthcare operate in environments that have strict data privacy and networking controls in place. These controls often mandate having no internet access available to any of their environments. The reason for such restriction is to have full control over egress and ingress traffic so they can reduce the chances of unscrupulous actors sending or receiving non-verified information through their network. It’s often also mandated to have such network isolation as part of the auditory and industrial compliance rules. When it comes to ML, this restricts data scientists from downloading any package from public repositories like PyPI, Anaconda, or Conda-Forge.
To provide data scientists access to the tools of their choice while also respecting the restrictions of the environment, organizations often set up their own private package repository hosted in their own environment. You can set up private package repositories on AWS in multiple ways:
- Using AWS CodeArtifact
- Using Amazon Simple Storage (Amazon S3)
- Hosting a repository on Amazon Elastic Compute Cloud (Amazon EC2)
In this post, we focus on the first option: using CodeArtifact.
Solution overview
The following architecture diagram shows the solution architecture.
The high-level steps to implement the solution are as follows
- Set up a virtual private cloud (VPC) with no internet access using an AWS CloudFormation template.
- Use a second CloudFormation template to set up CodeArtifact as a private PyPI repository and provide connectivity to the VPC, and set up an Amazon SageMaker Studio environment to use the private PyPI repository.
- Train a classification model based on the MNIST dataset using an @remote decorator from the open-source SageMaker Python SDK. All the dependencies will be downloaded from the private PyPI repository.
Note that using SageMaker Studio in this post is optional. You can choose to work in any integrated development environment (IDE) of your choice. You just need to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, refer to Configure the AWS CLI.
Prerequisites
You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account.
Set up a VPC with no internet connection
Create a new CloudFormation stack using the vpc.yaml template. This template creates the following resources:
- A VPC with two private subnets across two Availability Zones with no internet connectivity
- A Gateway VPC endpoint for accessing Amazon S3
- Interface VPC endpoints for SageMaker, CodeArtifact, and a few other services to allow the resources in the VPC to connect to AWS services via AWS PrivateLink
Provide a stack name, such as No-Internet
, and complete the stack creation process.
Wait for the stack creation process to complete.
Set up a private repository and SageMaker Studio using the VPC
The next step is to deploy another CloudFormation stack using the sagemaker_studio_codeartifact.yaml template. This template creates the following resources:
- A SageMaker domain connected to the VPC created in the previous step
- A CodeArtifact domain
- A CodeArtifact private repository connected to an upstream public PyPI repository
Provide a stack name and keep the default values or adjust the parameters for the CodeArtifact domain name, private repository name, user profile name for SageMaker Studio, and name for the upstream public PyPI repository. You also we need to provide the VPC stack name created in the previous step.
When the stack creation is complete, the SageMaker domain should be visible on the SageMaker console.
To verify there is no internet connection available in SageMaker Studio, launch SageMaker Studio. Choose File
, New
, and Terminal
to launch a terminal and try to curl any internet resource. It should fail to connect, as shown in the following screenshot.
Train an image classifier using an @remote decorator with the private PyPI repository
In this section, we use the @remote decorator to run a PyTorch training job that produces a MNIST image classification model. To achieve this, we set up a configuration file, develop the training script, and run the training code.
Set up a configuration file
We set up a config.yaml
file and provide the configurations needed to do the following:
- Run a SageMaker training job in the no-internet VPC created earlier
- Download the required packages by connecting to the private PyPI repository created earlier
The file looks like the following code:
The Dependencies
field contains the path to requirements.txt
, which contains all the dependencies needed. Note that all the dependencies will be downloaded from the private repository. The requirements.txt
file contains the following code:
The PreExecutionCommands
section contains the command to connect to the private PyPI repository. To get the CodeArtifact VPC endpoint URL, use the following code:
Generally, we get two VPC endpoints for CodeArtifact, and we can use any of them in the connection commands. For more details, refer to Use CodeArtifact from a VPC.
Additionally, configurations like execution role
, output location
, and VPC configurations
are provided in the config file. These configurations are needed to run the SageMaker training job. To know more about all the configurations supported, refer to Configuration file.
It’s not mandatory to use the config.yaml
file in order to work with the @remote decorator. This is just a cleaner way to supply all configurations to the @remote decorator. All the configs could also be supplied directly in the decorator arguments, but that reduces readability and maintainability of changes in the long run. Also, the config file can be created by an admin and shared with all the users in an environment.
Develop the training script
Next, we prepare the training code in simple Python files. We have divided the code into three files:
- load_data.py – Contains the code to download the MNIST dataset
- model.py – Contains the code for the neural network architecture for the model
- train.py – Contains the code for training the model by using load_data.py and model.py
In train.py
, we need to decorate the main training function as follows:
Now we’re ready to run the training code.
Run the training code with an @remote decorator
We can run the code from a terminal or from any executable prompt. In this post, we use a SageMaker Studio notebook cell to demonstrate this:
Running the preceding command triggers the training job. In the logs, we can see that it’s downloading the packages from the private PyPI repository.
This concludes the implementation of an @remote decorator working with a private repository in an environment with no internet access.
Clean up
To clean up the resources, follow the instructions in CLEANUP.md.
Conclusion
In this post, we learned how to effectively use the @remote decorator’s capabilities while still working in restrictive environments without any internet access. We also learned how can we integrate CodeArtifact private repository capabilities with the help of configuration file support in SageMaker. This solution makes iterative development much simpler and faster. Another added advantage is that you can still continue to write the training code in a more natural, object-oriented way and still use SageMaker capabilities to run training jobs on a remote cluster with minimal changes in your code. All the code shown as part of this post is available in the GitHub repository.
As a next step, we encourage you to check out the @remote decorator functionality and Python SDK API and use it in your choice of environment and IDE. Additional examples are available in the amazon-sagemaker-examples repository to get you started quickly. You can also check out the post Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes for more details.
About the author
Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.
A quick guide to Amazon’s 65-plus papers at this year’s ACL
Familiar topics such as question answering and natural-language understanding remain well represented, but a new concentration on language modeling and multimodal models reflect the spread of generative AI.Read More
Do large language models really need all those layers?
Finding that 70% of attention heads and 20% of feed-forward networks can be excised with minimal effect on in-context learning suggests that large language models are undertrained.Read More