Configure cross-account access of Amazon Redshift clusters in Amazon SageMaker Studio using VPC peering

Configure cross-account access of Amazon Redshift clusters in Amazon SageMaker Studio using VPC peering

With cloud computing, as compute power and data became more available, machine learning (ML) is now making an impact across every industry and is a core part of every business and industry.

Amazon SageMaker Studio is the first fully integrated ML development environment (IDE) with a web-based visual interface. You can perform all ML development steps and have complete access, control, and visibility into each step required to build, train, and deploy models.

Amazon Redshift is a fully managed, fast, secure, and scalable cloud data warehouse. Organizations often want to use SageMaker Studio to get predictions from data stored in a data warehouse such as Amazon Redshift.

As described in the AWS Well-Architected Framework, separating workloads across accounts enables your organization to set common guardrails while isolating environments. This can be particularly useful for certain security requirements, as well as to simplify cost controls and monitoring between projects and teams. Organizations with a multi-account architecture typically have Amazon Redshift and SageMaker Studio in two separate AWS accounts. Also, Amazon Redshift and SageMaker Studio are typically configured in VPCs with private subnets to improve security and reduce the risk of unauthorized access as a best practice.

Amazon Redshift natively supports cross-account data sharing when RA3 node types are used. If you’re using any other Amazon Redshift node types, such as DS2 or DC2, you can use VPC peering to establish a cross-account connection between Amazon Redshift and SageMaker Studio.

In this post, we walk through step-by-step instructions to establish a cross-account connection to any Amazon Redshift node type (RA3, DC2, DS2) by connecting the Amazon Redshift cluster located in one AWS account to SageMaker Studio in another AWS account in the same Region using VPC peering.

Solution overview

We start with two AWS accounts: a producer account with the Amazon Redshift data warehouse, and a consumer account for Amazon SageMaker ML use cases that has SageMaker Studio set up. The following is a high-level overview of the workflow:

  1. Set up SageMaker Studio with VPCOnly mode in the consumer account. This prevents SageMaker from providing internet access to your studio notebooks. All SageMaker Studio traffic is through the specified VPC and subnets.
  2. Update your SageMaker Studio domain to turn on SourceIdentity to propagate the user profile name.
  3. Create an AWS Identity and Access Management (IAM) role in the Amazon Redshift producer account that the SageMaker Studio IAM role will assume to access Amazon Redshift.
  4. Update the SageMaker IAM execution role in the SageMaker Studio consumer account that SageMaker Studio will use to assume the role in the producer Amazon Redshift account.
  5. Set up a peering connection between VPCs in the Amazon Redshift producer account and SageMaker Studio consumer account.
  6. Query Amazon Redshift in SageMaker Studio in the consumer account.

The following diagram illustrates our solution architecture.

Prerequisites

The steps in this post assume that Amazon Redshift is launched in a private subnet in the Amazon Redshift producer account. Launching Amazon Redshift in a private subnet provides an additional layer of security and isolation compared to launching it in a public subnet because the private subnet is not directly accessible from the internet and more secure from external attacks.

To download public libraries, you must create a VPC and a private and public subnet in the SageMaker consumer account. Then launch a NAT gateway in the public subnet and add an internet gateway for SageMaker Studio in the private subnet to access the internet. For instructions on how to establish a connection to a private subnet, refer to How do I set up a NAT gateway for a private subnet in Amazon VPC?

Set up SageMaker Studio with VPCOnly mode in the consumer account

To create SageMaker Studio with VPCOnly mode, complete the following steps:

  1. On the SageMaker console, choose Studio in the navigation pane.
  2. Launch SageMaker Studio, choose Standard setup, and choose Configure.

If you’re already using AWS IAM Identity Center (successor to AWS Single Sign-On) for accessing your AWS accounts, you can use it for authentication. Otherwise, you can use IAM for authentication and use your existing federated roles.

  1. In the General settings section, select Create a new role.
  2. In the Create an IAM role section, optionally specify your Amazon Simple Storage Service (Amazon S3) buckets by selecting Any, Specific, or None, then choose Create role.

This creates a SageMaker execution role, such as AmazonSageMaker-ExecutionRole-00000000.

  1. Under Network and Storage Section, choose your VPC, subnet (private subnet), and security group that you created as a prerequisite.
  2. Select VPC Only, then choose Next.

Update your SageMaker Studio domain to turn on SourceIdentity to propagate the user profile name

SageMaker Studio is integrated with AWS CloudTrail to enable administrators to monitor and audit user activity and API calls from SageMaker Studio notebooks. You can configure SageMaker Studio to record the user identity (specifically, the user profile name) to monitor and audit user activity and API calls from SageMaker Studio notebooks in CloudTrail events.

To log specific user activity among several user profiles, we recommended that you turn on SourceIdentity to propagate the SageMaker Studio domain with the user profile name. This allows you to persist the user information into the session so you can attribute actions to a specific user. This attribute is also persisted over when you chain roles, so you can get fine-grained visibility into their actions in the producer account. As of the time this post was written, you can only configure this using the AWS Command Line Interface (AWS CLI) or any command line tool.

To update this configuration, all apps in the domain must be in the Stopped or Deleted state.

Use the following code to enable the propagation of the user profile name as the SourceIdentity:

update-domain
--domain-id <value>
[--default-user-settings <value>]
[--domain-settings-for-update "ExecutionRoleIdentityConfig=USER_PROFILE_NAME"]

This requires that you add sts:SetSourceIdentity in the trust relationship for your execution role.

Create an IAM role in the Amazon Redshift producer account that SageMaker Studio must assume to access Amazon Redshift

To create a role that SageMaker will assume to access Amazon Redshift, complete the following steps:

  1. Open the IAM console in the Amazon Redshift producer account.

  1. Choose Roles in the navigation pane, then choose Create role.

  1. On the Select trusted entity page, select Custom trust policy.
  2. Enter the following custom trust policy into the editor and provide your SageMaker consumer account ID and the SageMaker execution role that you created:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<<SageMaker-Consumer-Account-ID>>:role/service-role/AmazonSageMaker-ExecutionRole-XXXXXX"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:SetSourceIdentity"
           ]
            
        }
    ]
}

  1. Choose Next.
  2. On the Add required permissions page, choose Create policy.
  3. Add the following sample policy and make necessary edits based on your configuration.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GetRedshiftCredentials",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:<<redshift-region-name>>:<<REDSHIFT-PRODUCER-ACCOUNT-ID>>:dbname:<<redshift-cluster-name>>/<<redshift-db-name>>",
                "arn:aws:redshift:<<redshift-region-name>>:<<REDSHIFT-PRODUCER-ACCOUNT-ID>>:dbuser:<<redshift-cluster-name>>/${redshift:DbUser}",
                "arn:aws:redshift:<<redshift-region-name>>:<<REDSHIFT-PRODUCER-ACCOUNT-ID>>:cluster:<<redshift-cluster-name>>"
            ],
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:SourceIdentity}"
                }
            }
        },
        {
            "Sid": "DynamicUserCreation",
            "Effect": "Allow",
            "Action": "redshift:CreateClusterUser",
            "Resource": "arn:aws:redshift:<<redshift-region-name>>:<<REDSHIFT-PRODUCER-ACCOUNT-ID>>:dbuser:<<redshift-cluster-name>>/${redshift:DbUser}",
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:SourceIdentity}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "redshift:JoinGroup",
            "Resource": "arn:aws:redshift:<<redshift-region-name>>:<<REDSHIFT-PRODUCER-ACCOUNT-ID>>:dbgroup:<<redshift-cluster-name>>/*"
        },
        {
            "Sid": "DataAPIPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift-data:ExecuteStatement",
                "redshift-data:CancelStatement",
                "redshift-data:ListStatements",
                "redshift-data:GetStatementResult",
                "redshift-data:DescribeStatement",
                "redshift-data:ListDatabases",
                "redshift-data:ListSchemas",
                "redshift-data:ListTables",
                "redshift-data:DescribeTable"
            ],
            "Resource": "*"
        },
        {
            "Sid": "ReadPermissions",
            "Effect": "Allow",
            "Action": [
                "redshift:Describe*",
                "redshift:ViewQueriesInConsole"
            ],
            "Resource": "*"
        }
    ]
}

  1. Save the policy by adding a name, such as RedshiftROAPIUserAccess.

The SourceIdentity attribute is used to tie the identity of the original SageMaker Studio user to the Amazon Redshift database user. The actions by the user in the producer account can then be monitored using CloudTrail and Amazon Redshift database audit logs.

  1. On the Name, review, and create page, enter a role name, review the settings, and choose Create role.

Update the IAM role in the SageMaker consumer account that SageMaker Studio assumes in the Amazon Redshift producer account

To update the SageMaker execution role for it to assume the role that we just created, complete the following steps:

  1. Open the IAM console in the SageMaker consumer account.
  2. Choose Roles in the navigation pane, then choose the SageMaker execution role that we created (AmazonSageMaker-ExecutionRole-*).
  3. In the Permissions policy section, on the Add permissions menu, choose Create inline policy.

  1. In the editor, on the JSON tab, enter the following policy, where <StudioRedshiftRoleARN> is the ARN of the role you created in the Amazon Redshift producer account:
{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "<StudioRedshiftRoleARN>"
    }
}

You can get the ARN of the role created in the Amazon Redshift producer account on the IAM console, as shown in the following screenshot.

  1. Choose Review policy.
  2. For Name, enter a name for your policy.
  3. Choose Create policy.

Your permission policies should look similar to the following screenshot.

Set up a peering connection between the VPCs in the Amazon Redshift producer account and SageMaker Studio consumer account

To establish communication between the SageMaker Studio VPC and Amazon Redshift VPC, the two VPCs need to be peered using VPC peering. Complete the following steps to establish a connection:

  1. In either the Amazon Redshift or SageMaker account, open the Amazon VPC console.
  2. In the navigation pane, choose Peering connections, then choose Create peering connection.
  3. For Name, enter a name for your connection.
  4. Under Select a local VPC to peer with, choose a local VPC.
  5. Under Select another VPC to peer with, specify another VPC in the same Region and another account.
  6. Choose Create peering connection.

  1. Review the VPC peering connection and choose Accept request to activate.

After the VPC peering connection is successfully established, you create routes on both the SageMaker and Amazon Redshift VPCs to complete connectivity between them.

  1. In the SageMaker account, open the Amazon VPC console.
  2. Choose Route tables in the navigation pane, then choose the VPC that is associated with SageMaker and edit the routes.
  3. Add CIDR for the destination Amazon Redshift VPC and the target as the peering connection.
  4. Additionally, add a NAT gateway.
  5. Choose Save changes.

  1. In the Amazon Redshift account, open the Amazon VPC console.
  2. Choose Route tables in the navigation pane, then choose the VPC that is associated with Amazon Redshift and edit the routes.
  3. Add CIDR for the destination SageMaker VPC and the target as the peering connection.
  4. Additionally, add an internet gateway.
  5. Choose Save changes.

You can connect to SageMaker Studio from your VPC through an interface endpoint in your VPC instead of connecting over the internet. When you use a VPC interface endpoint, communication between your VPC and the SageMaker API or runtime is conducted entirely and securely within the AWS network.

  1. To create a VPC endpoint, in the SageMaker account, open the VPC console.
  2. Choose Endpoints in the navigation pane, then choose Create endpoint.
  3. Specify the SageMaker VPC, the respective subnets and appropriate security groups to allow inbound and outbound NFS traffic for your SageMaker notebooks domain, and choose Create VPC endpoint.

Query Amazon Redshift in SageMaker Studio in the consumer account

After all the networking has been successfully established, follow the steps in this section to connect to the Amazon Redshift cluster in the SageMaker Studio consumer account using the AWS SDK for pandas library:

  1. In SageMaker Studio, create a new notebook.
  2. If the AWS SDK for pandas package is not installed you can install it using the following:
!pip install awswrangler #AWS SDK for pandas

This installation is not persistent and will be lost if the KernelGateway App is deleted. Custom packages can be added as part of a Lifecycle Configuration.

  1. Enter the following code in the first cell and run the code. Replace RoleArn and region_name values based on your account settings:
import boto3
import awswrangler as wr
import pandas as pd
from datetime import datetime
import json
sts_client = boto3.client('sts')

# Call the assume_role method of the STSConnection object and pass the role
# ARN and a role session name.
assumed_role_object=sts_client.assume_role(
    RoleArn="arn:aws:iam::<<REDSHIFT-PRODUCER-ACCOUNT-ID>>:role/<<redshift-account-role>>",
    RoleSessionName="RedshiftSession"
)
credentials=assumed_role_object['Credentials']

# Use the temporary credentials that AssumeRole returns to make a 
# connection to Amazon S3  
redshift_session=boto3.Session(
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken'],
    region_name="<<redshift-region-name>>",
)
  1. Enter the following code in a new cell and run the code to get the current SageMaker user profile name:
def get_userprofile_name():
    metadata_file_path = '/opt/ml/metadata/resource-metadata.json'
    with open(metadata_file_path, 'r') as logs:
        metadata = json.load(logs)
    return metadata.get("UserProfileName")
  1. Enter the following code in a new cell and run the code:
con_redshift = wr.redshift.connect_temp(
    cluster_identifier="<<redshift-cluster-name>>",
    database="<<redshift-db-name>>",
    user=get_userprofile_name(),
    auto_create=True,
    db_groups=[<<list-redshift-user-group>>],
    boto3_session = redshift_session
)

To successfully query Amazon Redshift, your database administrator needs to assign the newly created user with the required read permissions within the Amazon Redshift cluster in the producer account.

  1. Enter the following code in a new cell, update the query to match your Amazon Redshift table, and run the cell. This should return the records successfully for further data processing and analysis.
df = wr.redshift.read_sql_query(
    sql="SELECT * FROM users",
    con=con_redshift
)

You can now start building your data transformations and analysis based on your business requirements.

Clean up

To clean up any resources to avoid incurring recurring costs, delete the SageMaker VPC endpoints, Amazon Redshift cluster, and SageMaker Studio apps, users, and domain. Also delete any S3 buckets and objects you created.

Conclusion

In this post, we showed how to establish a cross-account connection between private Amazon Redshift and SageMaker Studio VPCs in different accounts using VPC peering and access Amazon Redshift data in SageMaker Studio using IAM role chaining, while also logging the user identity when the user accessed Amazon Redshift from SageMaker Studio. With this solution, you eliminate the need to manually move data between accounts to access data. We also walked through how to access the Amazon Redshift cluster using the AWS SDK for pandas library in SageMaker Studio and prepare the data for your ML use cases.

To learn more about Amazon Redshift and SageMaker, refer to the Amazon Redshift Database Developer Guide and Amazon SageMaker Documentation.


About the Authors

Supriya Puragundla is a Senior Solutions Architect at AWS. She helps key customer accounts on their AI and ML journey. She is passionate about data-driven AI and the area of depth in machine learning.

Marc Karp is a Machine Learning Architect with the Amazon SageMaker team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Read More

Effectively solve distributed training convergence issues with Amazon SageMaker Hyperband Automatic Model Tuning

Effectively solve distributed training convergence issues with Amazon SageMaker Hyperband Automatic Model Tuning

Recent years have shown amazing growth in deep learning neural networks (DNNs). This growth can be seen in more accurate models and even opening new possibilities with generative AI: large language models (LLMs) that synthesize natural language, text-to-image generators, and more. These increased capabilities of DNNs come with the cost of having massive models that require significant computational resources in order to be trained. Distributed training addresses this problem with two techniques: data parallelism and model parallelism. Data parallelism is used to scale the training process over multiple nodes and workers, and model parallelism splits a model and fits them over the designated infrastructure. Amazon SageMaker distributed training jobs enable you with one click (or one API call) to set up a distributed compute cluster, train a model, save the result to Amazon Simple Storage Service (Amazon S3), and shut down the cluster when complete. Furthermore, SageMaker has continuously innovated in the distributed training space by launching features like heterogeneous clusters and distributed training libraries for data parallelism and model parallelism.

Efficient training on a distributed environment requires adjusting hyperparameters. A common example of good practice when training on multiple GPUs is to multiply batch (or mini-batch) size by the GPU number in order to keep the same batch size per GPU. However, adjusting hyperparameters often impacts model convergence. Therefore, distributed training needs to balance three factors: distribution, hyperparameters, and model accuracy.

In this post, we explore the effect of distributed training on convergence and how to use Amazon SageMaker Automatic Model Tuning to fine-tune model hyperparameters for distributed training using data parallelism.

The source code mentioned in this post can be found on the GitHub repository (an m5.xlarge instance is recommended).

Scale out training from a single to distributed environment

Data parallelism is a way to scale the training process to multiple compute resources and achieve faster training time. With data parallelism, data is partitioned among the compute nodes, and each node computes the gradients based on their partition and updates the model. These updates can be done using one or multiple parameter servers in an asynchronous, one-to-many, or all-to-all fashion. Another way can be to use an AllReduce algorithm. For example, in the ring-allreduce algorithm, each node communicates with only two of its neighboring nodes, thereby reducing the overall data transfers. To learn more about parameter servers and ring-allreduce, see Launching TensorFlow distributed training easily with Horovod or Parameter Servers in Amazon SageMaker. With regards to data partitioning, if there are n compute nodes, then each node should get a subset of the data, approximately 1/n in size.

To demonstrate the effect of scaling out training on model convergence, we run two simple experiments:

Each model training ran twice: on a single instance and distributed over multiple instances. For the DNN distributed training, in order to fully utilize the distributed processors, we multiplied the mini-batch size by the number of instances (four). The following table summarizes the setup and results.

Problem type Image classification Binary classification
Model DNN XGBoost
Instance ml.c4.xlarge ml.m5.2xlarge
Data set

MNIST

(Labeled images)

Direct Marketing
(tabular, numeric and vectorized categories)
Validation metric Accuracy AUC
Epocs/Rounds 20 150
Number of Instances 1 4 1 3
Distribution type N/A Parameter server N/A AllReduce
Training time (minutes) 8 3 3 1
Final Validation score 0.97 0.11 0.78 0.63

For both models, the training time was reduced almost linearly by the distribution factor. However, model convergence suffered a significant drop. This behavior is consistent for the two different models, the different compute instances, the different distribution methods, and different data types. So, why did distributing the training process affect model accuracy?

There are a number of theories that try to explain this effect:

  • When tensor updates are big in size, traffic between workers and the parameter server can get congested. Therefore, asynchronous parameter servers will suffer significantly worse convergence due to delays in weights updates [1].
  • Increasing batch size can lead to over-fitting and poor generalization, thereby reducing the validation accuracy [2].
  • When asynchronously updating model parameters, some DNNs might not be using the most recent updated model weights; therefore, they will be calculating gradients based on weights that are a few iterations behind. This leads to weight staleness [3] and can be caused by a number of reasons.
  • Some hyperparameters are model or optimizer specific. For example, the XGBoost official documentation says that the exact value for the tree_mode hyperparameter doesn’t support distributed training because XGBoost employs row splitting data distribution whereas the exact tree method works on a sorted column format.
  • Some researchers proposed that configuring a larger mini-batch may lead to gradients with less stochasticity. This can happen when the loss function contains local minima and saddle points and no change is made to step size, to optimization getting stuck in such local minima or saddle point [4].

Optimize for distributed training

Hyperparameter optimization (HPO) is the process of searching and selecting a set of hyperparameters that are optimal for a learning algorithm. SageMaker Automatic Model Tuning (AMT) provides HPO as a managed service by running multiple training jobs on the provided dataset. SageMaker AMT searches the ranges of hyperparameters that you specify and returns the best values, as measured by a metric that you choose. You can use SageMaker AMT with the built-in algorithms or use your custom algorithms and containers.

However, optimizing for distributed training differs from common HPO because instead of launching a single instance per training job, each job actually launches a cluster of instances. This means a greater impact on cost (especially if you consider costly GPU-accelerated instances, which are typical for DNN). In addition to AMT limits, you could possibly hit SageMaker account limits for concurrent number of training instances. Finally, launching clusters can introduce operational overhead due to longer starting time. SageMaker AMT has specific features to address these issues. Hyperband with early stopping ensures that well-performing hyperparameters configurations are fine-tuned and those that underperform are automatically stopped. This enables efficient use of training time and reduces unnecessary costs. Also, SageMaker AMT fully supports the use of Amazon EC2 Spot Instances, which can optimize the cost of training up to 90% over on-demand instances. With regards to long start times, SageMaker AMT automatically reuses training instances within each tuning job, thereby reducing the average startup time of each training job by 20 times. Additionally, you should follow AMT best practices, such as choosing the relevant hyperparameters, their appropriate ranges and scales, and the best number of concurrent training jobs, and setting a random seed to reproduce results.

In the next section, we see these features in action as we configure, run, and analyze an AMT job using the XGBoost example we discussed earlier.

Configure, run, and analyze a tuning job

As mentioned earlier, the source code can be found on the GitHub repo. In Steps 1–5, we download and prepare the data, create the xgb3 estimator (the distributed XGBoost estimator is set to use three instances), run the training jobs, and observe the results. In this section, we describe how to set up the tuning job for that estimator, assuming you already went through Steps 1–5.

A tuning job computes optimal hyperparameters for the training jobs it launches by using a metric to evaluate performance. You can configure your own metric, which SageMaker will parse based on regex you configure and emit to stdout, or use the metrics of SageMaker built-in algorithms. In this example, we use the built-in XGBoost objective metric, so we don’t need to configure a regex. To optimize for model convergence, we optimize based on the validation AUC metric:

objective_metric_name="validation:auc"

We tune seven hyperparameters:

  • num_round – Number of rounds for boosting during the training.
  • eta – Step size shrinkage used in updates to prevent overfitting.
  • alpha – L1 regularization term on weights.
  • min_child_weight – Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning.
  • max_depth – Maximum depth of a tree.
  • colsample_bylevel – Subsample ratio of columns for each split, in each level. This subsampling takes place once for every new depth level reached in a tree.
  • colsample_bytree – Subsample ratio of columns when constructing each tree. For every tree constructed, the subsampling occurs once.

To learn more about XGBoost hyperparameters, see XGBoost Hyperparameters. The following code shows the seven hyperparameters and their ranges:

hyperparameter_ranges = {
    "num_round": IntegerParameter(100, 200),
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
    "colsample_bylevel": ContinuousParameter(0, 1),
    "colsample_bytree": ContinuousParameter(0, 1),
}

Next, we provide the configuration for the Hyperband strategy and the tuner object configuration using the SageMaker SDK. HyperbandStrategyConfig can use two parameters: max_resource (optional) for the maximum number of iterations to be used for a training job to achieve the objective, and min_resource – the minimum number of iterations to be used by a training job before stopping the training. We use HyperbandStrategyConfig to configure StrategyConfig, which is later used by the tuning job definition. See the following code:

hsc = HyperbandStrategyConfig(max_resource=30, min_resource=1)
sc = StrategyConfig(hyperband_strategy_config=hsc)

Now we create a HyperparameterTuner object, to which we pass the following information:

  • The XGBoost estimator, set to run with three instances
  • The objective metric name and definition
  • Our hyperparameter ranges
  • Tuning resource configurations such as number of training jobs to run in total and how many training jobs can be run in parallel
  • Hyperband settings (the strategy and configuration we configured in the last step)
  • Early stopping (early_stopping_type) set to Off

Why do we set early stopping to Off? Training jobs can be stopped early when they are unlikely to improve the objective metric of the hyperparameter tuning job. This can help reduce compute time and avoid overfitting your model. However, Hyperband uses an advanced built-in mechanism to apply early stopping. Therefore, the parameter early_stopping_type must be set to Off when using the Hyperband internal early stopping feature. See the following code:

tuner = HyperparameterTuner(
    xgb3,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=4,
    strategy="Hyperband",
    early_stopping_type="Off",
    strategy_config=sc
)

Finally, we start the automatic model tuning job by calling the fit method. If you want to launch the job in an asynchronous fashion, set wait to False. See the following code:

tuner.fit(
{"train": s3_input_train, "validation": s3_input_validation},
include_cls_metadata=False,
wait=True,
)

You can follow the job progress and summary on the SageMaker console. In the navigation pane, under Training, choose Hyperparameter tuning jobs, then choose the relevant tuning job. The following screenshot shows the tuning job with details on the training jobs’ status and performance.

When the tuning job is complete, we can review the results. In the notebook example, we show how to extract results using the SageMaker SDK. First, we examine how the tuning job increased model convergence. You can attach the HyperparameterTuner object using the job name and call the describe method. The method returns a dictionary containing tuning job metadata and results.

In the following code, we retrieve the value of the best-performing training job, as measured by our objective metric (validation AUC):

tuner = HyperparameterTuner.attach(tuning_job_name=tuning_job_name)
tuner.describe()["BestTrainingJob"]["FinalHyperParameterTuningJobObjectiveMetric"]["Value"]

The result is 0.78 in AUC on the validation set. That’s a significant improvement over the initial 0.63!

Next, let’s see how fast our training job ran. For that, we use the HyperparameterTuningJobAnalytics method in the SDK to fetch results about the tuning job, and read into a Pandas data frame for analysis and visualization:

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
full_df = tuner_analytics.dataframe()
full_df.sort_values(by=["FinalObjectiveValue"], ascending=False).head()

Let’s see the average time a training job took with Hyperband strategy:

full_df["TrainingElapsedTimeSeconds"].mean()

The average time took approximately 1 minute. This is consistent with the Hyperband strategy mechanism that stops underperforming training jobs early. In terms of cost, the tuning job charged us for a total of 30 minutes of training time. Without Hyperband early stopping, the total billable training duration was expected to be 90 minutes (30 jobs * 1 minutes per job * 3 instances per job). That is three times better in cost savings! Finally, we see that the tuning job ran 30 training jobs and took a total of 12 minutes. That is almost 50% less of the expected time (30 jobs/4 jobs in parallel * 3 minutes per job).

Conclusion

In this post, we described some observed convergence issues when training models with distributed environments. We saw that SageMaker AMT using Hyperband addressed the main concerns that optimizing data parallel distributed training introduced: convergence (which improved by more than 10%), operational efficiency (the tuning job took 50% less time than a sequential, non-optimized job would have taken) and cost-efficiency (30 vs. the 90 billable minutes of training job time). The following table summarizes our results:

Improvement Metric No Tuning/Naive Model Tuning Implementation SageMaker Hyperband Automatic Model Tuning Measured Improvement
Model Quality
(Measured by validation AUC)
0.63 0.78 15%
Cost
(Measured by billable training minutes)
90 30 66%
Operational efficiency
(Measured by total running time)
24 12 50%

In order to fine-tune with regards to scaling (cluster size), you can repeat the tuning job with multiple cluster configurations and compare the results to find the optimal hyperparameters that satisfy speed and model accuracy.

We included the steps to achieve this in the last section of the notebook.

References

[1] Lian, Xiangru, et al. “Asynchronous decentralized parallel stochastic gradient descent.” International Conference on Machine Learning. PMLR, 2018.

[2] Keskar, Nitish Shirish, et al. “On large-batch training for deep learning: Generalization gap and sharp minima.” arXiv preprint arXiv:1609.04836 (2016).

[3] Dai, Wei, et al. “Toward understanding the impact of staleness in distributed machine learning.” arXiv preprint arXiv:1810.03264 (2018).

[4] Dauphin, Yann N., et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.” Advances in neural information processing systems 27 (2014).


About the Author

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, hiking, and complaining about data preparation.

Read More

Access private repos using the @remote decorator for Amazon SageMaker training workloads

Access private repos using the @remote decorator for Amazon SageMaker training workloads

As more and more customers are looking to put machine learning (ML) workloads in production, there is a large push in organizations to shorten the development lifecycle of ML code. Many organizations prefer writing their ML code in a production-ready style in the form of Python methods and classes as opposed to an exploratory style (writing code without using methods or classes) because this helps them ship production-ready code faster.

With Amazon SageMaker, you can use the @remote decorator to run a SageMaker training job simply by annotating your Python code with an @remote decorator. The SageMaker Python SDK will automatically translate your existing workspace environment and any associated data processing code and datasets into a SageMaker training job that runs on the SageMaker training platform.

Running a Python function locally often requires several dependencies, which may not come with the local Python runtime environment. You can install them via package and dependency management tools like pip or conda.

However, organizations operating in regulated industries like banking, insurance, and healthcare operate in environments that have strict data privacy and networking controls in place. These controls often mandate having no internet access available to any of their environments. The reason for such restriction is to have full control over egress and ingress traffic so they can reduce the chances of unscrupulous actors sending or receiving non-verified information through their network. It’s often also mandated to have such network isolation as part of the auditory and industrial compliance rules. When it comes to ML, this restricts data scientists from downloading any package from public repositories like PyPI, Anaconda, or Conda-Forge.

To provide data scientists access to the tools of their choice while also respecting the restrictions of the environment, organizations often set up their own private package repository hosted in their own environment. You can set up private package repositories on AWS in multiple ways:

In this post, we focus on the first option: using CodeArtifact.

Solution overview

The following architecture diagram shows the solution architecture.

Solution-Architecture-vpc-no-internet

The high-level steps to implement the solution are as follows

  • Set up a virtual private cloud (VPC) with no internet access using an AWS CloudFormation template.
  • Use a second CloudFormation template to set up CodeArtifact as a private PyPI repository and provide connectivity to the VPC, and set up an Amazon SageMaker Studio environment to use the private PyPI repository.
  • Train a classification model based on the MNIST dataset using an @remote decorator from the open-source SageMaker Python SDK. All the dependencies will be downloaded from the private PyPI repository.

Note that using SageMaker Studio in this post is optional. You can choose to work in any integrated development environment (IDE) of your choice. You just need to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, refer to Configure the AWS CLI.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account.

Set up a VPC with no internet connection

Create a new CloudFormation stack using the vpc.yaml template. This template creates the following resources:

  • A VPC with two private subnets across two Availability Zones with no internet connectivity
  • A Gateway VPC endpoint for accessing Amazon S3
  • Interface VPC endpoints for SageMaker, CodeArtifact, and a few other services to allow the resources in the VPC to connect to AWS services via AWS PrivateLink

Provide a stack name, such as No-Internet, and complete the stack creation process.

vpc-no-internet-stack

Wait for the stack creation process to complete.

Set up a private repository and SageMaker Studio using the VPC

The next step is to deploy another CloudFormation stack using the sagemaker_studio_codeartifact.yaml template. This template creates the following resources:

Provide a stack name and keep the default values or adjust the parameters for the CodeArtifact domain name, private repository name, user profile name for SageMaker Studio, and name for the upstream public PyPI repository. You also we need to provide the VPC stack name created in the previous step.

Studio-CodeArtifact-stack

When the stack creation is complete, the SageMaker domain should be visible on the SageMaker console.

studio-domain

To verify there is no internet connection available in SageMaker Studio, launch SageMaker Studio. Choose File, New, and Terminal to launch a terminal and try to curl any internet resource. It should fail to connect, as shown in the following screenshot.

terminal-showing-no-internet

Train an image classifier using an @remote decorator with the private PyPI repository

In this section, we use the @remote decorator to run a PyTorch training job that produces a MNIST image classification model. To achieve this, we set up a configuration file, develop the training script, and run the training code.

Set up a configuration file

We set up a config.yaml file and provide the configurations needed to do the following:

  • Run a SageMaker training job in the no-internet VPC created earlier
  • Download the required packages by connecting to the private PyPI repository created earlier

The file looks like the following code:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: '../config/requirements.txt'
        InstanceType: 'ml.m5.xlarge'
        PreExecutionCommands:
            - 'aws codeartifact login --tool pip --domain <domain-name> --domain-owner <AWS account number> --repository <private repository name> --endpoint-url <VPC-endpoint-url-prefixed with https://>
        RoleArn: '<execution role ARN for running training job>'
        S3RootUri: '<s3 bucket to store the job output>'
        VpcConfig:
            SecurityGroupIds: 
            - '<security group id used by SageMaker Studio>'
            Subnets: 
            - '<VPC subnet id 1>'
            - '<VPC subnet id 2>'

The Dependencies field contains the path to requirements.txt, which contains all the dependencies needed. Note that all the dependencies will be downloaded from the private repository. The requirements.txt file contains the following code:

torch
torchvision
sagemaker>=2.156.0,<3

The PreExecutionCommands section contains the command to connect to the private PyPI repository. To get the CodeArtifact VPC endpoint URL, use the following code:

response = ec2.describe_vpc_endpoints(
    Filters=[
        {
            'Name': 'service-name',
            'Values': [
                f'com.amazonaws.{boto3_session.region_name}.codeartifact.api'
            ]
        },
    ]
)

code_artifact_api_vpc_endpoint = response['VpcEndpoints'][0]['DnsEntries'][0]['DnsName']

endpoint_url = f'https://{code_artifact_api_vpc_endpoint}'
endpoint_url

Generally, we get two VPC endpoints for CodeArtifact, and we can use any of them in the connection commands. For more details, refer to Use CodeArtifact from a VPC.

Additionally, configurations like execution role, output location, and VPC configurations are provided in the config file. These configurations are needed to run the SageMaker training job. To know more about all the configurations supported, refer to Configuration file.

It’s not mandatory to use the config.yaml file in order to work with the @remote decorator. This is just a cleaner way to supply all configurations to the @remote decorator. All the configs could also be supplied directly in the decorator arguments, but that reduces readability and maintainability of changes in the long run. Also, the config file can be created by an admin and shared with all the users in an environment.

Develop the training script

Next, we prepare the training code in simple Python files. We have divided the code into three files:

  • load_data.py – Contains the code to download the MNIST dataset
  • model.py – Contains the code for the neural network architecture for the model
  • train.py – Contains the code for training the model by using load_data.py and model.py

In train.py, we need to decorate the main training function as follows:

@remote(include_local_workdir=True)
def perform_train(train_data,
                  test_data,
                  *,
                  batch_size: int = 64,
                  test_batch_size: int = 1000,
                  epochs: int = 3,
                  lr: float = 1.0,
                  gamma: float = 0.7,
                  no_cuda: bool = True,
                  no_mps: bool = True,
                  dry_run: bool = False,
                  seed: int = 1,
                  log_interval: int = 10,
                  ):
    # pytorch native training code........

Now we’re ready to run the training code.

Run the training code with an @remote decorator

We can run the code from a terminal or from any executable prompt. In this post, we use a SageMaker Studio notebook cell to demonstrate this:

!python ./train.py

Running the preceding command triggers the training job. In the logs, we can see that it’s downloading the packages from the private PyPI repository.

training-job-logs

This concludes the implementation of an @remote decorator working with a private repository in an environment with no internet access.

Clean up

To clean up the resources, follow the instructions in CLEANUP.md.

Conclusion

In this post, we learned how to effectively use the @remote decorator’s capabilities while still working in restrictive environments without any internet access. We also learned how can we integrate CodeArtifact private repository capabilities with the help of configuration file support in SageMaker. This solution makes iterative development much simpler and faster. Another added advantage is that you can still continue to write the training code in a more natural, object-oriented way and still use SageMaker capabilities to run training jobs on a remote cluster with minimal changes in your code. All the code shown as part of this post is available in the GitHub repository.

As a next step, we encourage you to check out the @remote decorator functionality and Python SDK API and use it in your choice of environment and IDE. Additional examples are available in the amazon-sagemaker-examples repository to get you started quickly. You can also check out the post Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes for more details.


About the author

Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

Read More

Integrate SaaS platforms with Amazon SageMaker to enable ML-powered applications

Integrate SaaS platforms with Amazon SageMaker to enable ML-powered applications

Amazon SageMaker is an end-to-end machine learning (ML) platform with wide-ranging features to ingest, transform, and measure bias in data, and train, deploy, and manage models in production with best-in-class compute and services such as Amazon SageMaker Data Wrangler, Amazon SageMaker Studio, Amazon SageMaker Canvas, Amazon SageMaker Model Registry, Amazon SageMaker Feature Store, Amazon SageMaker Pipelines, Amazon SageMaker Model Monitor, and Amazon SageMaker Clarify. Many organizations choose SageMaker as their ML platform because it provides a common set of tools for developers and data scientists. A number of AWS independent software vendor (ISV) partners have already built integrations for users of their software as a service (SaaS) platforms to utilize SageMaker and its various features, including training, deployment, and the model registry.

In this post, we cover the benefits for SaaS platforms to integrate with SageMaker, the range of possible integrations, and the process for developing these integrations. We also deep dive into the most common architectures and AWS resources to facilitate these integrations. This is intended to accelerate time-to-market for ISV partners and other SaaS providers building similar integrations and inspire customers who are users of SaaS platforms to partner with SaaS providers on these integrations.

Benefits of integrating with SageMaker

There are a number of benefits for SaaS providers to integrate their SaaS platforms with SageMaker:

  • Users of the SaaS platform can take advantage of a comprehensive ML platform in SageMaker
  • Users can build ML models with data that is in or outside of the SaaS platform and exploit these ML models
  • It provides users with a seamless experience between the SaaS platform and SageMaker
  • Users can utilize foundation models available in Amazon SageMaker JumpStart to build generative AI applications
  • Organizations can standardize on SageMaker
  • SaaS providers can focus on their core functionality and offer SageMaker for ML model development
  • It equips SaaS providers with a basis to build joint solutions and go to market with AWS

SageMaker overview and integration options

SageMaker has tools for every step of the ML lifecycle. SaaS platforms can integrate with SageMaker across the ML lifecycle from data labeling and preparation to model training, hosting, monitoring, and managing models with various components, as shown in the following figure. Depending on the needs, any and all parts of the ML lifecycle can be run in either the customer AWS account or SaaS AWS account, and data and models can be shared across accounts using AWS Identity and Access Management (IAM) policies or third-party user-based access tools. This flexibility in the integration makes SageMaker an ideal platform for customers and SaaS providers to standardize on.

SageMaker overview

Integration process and architectures

In this section, we break the integration process into four main stages and cover the common architectures. Note that there can be other integration points in addition to these, but those are less common.

  • Data access – How data that is in the SaaS platform is accessed from SageMaker
  • Model training – How the model is trained
  • Model deployment and artifacts – Where the model is deployed and what artifacts are produced
  • Model inference – How the inference happens in the SaaS platform

The diagrams in the following sections assume SageMaker is running in the customer AWS account. Most of the options explained are also applicable if SageMaker is running in the SaaS AWS account. In some cases, an ISV may deploy their software in the customer AWS account. This is usually in a dedicated customer AWS account, meaning there still needs to be cross-account access to the customer AWS account where SageMaker is running.

There are a few different ways in which authentication across AWS accounts can be achieved when data in the SaaS platform is accessed from SageMaker and when the ML model is invoked from the SaaS platform. The recommended method is to use IAM roles. An alternative is to use AWS access keys consisting of an access key ID and secret access key.

Data access

There are multiple options on how data that is in the SaaS platform can be accessed from SageMaker. Data can either be accessed from a SageMaker notebook, SageMaker Data Wrangler, where users can prepare data for ML, or SageMaker Canvas. The most common data access options are:

  • SageMaker Data Wrangler built-in connector – The SageMaker Data Wrangler connector enables data to be imported from a SaaS platform to be prepared for ML model training. The connector is developed jointly by AWS and the SaaS provider. Current SaaS platform connectors include Databricks and Snowflake.
  • Amazon Athena Federated Query for the SaaS platformFederated queries enable users to query the platform from a SageMaker notebook via Amazon Athena using a custom connector that is developed by the SaaS provider.
  • Amazon AppFlow – With Amazon AppFlow, you can use a custom connector to extract data into Amazon Simple Storage Service (Amazon S3) which subsequently can be accessed from SageMaker. The connector for a SaaS platform can be developed by AWS or the SaaS provider. The open-source Custom Connector SDK enables the development of a private, shared, or public connector using Python or Java.
  • SaaS platform SDK – If the SaaS platform has an SDK (Software Development Kit), such as a Python SDK, this can be used to access data directly from a SageMaker notebook.
  • Other options – In addition to these, there can be other options depending on whether the SaaS provider exposes their data via APIs, files or an agent. The agent can be installed on Amazon Elastic Compute Cloud (Amazon EC2) or AWS Lambda. Alternatively, a service such as AWS Glue or a third-party extract, transform, and load (ETL) tool can be used for data transfer.

The following diagram illustrates the architecture for data access options.

Data access

Model training

The model can be trained in SageMaker Studio by a data scientist, using Amazon SageMaker Autopilot by a non-data scientist, or in SageMaker Canvas by a business analyst. SageMaker Autopilot takes away the heavy lifting of building ML models, including feature engineering, algorithm selection, and hyperparameter settings, and it is also relatively straightforward to integrate directly into a SaaS platform. SageMaker Canvas provides a no-code visual interface for training ML models.

In addition, Data scientists can use pre-trained models available in SageMaker JumpStart, including foundation models from sources such as Alexa, AI21 Labs, Hugging Face, and Stability AI, and tune them for their own generative AI use cases.

Alternatively, the model can be trained in a third-party or partner-provided tool, service, and infrastructure, including on-premises resources, provided the model artifacts are accessible and readable.

The following diagram illustrates these options.

Model training

Model deployment and artifacts

After you have trained and tested the model, you can either deploy it to a SageMaker model endpoint in the customer account, or export it from SageMaker and import it into the SaaS platform storage. The model can be stored and imported in standard formats supported by the common ML frameworks, such as pickle, joblib, and ONNX (Open Neural Network Exchange).

If the ML model is deployed to a SageMaker model endpoint, additional model metadata can be stored in the SageMaker Model Registry, SageMaker Model Cards, or in a file in an S3 bucket. This can be the model version, model inputs and outputs, model metrics, model creation date, inference specification, data lineage information, and more. Where there isn’t a property available in the model package, the data can be stored as custom metadata or in an S3 file.

Creating such metadata can help SaaS providers manage the end-to-end lifecycle of the ML model more effectively. This information can be synced to the model log in the SaaS platform and used to track changes and updates to the ML model. Subsequently, this log can be used to determine whether to refresh downstream data and applications that use that ML model in the SaaS platform.

The following diagram illustrates this architecture.

Model deployment and artifacts

Model inference

SageMaker offers four options for ML model inference: real-time inference, serverless inference, asynchronous inference, and batch transform. For the first three, the model is deployed to a SageMaker model endpoint and the SaaS platform invokes the model using the AWS SDKs. The recommended option is to use the Python SDK. The inference pattern for each of these is similar in that the predictor’s predict() or predict_async() methods are used. Cross-account access can be achieved using role-based access.

It’s also possible to seal the backend with Amazon API Gateway, which calls the endpoint via a Lambda function that runs in a protected private network.

For batch transform, data from the SaaS platform first needs to be exported in batch into an S3 bucket in the customer AWS account, then the inference is done on this data in batch. The inference is done by first creating a transformer job or object, and then calling the transform() method with the S3 location of the data. Results are imported back into the SaaS platform in batch as a dataset, and joined to other datasets in the platform as part of a batch pipeline job.

Another option for inference is to do it directly in the SaaS account compute cluster. This would be the case when the model has been imported into the SaaS platform. In this case, SaaS providers can choose from a range of EC2 instances that are optimized for ML inference.

The following diagram illustrates these options.

Model inference

Example integrations

Several ISVs have built integrations between their SaaS platforms and SageMaker. To learn more about some example integrations, refer to the following:

Conclusion

In this post, we explained why and how SaaS providers should integrate SageMaker with their SaaS platforms by breaking the process into four parts and covering the common integration architectures. SaaS providers looking to build an integration with SageMaker can utilize these architectures. If there are any custom requirements beyond what has been covered in this post, including with other SageMaker components, get in touch with your AWS account teams. Once the integration has been built and validated, ISV partners can join the AWS Service Ready Program for SageMaker and unlock a variety of benefits.

We also ask customers who are users of SaaS platforms to register their interest in an integration with Amazon SageMaker with their AWS account teams, as this can help inspire and progress the development for SaaS providers.


About the Authors

Mehmet Bakkaloglu is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML and ISV partners.

Raj Kadiyala is a Principal AI/ML Evangelist at AWS.

Read More