Implement backup and recovery using an event-driven serverless architecture with Amazon SageMaker Studio

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for ML. It provides a single, web-based visual interface where you can perform all machine learning (ML) development steps required to build, train, tune, debug, deploy, and monitor models. It gives data scientists all the tools you need to take ML models from experimentation to production without leaving the IDE. Moreover, as of November 2022, Studio supports shared spaces to accelerate real-time collaboration and multiple Amazon SageMaker domains in a single AWS Region for each account.

There are two prevailing use cases for Studio domain backup and recovery. The first use case involves a customer business unit and project wanting a functionality to replicate data scientists’ artifacts and data files to any target domains and profiles at will. The second use case involves the replication only when the domain and profile are deleted due to conditions such as the change from a customer-managed key to an AWS-managed key or a change of onboarding from AWS Identity and Access Management (IAM) authentication (see Onboard to Amazon SageMaker Domain Using IAM) to AWS IAM Identity Center (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

This post mainly covers the second use case by presenting how to back up and recover users’ work when the user and space profiles are deleted and recreated, but we also provide the Python script to support the first use case.

When the user and space profiles are recreated in the existing Studio domain, a new ID of the profile directory will be created within the Studio Amazon Elastic File System (Amazon EFS) volume. As a result, the Studio users could lose access to the model artifacts and data files stored in their previous profile directory if they are deleted. Additionally, Studio domains don’t currently support mounting custom or additional EFS volumes. We recommend keeping the previous Studio EFS volume as a backup using RetentionPolicy in Studio.

Therefore, a proper recovery solution needs to be implemented to access the data from the previous directory in case of profile deletion or to recover files from a detached volume in case of domain deletion. Data scientists can minimize the potential impacts of deleting the domain and profiles if they frequently commit their code to the repository and utilize external storage for data access. However, having the capability to back up and recover the data scientist’s workspace is another layer to ensure their continuity of work, which may increase their productivity. Moreover, if you have tens and hundreds of Studio users, consider how to automate the recovery process to avoid mistakes and save costs and time. To solve this problem, we provide a solution to supplement Studio domain recovery.

This post explains the backup and recovery module and one approach to automate the process using an event-driven architecture. First, we demonstrate how to perform backup and recovery if you create a new Studio domain, user, and space profiles using AWS CloudFormation templates. Next, we explain the required steps to test our recovery solution using the existing domain and profiles without using our CloudFormation templates (you can use your own templates). Although this post focuses on a single domain setting, our solution works for multiple Studio domains as well. Finally, we have automated the provisioning of all resources using the AWS Serverless Application Model (AWS SAM), an open-source framework for building serverless applications.

Solution overview

The following diagram illustrates the high-level workflow of Studio domain backup and recovery with an event-driven architecture.

The event-driven app includes the following steps:

An Amazon CloudWatch events rule uses AWS CloudTrail to trackCreateUserProfile and CreateSpace API calls, trigger the rule, and invoke the AWS Lambda function.
The function updates the user table and appends items in the history table in Amazon DynamoDB. In addition, the database layer keeps track of the domain and profile name and file system mapping.

The following image shows the DynamoDB tables structure. The partition key and sort key in the studioUser table consist of the profile and domain name. The replication column holds the replication flag with true as the default value. In addition, bytes_written, bytes_file_transferred, total_duration_ms, and replication_status fields are populated when the replication completes successfully.

The database layer can be replaced by other services, such as Amazon Relational Database Service (Amazon RDS) or Amazon Simple Storage Service (Amazon S3). However, we chose DynamoDB because of the Amazon DynamoDB Streams feature.

DynamoDB Streams is enabled on the user table, and the Lambda function is set as a trigger and synchronously invoked when new stream records are available.
Another Lambda function triggers the process to restore the files using the user and space files restore tools.

The backup and recovery workflow includes the following steps:

The backup and recovery workflow consists of AWS Step Functions, integrated with other AWS services, including AWS DataSync, to orchestrate the recovery of the user and space files from the previous directory to a new directory between the same Studio domain EFS volume (profile recreation) or a new domain EFS volume (domain recreation). With the Step Functions Workflow Studio, the workflow can be implemented with no code (such as in this case) or low code for a more customized solution. The Step Functions state machine is invoked when the event-driven app detects the profile creation event. For each profile, the Step Functions state machine runs the DataSync task to copy all files from their previous directories to the new directory.

The following image is the actual graph of the Step Functions state machine. Note that the ListApp* step ensures the profile directories are populated in the Studio EFS volume before proceeding. Also, we implemented retry with exponential backoff to handle API throttle for DataSync CreateLocationEfs and CreateTask API calls.

When the users open their Studio, all the files from the respective directories from the previous directory will be available to continue their work. The DataSync job replicating one gigabyte of data from our experiment took approximately 1 minute.

The following are services that will be used as part of the solution:

AWS CloudFormation
AWS CloudTrail
Amazon CloudWatch
AWS DataSync
Amazon DynamoDB
AWS Identity and Access Management (IAM)
AWS Lambda
Amazon SageMaker Studio
Amazon Simple Queue Service (Amazon SQS)
AWS Step Functions
AWS Systems Manager

Prerequisites

To implement this solution, you must have the following prerequisites:

An AWS account if you don’t already have one. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
The AWS SAM CLI installed and configured.
Your AWS credentials set up.
Git installed.
Python 3.9.
A Studio profile and domain name combination that is unique across all Studio domains within a Region and account.
You need to use the existing Amazon VPC and S3 bucket to follow the deployment step.
Also, be aware of the service quota for the maximum number of DataSync tasks per account per Region (default is 100). You can request a quota increase to meet the number of replication tasks for your use case.

Refer to the AWS Regional Services List for service availability based on Region. Additionally, review Amazon SageMaker endpoints and quotas.

Set up a Studio profile recovery infrastructure

The following diagram shows the logical steps for a SageMaker administrator to set up the Studio user and space recovery infrastructure, which a single command can complete with our automated solution.

To set up the environment, clone the GitHub repo in the terminal:

git clone https://github.com/aws-samples/sagemaker-studio-efs-recovery-serverless.git && cd sagemaker-studio-efs-recovery-serverless

The following code shows the deployment script usage:

bash deploy.sh -h

Usage: deploy.sh [-n <stack_name>] [-v <vpc_id>] [-s <subnet_id>] [-b <s3_bucket>] [-r <aws_region>] [-d] Options: -n: specify stack name -v: specify your vpc id -s: specify subnet -b: specify s3 bucket name to store artifacts -r: specify aws region -d: whether to skip a creation of a new SageMaker Studio Domain (default: no)

To create a new Amazon SageMaker domain, run the following command. You need to specify which Amazon VPC and subnet you want to use. We use VPC only mode for the Studio deployment. If you don’t have any preference, you can use the default VPC and subnet. Also, specify any stack name, AWS Region, and S3 bucket name for AWS SAM to deploy the Lambda function:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

If you want to use an existing Studio domain, run the following command. Option -d yes will skip creating a new Studio domain:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

For the existing domains, the SageMaker administrator must also update the source and target Studio EFS security groups to allow connection to the user and space file restore tool. For example, to run the following command, you need to specify HomeEfsFileSystemId, the EFS file system ID, and SecurityGroupId used by the user and space file restore tool (we discuss this in more detail later in the post):

python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

User and space recovery logical flow

The following diagram shows the logical user and space recovery flow diagram for a SageMaker administrator to understand how the solution works, and no additional setup is required. If the profile (user or space) and domain are accidentally deleted, the EFS volume is detached but not deleted. A possible scenario is that we may want to revert the deletion by recreating a new domain and profiles. If the same profiles are being onboarded again, they may wish to access the files from their respective workspace in the detached volume. The recovery process is almost entirely automated; the only action required by the SageMaker administrator is to recreate the Studio domain and profiles using the same CloudFormation template. The rest of the steps are automated.

Optionally, if the SageMaker admin wants control over replication, run the following command to turn off replication for specific domains and profiles. This script updates the replication field given the domain and profile name in the table. Note that you need to run the script for the same user each time they get recreated.

python3 src/update-replication-flag.py --profile-name <profile_name> --domain-name <domain_name> --region <aws_region> --no-replication

The following optional step provides the solution for the first use case to allow replication to take place between the specified source file system to any target domain and profile name. If the SageMaker admin wants to replicate particular profile data to a different domain and a profile that doesn’t exist yet, run the following command. The script inserts the new domain and profile name with the specified source file system information. The subsequent profile creation will trigger the replication task. Note that you need to run add-security-group.py from the previous step to allow connection to the file restore tool.

python3 src/add-replication-target.py --src-profile-name <profile_name> --src-domain-name <domain_name> --target-profile-name <profile_name> --target-domain-name <domain_name> --region <aws_region>

In the following sections, we test two scenarios to confirm that the solution works as expected.

Create a new Studio domain

Our first test scenario assumes you are starting from scratch and want to create a new Studio domain and profiles in your environment using our templates. Then we deploy the Studio domain, user and space, backup and recovery workflow, and event app. The purpose of the first scenario is to confirm that the profile file is recovered in the new home directory automatically when the profile is deleted and recreated within the same Studio domain.

Complete the following steps:

To deploy the application, run the following command:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
1. <stack_name>-DemoBootstrap-*
2. <stack_name>-StepFunction-*
3. <stack_name>-EventApp-*
4. <stack_name>-StudioDomain-*
5. <stack_name>-StudioUser1-*
6. <stack_name>-StudioSpace-*

If the deployment failed in any stacks, check the error and resolve the issues. Then, proceed to the next step only if the problems are resolved.

On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
Select studioUser and choose Explore table items to confirm that items for user1 and space1 are populated in the table.
On the SageMaker console, choose Domains in the navigation pane.
Choose demo-myapp-dev-studio-domain.
On the User profiles tab, select user1 and choose Launch, and choose Studio to open the Studio for the user.

Note that Studio may take 10-15 minutes to load for the first time.

On the File menu, choose Terminal to launch a new terminal within Studio.
Run the following command in the terminal to create a file for testing:
```
echo "i don't want to lose access to this file" > user1.txt
```

Repeat these steps for space1 (choose Spaces in Step 7). Feel free to create a file of your choice.

Delete the Studio user user1 and space1 by removing the nested stacks <stack_name>-StudioUser1-* and <stack_name>-StudioSpace-* from the parent. Delete the stacks by commenting out the following code blocks from the AWS SAM template file, template.yaml. Make sure to save the file after the edit:

StudioUser1:
  Type: AWS::Serverless::Application
  Condition: CreateDomainCond
  DependsOn: StudioDomain
  Properties:
    Location: Infrastructure/Templates/sagemaker-studio-user.yaml
    Parameters:
      LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
      StudioUserProfileName: !Ref StudioUserProfileName1
      UID: !Ref UID
      Env: !Ref Env
      AppName: !Ref AppName
StudioSpace:
  Type: AWS::Serverless::Application
  Condition: CreateDomainCond
  DependsOn: StudioDomain
  Properties:
    Location: Infrastructure/Templates/sagemaker-studio-space.yaml
    Parameters:
      LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
      StudioSpaceName: !Ref StudioSpaceName
      UID: !Ref UID
      Env: !Ref Env
      AppName: !Ref AppName

Run the following command to deploy the stack with this change:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

Recreate the Studio profiles by adding back the stack back to the parent. Uncomment the code block from the previous step, save the file, and run the same command:
```
bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>
```

After a successful deployment, you can check the results.

On the AWS CloudFormation console, choose the stack <stack_name>-StepFunction-*
In the stack, choose the value for Physical ID of StepFunction in the Resources section.
Choose the most recent run and confirm its status in Graph view.

It should look like the following screenshot for the user profile replication. You can also check the other run to ensure the same for the space profile.

If you completed Steps 5–10, open the Studio domain for user1 and confirm that the user1.txt file is copied to the newly created directory.

It should not be visible in space1 directory, keeping the same file ownership.

Repeat this step for space1.
On the DataSync console, choose the most recent task ID.
Choose History and the most recent run ID.

This is another way to inspect the configurations and the run status of the DataSync task. As an example, the following screenshot shows the task result for user1 directory replication.

We only covered profile recreation in this scenario. However, our solution works in the same way for Studio domain recreation, and it can be tested by deleting and recreating the domain.

Use an existing Studio domain

Our second test scenario assumes you want to use the existing SageMaker domain and profiles in the environment. Therefore, we only deploy the backup and recovery workflow and the event app. Again, you can use your own Studio CloudFormation template or create them through the AWS CloudFormation console to follow along. Because we’re using the existing Studio domain, the solution will list the current user and space for all domains within the Region, which we call seeding.

Complete the following steps:

To deploy the application, run the following command:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
1. <stack_name>-DemoBootstrap-*
2. <stack_name>-StepFunction-*
3. <stack_name>-EventApp-*

If the deployment failed in any stacks, check the error and resolve the issues. Then, proceed to the next step only if the problems are resolved.

Verify the initial data seed has completed.
On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
Choose studioUser and choose Explore table items to confirm that items for the existing Studio domain are populated in the table.

Proceed to the next step only if the seed has completed successfully. If the tables aren’t populated, check the CloudWatch logs of the corresponding Lambda function. On the AWS CloudFormation console, choose the stack <stack_name>-EventApp-*, and choose the physical ID of DDBSeedLambda in the Resources section. Under Monitor, choose View CloudWatch Logs and check the logs for the most recent run to troubleshoot.

To update the EFS security group, first get the SecurityGroupId. We use the security group created by the CloudFormation template, which allows all traffic in the outbound connection. Run the following command:
```
echo "SecurityGroupId:" $(aws ssm get-parameter --name /network/vpc/sagemaker/securitygroups --region

<aws_region> --query 'Parameter.Value')
```

Get the HomeEfsFileSystemId, which is the ID of the Studio home EFS volume. Run the following command:

echo "HomeEfsFileSystemId:" $(aws sagemaker describe-domain --domain-id <domain_id> --region <aws_region> --query 'HomeEfsFileSystemId')

Finally, update the EFS security group by allowing inbounds from the security group shared with the DataSync task using port 2049. Run the following command:
```
python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>
```
Delete and recreate the Studio profiles of your choice using the same profile name.
Confirm the run status of the Step Functions state machine and recovery of the Studio profile directory by following the steps from the first scenario.

You can also test the Step Functions workflow manually with your choice of source and target inputs for replication (more details found in README.md in the GitHub repository).

Clean up

Run the following commands to clean up your resources:

sam delete --region <aws_region> --no-prompts --stack-name <stack_name>

Manually delete the SageMakerSecurityGroup after 20 minutes or so. Deletion of the Elastic Network Interface (ENI) can make the stack show as DELETE_IN_PROGRESS for some time, so we intentionally set the security group to be retained. Also, you need to disassociate that security group from the security group managed by SageMaker before you can delete it.

Conclusion

Studio is a powerful IDE that allows data scientists to quickly develop, train, test, and deploy models. This post discusses how to back up and recover the files stored in a data scientist’s home and shared space directory. We also demonstrated how an event-driven architecture can help automate the recovery process.

Our solution can help improve the resiliency of data scientists’ artifacts within Studio, leading to operational efficiency on the AWS Cloud. Also, the solution is modular, so you can use the necessary components and update them for your usage. For instance, the enhancement to this solution might be a cross-account replication. We hope that what we demonstrated in the post will be a helpful resource to support those ideas.

To get started with Studio, check out Amazon SageMaker for Data Scientists. Please send us feedback on the AWS forum for SageMaker or through your AWS support contacts. You can find other Studio examples in our GitHub repository.

About the Authors

Kenny Sato is a Machine Learning Engineer at AWS, guiding customers in architecting and implementing machine learning solutions. He received his master’s in Computer Engineering from Virginia Tech and is pursuing a PhD in Computer Science. In his spare time, you can find him in his backyard or out somewhere playing with his lovely daughters.

Gautam Nambiar is a DevOps Consultant with AWS. He is particularly interested in architecting and building automated solutions, MLOps pipelines, and creating reusable and secure DevOps best practice patterns. In his spare time, he likes playing and watching soccer.

Vedere AI