Organize machine learning development using shared spaces in SageMaker Studio for real-time collaboration

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all ML development steps, including preparing data and building, training, and deploying models.

Within an Amazon SageMaker Domain, users can provision a personal Amazon SageMaker Studio IDE application, which runs a free JupyterServer with built-in integrations to examine Amazon SageMaker Experiments, orchestrate Amazon SageMaker Pipelines, and much more. Users only pay for the flexible compute on their notebook kernels. These personal applications automatically mount the respective user's private Amazon Elastic File System (Amazon EFS) home directory so they can keep code, data, and other files isolated from other users. Amazon SageMaker Studio already supports sharing notebooks between private applications, but the asynchronous mechanism can slow down the iteration process.

Now with shared spaces in Amazon SageMaker Studio, users can organize collaborative ML endeavors and initiatives by creating a shared IDE application that they access with their own Amazon SageMaker user profile. Data workers collaborating in a shared space get access to an Amazon SageMaker Studio environment where they can read, edit, and share their notebooks in real time, which gives them the quickest path to start iterating with their peers on new ideas. Data workers can even collaborate on the same notebook concurrently using real-time collaboration capabilities. The notebook indicates each co-editing user with a different cursor that shows their respective user profile name.

Shared spaces in SageMaker Studio automatically tag resources, such as Training jobs, Processing jobs, Experiments, Pipelines, and Model Registry entries created within the scope of a space, with their respective sagemaker:space-arn. That tag is used to filter resources within the Amazon SageMaker Studio user interface (UI) so users are only presented with the SageMaker Experiments, Pipelines, and other resources that are pertinent to their ML endeavor.
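
Because these resources carry the sagemaker:space-arn tag, you can also enumerate everything that belongs to a space programmatically. The following is a minimal sketch using the AWS Resource Groups Tagging API through boto3; the space ARN is a placeholder you would replace with your own.

import boto3

# Placeholder: replace with the ARN of your shared space
space_arn = "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:space/<DOMAIN-ID>/<SPACE-NAME>"

tagging = boto3.client("resourcegroupstaggingapi")

# Page through every resource tagged with this space's ARN
paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate(
    TagFilters=[{"Key": "sagemaker:space-arn", "Values": [space_arn]}]
):
    for resource in page["ResourceTagMappingList"]:
        print(resource["ResourceARN"])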

Solution overview

Since shared spaces automatically tag resources, administrators can easily monitor costs associated with an ML endeavor and plan budgets using tools such as AWS Budgets and AWS Cost Explorer. As an administrator, you only need to attach a cost allocation tag for sagemaker:space-arn.

Once that’s complete, you can use AWS Cost Explorer to identify how much individual ML projects are costing your organization.
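
As an illustration, once sagemaker:space-arn has been activated as a cost allocation tag, a query along the following lines groups cost by space. This is a minimal sketch using the Cost Explorer API through boto3; the time window is a placeholder.

import boto3

ce = boto3.client("ce")

# Placeholder time window; adjust to the billing period you want to analyze
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-11-01", "End": "2022-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "sagemaker:space-arn"}],
)

# Print the cost attributed to each space ARN
for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])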

Get started with shared spaces in Amazon SageMaker Studio

In this section, we walk through the typical workflow for creating and using shared spaces in Amazon SageMaker Studio.

Create a shared space in Amazon SageMaker Studio

You can use the Amazon SageMaker Console or the AWS Command Line Interface (AWS CLI) to add support for spaces to an existing domain. For the most up-to-date information, check Create a shared space. Shared spaces only work with a JupyterLab 3 SageMaker Studio image and with SageMaker Domains that use AWS Identity and Access Management (AWS IAM) authentication.

Console creation

To create a space within an Amazon SageMaker Domain, you first need to set a space default execution role. From the Domain details page, select the Domain settings tab and choose Edit. Then you can set a space default execution role, which only needs to be done once per Domain, as shown in the following diagram:

Next, you can go to the Space management tab within your domain and select the Create button, as shown in the following diagram:

AWS CLI creation

You can also set a default Domain space execution role from the AWS CLI. To determine your Region's JupyterLab 3 image ARN, check Setting a default JupyterLab version.

aws --region <REGION> \
sagemaker update-domain \
--domain-id <DOMAIN-ID> \
--default-space-settings "ExecutionRole=<YOUR-SAGEMAKER-EXECUTION-ROLE-ARN>"

Once that’s been completed for your Domain, you can create a shared space from the CLI.

aws --region <REGION> \
sagemaker create-space \
--domain-id <DOMAIN-ID> \
--space-name <SPACE-NAME>

Launch a shared space in Amazon SageMaker Studio

Users can launch a shared space by selecting the Launch button next to their user profile within the AWS Console for their Amazon SageMaker Domain.

After selecting Spaces under the Collaborative section, select which space to launch:

Alternatively, users can generate a pre-signed URL to launch a space through the AWS CLI:

aws sagemaker create-presigned-domain-url \
--region <REGION> \
--domain-id <DOMAIN-ID> \
--space-name <SPACE-NAME> \
--user-profile-name <USER-PROFILE-NAME>

Real-time collaboration

Once the Amazon SageMaker Studio shared space IDE has loaded, users can select the Collaborators tab on the left panel to see which users are actively working in the space and in which notebook. If more than one person is working on the same notebook, each collaborator sees a cursor labeled with the other users' profile names where they are editing:

In the following screenshot, you can see the different user experiences for someone editing and viewing the same notebook:

Conclusion

In this post, we showed you how shared spaces in SageMaker Studio add a real-time collaborative IDE experience to Amazon SageMaker Studio. Automated tagging helps users scope and filter their Amazon SageMaker resources, including experiments, pipelines, and model registry entries, to maximize productivity. Additionally, administrators can use these applied tags to monitor the costs associated with a given space and set appropriate budgets using AWS Cost Explorer and AWS Budgets.

Accelerate your team’s collaboration today by setting up shared spaces in Amazon SageMaker Studio for your specific machine learning endeavors!


About the authors

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor/maintainer and is the special interest group lead for TensorFlow Add-ons.

Han Zhang is a Senior Software Engineer at Amazon Web Services. She is part of the launch team for Amazon SageMaker Notebooks and Amazon SageMaker Studio, and has been focusing on building secure machine learning environments for customers. In her spare time, she enjoys hiking and skiing in the Pacific Northwest.

Arkaprava De is a Senior Software Engineer at AWS. He has been at Amazon for over 7 years and is currently working on improving the Amazon SageMaker Studio IDE experience. You can find him on LinkedIn.

Kunal Jha is a Senior Product Manager at AWS. He is focused on building Amazon SageMaker Studio as the IDE of choice for all ML development steps. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can find him on LinkedIn.

Read More

Minimize the production impact of ML model updates with Amazon SageMaker shadow testing

Amazon SageMaker now allows you to compare the performance of a new version of a model serving stack with the currently deployed version prior to a full production rollout using a deployment safety practice known as shadow testing. Shadow testing can help you identify potential configuration errors and performance issues before they impact end-users. With SageMaker, you don’t need to invest in building your shadow testing infrastructure, allowing you to focus on model development. SageMaker takes care of deploying the new version alongside the current version serving production requests, routing a portion of requests to the shadow version. You can then compare the performance of the two versions using metrics such as latency and error rate. This gives you greater confidence that production rollouts to SageMaker inference endpoints won’t cause performance regressions, and helps you avoid outages due to accidental misconfigurations.

In this post, we demonstrate this new SageMaker capability. The corresponding sample notebook is available in this GitHub repository.

Overview of solution

Your model serving infrastructure consists of the machine learning (ML) model, the serving container, and the compute instance. Let's consider the following scenarios:

  • You’re considering promoting a new model that has been validated offline to production, but want to evaluate operational performance metrics, such as latency, error rate, and so on, before making this decision.
  • You’re considering changes to your serving infrastructure container, such as patching vulnerabilities or upgrading to newer versions, and want to assess the impact of these changes prior to promotion to production.
  • You’re considering changing your ML instance and want to evaluate how the new instance would perform with live inference requests.

The following diagram illustrates our solution architecture.

For each of these scenarios, you select the production variant you want to test against, and SageMaker automatically deploys the new variant in shadow mode and routes a copy of the inference requests to it in real time within the same endpoint. Only the responses of the production variant are returned to the calling application. You can choose to discard or log the responses of the shadow variant for offline comparison. Optionally, you can monitor the variants through a built-in dashboard with a side-by-side comparison of the performance metrics. You can use this capability either through SageMaker inference update-endpoint APIs or through the SageMaker console.

Shadow variants build on top of the production variant capability in SageMaker inference endpoints. To reiterate, a production variant consists of the ML model, serving container, and ML instance. Because each variant is independent of others, you can have different models, containers, or instance types across variants. SageMaker lets you specify auto scaling policies on a per-variant basis so they can scale independently based on incoming load. SageMaker supports up to 10 production variants per endpoint. You can either configure a variant to receive a portion of the incoming traffic by setting variant weights, or specify the target variant in the incoming request. The response from the production variant is forwarded back to the invoker.
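
For example, a caller can direct a single request to a specific variant by naming it in the invocation. The following is a minimal sketch using the SageMaker runtime client; the endpoint name, variant name, and payload are placeholders.

import boto3

sm_runtime = boto3.client("sagemaker-runtime")

# Placeholder endpoint name, variant name, and CSV payload
response = sm_runtime.invoke_endpoint(
    EndpointName="<ENDPOINT-NAME>",
    ContentType="text/csv",
    Body="0.5,1.2,3.4",
    TargetVariant="<VARIANT-NAME>",  # route this request to the named variant
)
print(response["Body"].read())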

A shadow variant has the same components as a production variant. A user-specified portion of the requests, known as the traffic sampling percentage, is forwarded to the shadow variant. You can choose to log the response of the shadow variant in Amazon Simple Storage Service (Amazon S3) or discard it.

Note that SageMaker supports a maximum of one shadow variant per endpoint. For an endpoint with a shadow variant, there can be a maximum of one production variant.

After you set up the production and shadow variants, you can monitor the invocation metrics for both production and shadow variants in Amazon CloudWatch under the AWS/SageMaker namespace. All updates to the SageMaker endpoint are orchestrated using blue/green deployments and occur without any loss in availability. Your endpoints will continue responding to production requests as you add, modify, or remove shadow variants.

You can use this capability in one of two ways:

  • Managed shadow testing using the SageMaker Console – You can leverage the console for a guided experience to manage the end-to-end journey of shadow testing. This lets you set up shadow tests for a predefined duration of time, monitor the progress through a live dashboard, clean up upon completion, and act on the results.
  • Self-service shadow testing using the SageMaker Inference APIs – If your deployment workflow already uses create/update/delete-endpoint APIs, you can continue using them to manage shadow variants.

In the following sections, we walk through each of these scenarios.

Scenario 1 – Managed shadow testing using the SageMaker Console

If you wish to have SageMaker manage the end-to-end workflow of creating, managing, and acting on the results of the shadow tests, consider using the Shadow tests capability in the Inference section of the SageMaker Console. As stated earlier, this enables you to set up shadow tests for a predefined duration of time, monitor the progress through a live dashboard, clean up upon completion, and act on the results. To learn more, visit the shadow tests section of the documentation for a step-by-step walkthrough of this capability.

Prerequisites

The models for production and shadow testing need to be created in SageMaker. Refer to the CreateModel API for details.

Step 1 – Create a shadow test

Navigate to the Inference section in the left navigation panel of the SageMaker console and choose Shadow tests. This takes you to a dashboard with all the scheduled, running, and completed shadow tests. Choose Create a shadow test, enter a name for the test, and choose Next.

This takes you to the shadow test settings page. You can choose an existing IAM role or create one that has the AmazonSageMakerFullAccess IAM policy attached. Next, choose Create a new endpoint and enter a name (xgb-prod-shadow-1). You can add one production and one shadow variant associated with this endpoint by choosing Add in the Variants section. You can select the models you have created in the Add Model dialog box. This creates a production or shadow variant. Optionally, you can change the instance type and count associated with each variant.

All the traffic goes to the production variant, and it responds to invocation requests. You can control the portion of requests routed to the shadow variant by changing the Traffic Sampling Percentage.

You can control the duration of the test from one hour to 30 days. If unspecified, it defaults to 7 days. After this period, the test is marked complete. If you are running a test on an existing endpoint, it will be rolled back to the state prior to starting the test upon completion.

You can optionally capture the requests and responses of the Shadow variant using the Data Capture options. If left unspecified, the responses of the shadow variant are discarded.

Step 2 – Monitor a shadow test

You can view the list of shadow tests by navigating to the Shadow Tests section under Inference. Click on the shadow test created in the previous step to view the details of a shadow test and monitor it while it is in progress or after it has completed.

The Metrics section provides a comparison of the key metrics and provides overlaid graphs between the production and shadow variants, along with descriptive statistics. You can compare invocation metrics such as ModelLatency and Invocation4xxErrors as well as instance metrics such as CPUUtilization and DiskUtilization.

Step 3 – Promote the Shadow variant to the new production variant

After comparing, you can either promote the shadow variant to be the new production variant or remove the shadow variant. For both options, select Mark Complete at the top of the page. This presents you with an option to either promote or remove the shadow variant.

If you choose to promote, you will be taken to a deployment page, where you can confirm the variant settings prior to deployment. Prior to deployment, we recommend sizing your shadow variants to be able to handle 100% of the invocation traffic. If you are not using shadow testing to evaluate alternate instance types or sizes, you can choose Retain production variant settings. Otherwise, you can choose Retain shadow variant settings; if you choose this option, ensure that your traffic sampling is set at 100%. Alternatively, you can specify the instance type and count if you wish to override these settings.

Once you confirm the deployment, SageMaker initiates an update to your endpoint to promote the shadow variant to the new production variant. As with all SageMaker updates, your endpoint remains operational during the update.

Scenario 2 – Shadow testing using SageMaker inference APIs

This section covers how to use the existing SageMaker create/update/delete-endpoint APIs to deploy shadow variants.

For this example, we have two XGBoost models that represent two different versions of the models that have been pre-trained. model.tar.gz is the model currently deployed in production. model2 is the newer model, and we want to test its performance in terms of operational metrics such as latency before deciding to use it in production. We deploy model2 as a shadow variant of model.tar.gz. Both pre-trained models are stored in the public S3 bucket s3://sagemaker-sample-files. We first download the models to our local compute instance and then upload them to Amazon S3.

The models in this example are used to predict the probability of a mobile customer leaving their current mobile operator. The dataset we use is publicly available and was mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. These models were trained using the XGB Churn Prediction Notebook in SageMaker. You can also use your own pre-trained models, in which case you can skip downloading from s3://sagemaker-sample-files and copy your own models directly to the model/ folder.

!aws s3 cp s3://sagemaker-sample-files/models/xgb-churn/xgb-churn-prediction-model.tar.gz model/
!aws s3 cp s3://sagemaker-sample-files/models/xgb-churn/xgb-churn-prediction-model2.tar.gz model/

Step 1 – Create models

We upload the model files to our own S3 bucket and create two SageMaker models. See the following code:

model_url = S3Uploader.upload(
    local_path="model/xgb-churn-prediction-model.tar.gz",
    desired_s3_uri=f"s3://{bucket}/{prefix}",
)
model_url2 = S3Uploader.upload(
    local_path="model/xgb-churn-prediction-model2.tar.gz",
    desired_s3_uri=f"s3://{bucket}/{prefix}",
)

from sagemaker import image_uris
image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "0.90-1")
image_uri2 = image_uris.retrieve("xgboost", boto3.Session().region_name, "0.90-2")

model_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
model_name2 = f"DEMO-xgb-churn-pred2-{datetime.now():%Y-%m-%d-%H-%M-%S}"

resp = sm.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    Containers=[{"Image": image_uri, "ModelDataUrl": model_url}],
)

resp = sm.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    Containers=[{"Image": image_uri2, "ModelDataUrl": model_url2}],
)

Step 2 – Deploy the two models as production and shadow variants to a real-time inference endpoint

We create an endpoint config with the production and shadow variants. The ProductionVariants and ShadowProductionVariants parameters are of particular interest. Both variants use ml.m5.xlarge instances with 4 vCPUs and 16 GiB of memory; in this example, the production variant starts with two instances and the shadow variant with one. See the following code:

ep_config_name = f"Shadow-EpConfig-{datetime.now():%Y-%m-%d-%H-%M-%S}"
production_variant_name = "production"
shadow_variant_name = "shadow"
create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=ep_config_name,
    # Type: Array of ProductionVariant (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html) objects
    ProductionVariants=[
        {
            "VariantName": production_variant_name,
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 1,
        }
    ],
    # Type: Array of ShadowProductionVariants
    ShadowProductionVariants=[
        {
            "VariantName": shadow_variant_name,
            "ModelName": model_name2,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.5,
            "InstanceType": "ml.m5.xlarge",
        }
    ],
)

Lastly, we create the endpoint with the production and shadow variants:

endpoint_name = f"xgb-prod-shadow-{datetime.now():%Y-%m-%d-%H-%M-%S}"
create_endpoint_api_response = sm.create_endpoint(
                                    EndpointName=endpoint_name,
                                    EndpointConfigName=ep_config_name,
                                )

Step 3 – Invoke the endpoint for testing

After the endpoint has been successfully created, you can begin invoking it. We send about 3,000 requests in a sequential way:

def invoke_endpoint(endpoint_name, wait_interval_sec=0.01, should_raise_exp=False):
    with open("test_data/test-dataset-input-cols.csv", "r") as f:
        for row in f:
            payload = row.rstrip("\n")
            try:
                for i in range(10): #send the same payload 10 times for testing purpose
                    response = sm_runtime.invoke_endpoint(
                        EndpointName=endpoint_name, ContentType="text/csv", Body=payload
                    )
            except Exception as e:
                print("E", end="", flush=True)
                if should_raise_exp:
                    raise e

invoke_endpoint(endpoint_name)

Step 4 – Compare metrics

Now that we have deployed both the production and shadow models, let’s compare the invocation metrics. For a list of invocation metrics available for comparison, refer to Monitor Amazon SageMaker with Amazon CloudWatch. Let’s start by comparing invocations between the production and shadow variants.

The InvocationsPerInstance metric refers to the number of invocations sent to the production variant. A fraction of these invocations, specified in the variant weight, are sent to the shadow variant. The invocation per instance is calculated by dividing the total number of invocations by the number of instances in a variant. As shown in the following charts, we can confirm that both the production and shadow variants are receiving invocation requests according to the weights specified in the endpoint config.

Next, let’s compare the model latency (ModelLatency metric) between the production and shadow variants. Model latency is the time taken by a model to respond as viewed from SageMaker. We can observe how the model latency of the shadow variant compares with the production variant without exposing end-users to the shadow variant.

We expect the overhead latency (OverheadLatency metric) to be comparable across production and shadow variants. Overhead latency is the interval measured from the time SageMaker receives the request until it returns a response to the client, minus the model latency.
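
As a sketch of how this comparison could be scripted, the following pulls the average ModelLatency for each variant over the last hour from CloudWatch. The variant names come from the endpoint config created earlier, and the dimension names follow the AWS/SageMaker endpoint invocation metrics.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def average_model_latency(endpoint_name, variant_name):
    # Average ModelLatency (reported in microseconds) for one variant over the last hour
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    return sum(d["Average"] for d in datapoints) / len(datapoints) if datapoints else None

print("production:", average_model_latency(endpoint_name, production_variant_name))
print("shadow:", average_model_latency(endpoint_name, shadow_variant_name))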

Step 5 – Promote your shadow variant

To promote the shadow model to production, create a new endpoint configuration with the current ShadowProductionVariant as the new ProductionVariant and remove the ShadowProductionVariant. This removes the current ProductionVariant and promotes the shadow variant to become the new production variant. As always, all SageMaker updates are orchestrated as blue/green deployments under the hood, and there is no loss of availability while performing the update.

Optionally, you can leverage Deployment Guardrails if you want to use all-at-once traffic shifting and auto rollbacks during your update.

promote_ep_config_name = f"PromoteShadow-EpConfig-{datetime.now():%Y-%m-%d-%H-%M-%S}"

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=promote_ep_config_name,
    ProductionVariants=[
        {
            "VariantName": shadow_variant_name,
            "ModelName": model_name2,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 1.0,
        }
    ],
)
print(f"Created EndpointConfig: {create_endpoint_config_response['EndpointConfigArn']}")

update_endpoint_api_response = sm.update_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=promote_ep_config_name,
)

wait_for_endpoint_in_service(endpoint_name)

sm.describe_endpoint(EndpointName=endpoint_name)
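
Alternatively, if you want the guardrails mentioned above, you can pass a deployment configuration to the same update call so that traffic shifts all at once and the update rolls back automatically when a CloudWatch alarm fires. The following sketch shows the general shape of that request; the alarm name is a placeholder for an alarm you have already created.

# Optional: promote with all-at-once traffic shifting and automatic rollback
update_endpoint_api_response = sm.update_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=promote_ep_config_name,
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "ALL_AT_ONCE",
                "WaitIntervalInSeconds": 300,
            },
            "TerminationWaitInSeconds": 300,
        },
        "AutoRollbackConfiguration": {
            # Placeholder: an existing CloudWatch alarm that signals a bad deployment
            "Alarms": [{"AlarmName": "<YOUR-ENDPOINT-ALARM-NAME>"}],
        },
    },
)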

Step 6 – Clean Up

If you do not plan to use this endpoint further, you should delete the endpoint to avoid incurring additional charges and clean up other resources created in this blog.

sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=ep_config_name)
sm.delete_endpoint_config(EndpointConfigName=promote_ep_config_name)
sm.delete_model(ModelName=model_name)
sm.delete_model(ModelName=model_name2)

Conclusion

In this post, we introduced a new capability of SageMaker inference to compare the performance of a new version of a model serving stack with the currently deployed version prior to a full production rollout, using a deployment safety practice known as shadow testing. We walked you through the advantages of using shadow variants and methods to configure the variants with an end-to-end example. To learn more about shadow variants, refer to the shadow tests documentation.


About the Authors

Raghu Ramesha is a Machine Learning Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his spare time, he enjoys traveling and photography.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Qiyun Zhao is a Senior Software Development Engineer with the Amazon SageMaker Inference Platform team. He is the lead developer of the Deployment Guardrails and Shadow Deployments, and he focuses on helping customers to manage ML workloads and deployments at scale with high availability. He also works on platform architecture evolutions for fast and secure ML jobs deployment and running ML online experiments at ease. In his spare time, he enjoys reading, gaming and traveling.

Tarun Sairam is a Senior Product Manager for Amazon SageMaker Inference. He is interested in learning about the latest trends in machine learning and helping customers leverage them. In his spare time, he enjoys biking, skiing, and playing tennis.

Read More

Improve governance of your machine learning models with Amazon SageMaker

As companies are increasingly adopting machine learning (ML) for their mainstream enterprise applications, more of their business decisions are influenced by ML models. As a result of this, having simplified access control and enhanced transparency across all your ML models makes it easier to validate that your models are performing well and take action when they are not.

In this post, we explore how companies can improve visibility into their models with centralized dashboards and detailed documentation of their models using two new features: SageMaker Model Cards and the SageMaker Model Dashboard. Both these features are available at no additional charge to SageMaker customers.

Overview of model governance

Model governance is a framework that gives systematic visibility into model development, validation, and usage. Model governance is applicable across the end-to-end ML workflow, starting from identifying the ML use case to ongoing monitoring of a deployed model through alerts, reports, and dashboards. A well-implemented model governance framework should minimize the number of interfaces required to view, track, and manage lifecycle tasks to make it easier to monitor the ML lifecycle at scale.

Today, organizations invest significant technical expertise into building tooling to automate large portions of their governance and auditability workflow. For example, model builders need to proactively record model specifications such as intended use for a model, risk rating, and performance criteria a model should be measured against. Furthermore, they also need to record observations on model behavior, and document the reason they made certain key decisions such as the objective function they optimized the model against.

It’s common for companies to use tools like Excel or email to capture and share such model information for use in approvals for production usage. But as the scale of ML development increases, information can be easily lost or misplaced, and keeping track of these details becomes infeasible quickly. Furthermore, after these models are deployed, you might stitch together data from various sources to gain end-to-end visibility into all your models, endpoints, monitoring history, and lineage. Without such a view, you can easily lose track of your models, and may not be aware of when you need to take action on them. This issue is intensified in highly regulated industries because you’re subject to regulations that require you to keep such measures in place.

As the volume of models starts to scale, managing custom tooling can become a challenge and gives organizations less time to focus on core business needs. In the following sections, we explore how SageMaker Model Cards and the SageMaker Model Dashboard can help you scale your governance efforts.

SageMaker Model Cards

Model cards enable you to standardize how models are documented, thereby achieving visibility into the lifecycle of a model across design, building, training, and evaluation. Model cards are intended to be a single source of truth for business and technical metadata about the model that can reliably be used for auditing and documentation purposes. They provide a fact sheet of the model that is important for model governance.

Model cards allow users to author and store decisions such as why an objective function was chosen for optimization, and details such as intended usage and risk rating. You can also attach and review evaluation results, and jot down observations for future reference.

For models trained on SageMaker, Model cards can discover and auto-populate details such as training job, training datasets, model artifacts, and inference environment, thereby accelerating the process of creating the cards. With the SageMaker Python SDK, you can seamlessly update the Model card with evaluation metrics.

Model cards provide model risk managers, data scientists, and ML engineers the ability to perform the following tasks:

  • Document model requirements such as risk rating, intended usage, limitations, and expected performance
  • Auto-populate Model cards for SageMaker trained models
  • Bring your own info (BYOI) for non-SageMaker models
  • Upload and share model and data evaluation results
  • Define and capture custom information
  • Capture Model card status (draft, pending review, or approved for production)
  • Access the Model card hub from the AWS Management Console
  • Create, edit, view, export, clone, and delete Model cards
  • Trigger workflows using Amazon EventBridge integration for Model card status change events

Create SageMaker Model Cards using the console

You can easily create Model cards using the SageMaker console. Here you can see all the existing Model cards and create new ones as needed.

When creating a Model card, you can document critical model information such as who built the model, why it was developed, how it is performing for independent evaluations, and any observations that need to be considered prior to using the model for a business application.

To create a Model card on the console, complete the following steps:

  1. Enter model overview details.
  2. Enter training details (auto-populated if the model was trained on SageMaker).
  3. Upload evaluation results.
  4. Add additional details such as recommendations and ethical considerations.

After you create the Model card, you can choose a version to view it.

The following screenshot shows the details of our Model card.

You can also export the Model card to be shared as a PDF.

Create and explore SageMaker Model Cards through the SageMaker Python SDK

Interacting with Model cards isn’t limited to the console. You can also use the SageMaker Python SDK to create and explore Model cards. The SageMaker Python SDK allows data scientists and ML engineers to easily interact with SageMaker components. The following code snippets showcase the process to create a Model card using the newly added SageMaker Python SDK functionality.

Make sure you have the latest version of the SageMaker Python SDK installed:

$ pip install --upgrade "sagemaker>=2"

Once you have trained and deployed a model using SageMaker, you can use the information from the SageMaker model and the training job to automatically populate information into the Model card.

Using the SageMaker Python SDK and passing the SageMaker model name, we can automatically collect basic model information. Information such as the SageMaker model ARN, training environment, and model output Amazon Simple Storage Service (Amazon S3) location is all automatically populated. We can add other model facts, such as description, problem type, algorithm type, model creator, and owner. See the following code:

model_overview = ModelOverview.from_name(
    model_name=model_name,
    sagemaker_session=sagemaker_session,
    model_description="This is a simple binary classification model used for Model Card demo",
    problem_type="Binary Classification",
    algorithm_type="Logistic Regression",
    model_creator="DEMO-ModelCard",
    model_owner="DEMO-ModelCard",
)
print(model_overview.model_id) # Provides us with the SageMaker Model ARN
print(model_overview.inference_environment.container_image) # Provides us with the SageMaker inference container URI
print(model_overview.model_artifact) # Provides us with the S3 location of the model artifacts

We can also automatically collect basic training information like training job ARN, training environment and training metrics. Additional training details can be added, like training objective function and observations. See the following code:

objective_function = ObjectiveFunction(
    function=Function(
        function=ObjectiveFunctionEnum.MINIMIZE,
        facet=FacetEnum.LOSS,
    ),
    notes="This is a example objective function.",
)
training_details = TrainingDetails.from_model_overview(
    model_overview=model_overview,
    sagemaker_session=sagemaker_session,
    objective_function=objective_function,
    training_observations="Additional training observations could be put here."
)

print(training_details.training_job_details.training_arn) # Provides us with the SageMaker Training Job ARN
print(training_details.training_job_details.training_environment.container_image) # Provides us with the SageMaker training container URI
print([{"name": i.name, "value": i.value} for i in training_details.training_job_details.training_metrics]) # Provides us with the SageMaker Training Job metrics

If we have evaluation metrics available, we can add those to the Model card as well:

my_metric_group = MetricGroup(
    name="binary classification metrics",
    metric_data=[Metric(name="accuracy", type=MetricTypeEnum.NUMBER, value=0.5)]
)
evaluation_details = [
    EvaluationJob(
        name="Example evaluation job",
        evaluation_observation="Evaluation observations.",
        datasets=["s3://path/to/evaluation/data"],
        metric_groups=[my_metric_group],
    )
]

We can also add additional information about the model that can help with model governance:

intended_uses = IntendedUses(
    purpose_of_model="Test Model Card.",
    intended_uses="Not used except this test.",
    factors_affecting_model_efficiency="No.",
    risk_rating=RiskRatingEnum.LOW,
    explanations_for_risk_rating="Just an example.",
)
additional_information = AdditionalInformation(
    ethical_considerations="Your model's ethical considerations.",
    caveats_and_recommendations="Your model's caveats and recommendations.",
    custom_details={"custom details1": "details value"},
)

After we have provided all the details we require, we can create the Model card using the preceding configuration:

model_card_name = "sample-notebook-model-card"
my_card = ModelCard(
    name=model_card_name,
    status=ModelCardStatusEnum.DRAFT,
    model_overview=model_overview,
    training_details=training_details,
    intended_uses=intended_uses,
    evaluation_details=evaluation_details,
    additional_information=additional_information,
    sagemaker_session=sagemaker_session,
)
my_card.create()

The SageMaker SDK also provides the ability to update, load, list, export, and delete a Model card.
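
For instance, a card created earlier can be loaded back by name, moved out of draft, and exported as a PDF to Amazon S3. The following is a minimal sketch of that flow with the SageMaker Python SDK; the S3 output path is a placeholder, and exact method signatures may vary slightly by SDK version.

from sagemaker.model_card import ModelCard, ModelCardStatusEnum

# Load the card created earlier by name
my_card = ModelCard.load(
    name=model_card_name,
    sagemaker_session=sagemaker_session,
)

# Move the card out of draft and persist the change
my_card.status = ModelCardStatusEnum.PENDING_REVIEW
my_card.update()

# Export the card as a PDF to a placeholder S3 location
my_card.export_pdf(s3_output_path="s3://<S3-BUCKET-NAME>/model-card-exports")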

To learn more about Model cards, refer to the developer guide and follow this example notebook to get started.

SageMaker Model Dashboard

The Model dashboard is a centralized repository of all models that have been created in the account. The models are usually created by training on SageMaker, or you can bring your models trained elsewhere to host on SageMaker.

The Model dashboard provides a single interface for IT administrators, model risk managers, or business leaders to view all deployed models and how they’re performing. You can view your endpoints, batch transform jobs, and monitoring jobs to get insights into model performance. Organizations can dive deep to identify which models have missing or inactive monitors and add them using SageMaker APIs to ensure all models are being checked for data drift, model drift, bias drift, and feature attribution drift.

The following screenshot shows an example of the Model dashboard.

The Model dashboard provides an overview of all your models, what their risk rating is, and how those models are performing in production. It does this by pulling information from across SageMaker. The performance monitoring information is captured through Amazon SageMaker Model Monitor, and you can also see information on models invoked for batch predictions through SageMaker batch transform jobs. Lineage information such as how the model was trained, data used, and more is captured, and information from Model cards is pulled as well.

Model Monitor monitors the quality of SageMaker models used in production for batch inference or real-time endpoints. You can set up continuous monitoring or scheduled monitors via SageMaker APIs, and edit the alert settings through the Model dashboard. You can set alerts that notify you when there are deviations in the model quality. Early and proactive detection of these deviations enables you to take corrective actions, such as retraining models, auditing upstream systems, or fixing quality issues without having to monitor models manually or build additional tooling. The Model dashboard gives you quick insight into which models are being monitored and how they are performing. For more information on Model Monitor, visit Monitor models for data and model quality, bias, and explainability.

When you choose a model in the Model dashboard, you can get deeper insights into the model, such as the Model card (if one exists), model lineage, details about the endpoint the model has been deployed to, and the monitoring schedule for the model.

This view allows you to create a Model card if needed. The monitoring schedule can be activated, deactivated, or edited as well through the Model dashboard.

For models that don’t have a monitoring schedule, you can set this up by enabling Model Monitor for the endpoint the model has been deployed to. Through the alert details and status, you will be notified of models that are showing data drift, model drift, bias drift, or feature drift, depending on which monitors you set up.

Let's look at an example workflow of how to set up model monitoring; a code sketch follows the list. The key steps of this process are:

  1. Capture data sent to the endpoint (or batch transform job).
  2. Establish a baseline (for each of the monitoring types).
  3. Create a Model Monitor schedule to compare the live predictions against the baseline to report violations and trigger alerts.
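
A minimal sketch of steps 2 and 3 with the SageMaker Python SDK might look like the following. It assumes the endpoint already has data capture enabled (step 1), that role holds a SageMaker execution role ARN, and the S3 paths, dataset, endpoint name, and schedule name are placeholders.

from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

# Step 2: suggest a data quality baseline from the training dataset (placeholder paths)
monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)
monitor.suggest_baseline(
    baseline_dataset="s3://<S3-BUCKET-NAME>/baseline/training-dataset.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://<S3-BUCKET-NAME>/baseline/results",
)

# Step 3: compare captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    monitor_schedule_name="demo-data-quality-schedule",
    endpoint_input="<ENDPOINT-NAME>",
    output_s3_uri="s3://<S3-BUCKET-NAME>/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,
)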

Based on the alerts, you can take actions like rolling back the endpoint to a previous version or retraining the model with new data. While doing this, it may be necessary to trace how the model was trained, which can be done by visualizing the model’s lineage.

The Model dashboard offers a rich set of information regarding the overall model ecosystem in an account, in addition to the ability to drill into the specific details of a model. To learn more about the Model dashboard, refer to the developer guide.

Conclusion

Model governance is complex and often involves lots of customized needs specific to an organization or an industry. This could be based on the regulatory requirements your organization needs to comply with, the types of personas present in the organization, and the types of models being used. There’s no one-size-fits-all approach to governance, and it’s important to have the right tools available so that a robust governance process can be put into place.

With the purpose-built ML governance tools in SageMaker, organizations can implement the right mechanisms to improve control and visibility over ML projects for their specific use cases. Give Model cards and the Model dashboard a try, and leave your comments with questions and feedback. To learn more about Model cards and the Model dashboard, refer to the developer guide.


About the authors

Kirit Thadaka is an ML Solutions Architect working in the SageMaker Service SA team. Prior to joining AWS, Kirit worked in early-stage AI startups followed by some time consulting in various roles in AI research, MLOps, and technical leadership.

Marc Karp is a ML Architect with the SageMaker Service team. He focuses on helping customers design, deploy and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Ram Vittal is an ML Specialist Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys tennis, photography, and action movies.

Sahil Saini is an ISV Solution Architect at Amazon Web Services . He works with AWS strategic customers product and engineering teams to help them with technology solutions using AWS services for AI/ML, Containers, HPC and IoT. He has helped set up AI/ML platforms for enterprise customers.

Read More

Define customized permissions in minutes with Amazon SageMaker Role Manager

Administrators of machine learning (ML) workloads are focused on ensuring that users operate in the most secure manner, striving toward a principle of least privilege design. They have a wide variety of personas to account for, each with their own unique sets of needs, and building the right sets of permissions policies to meet those needs can sometimes be an inhibitor to agility. In this post, we look at how to use Amazon SageMaker Role Manager to quickly build out a set of persona-based roles that can be further customized to your specific requirements in minutes, right on the Amazon SageMaker console.

Role Manager offers predefined personas and ML activities combined with a wizard to streamline your permission generation process, allowing your ML practitioners to perform their responsibilities with the minimal necessary permissions. If you require additional customization, SageMaker Role Manager allows you to specify networking and encryption permissions for Amazon Virtual Private Cloud (Amazon VPC) resources and AWS Key Management Service (AWS KMS) encryption keys, and attach your custom policies.

In this post, you walk through how to use SageMaker Role Manager to create a data scientist role for accessing Amazon SageMaker Studio, while maintaining a set of minimal permissions to perform their necessary activities.

Solution overview

In this walkthrough, you perform all the steps to grant permissions to an ML administrator, create a service role for accessing required dependencies for building and training models, and create execution roles for users to assume inside of Studio to perform their tasks. If your ML practitioners access SageMaker via the AWS Management Console, you can create the permissions to allow access or grant access through IAM Identity Center (Successor to AWS Single Sign-On).

Personas

A persona is an entity that needs to perform a set of ML activities and uses a role to grant them permissions. SageMaker Role Manager provides you with a set of predefined persona templates for common use cases, or you can build your own custom persona.

There are several personas currently supported, including:

  • Data scientist – A persona that performs ML activities from within a SageMaker environment. They’re permitted to process Amazon Simple Storage Service (Amazon S3) data, perform experiments, and produce models.
  • MLOps – A persona that deals with operational activities from within a SageMaker environment. They’re permitted to manage models, endpoints, and pipelines, and audit resources.
  • SageMaker compute role – A persona used by SageMaker compute resources such as jobs and endpoints. They’re permitted to access Amazon S3 resources, Amazon Elastic Container Registry (Amazon ECR) repositories, Amazon CloudWatch, and other services for ML computation.
  • Custom role settings – This persona has no pre-selected settings or default options. It offers complete customization starting with empty settings.

For a comprehensive list of personas and additional details, refer to the persona reference of the SageMaker Role Manager Developer Guide.

ML activities

ML activities are predefined sets of permissions tailored to common ML tasks. Personas are composed of one or more ML activities to grant permissions.

For example, the data scientist persona uses the following ML activities:

  • Run Studio Applications – Permissions to operate within a Studio environment. Required for domain and user-profile execution roles.
  • Manage Experiments – Permissions to manage experiments and trials.
  • Manage ML Jobs – Permissions to audit, query lineage, and visualize experiments.
  • Manage Models – Permissions to manage SageMaker jobs across their lifecycles.
  • Manage Pipelines – Permissions to manage SageMaker pipelines and pipeline executions.
  • S3 Bucket Access – Permissions to perform operations on specified buckets.

There are many more ML activities available than the ones that are listed here. To see the full list along with template policy details, refer to the ML Activity reference of the SageMaker Role Manager Developer Guide.

The following figure demonstrates the entire scope of this post, where you first create a service execution role to allow users to PassRole for access to underlying services and then create a user execution role to grant permissions for your ML practitioners to perform their required ML activities.

Prerequisites

You need to ensure that you have a role for your ML administrator to create and manage personas, as well as the AWS Identity and Access Management (IAM) permissions for those users.

An example IAM policy for an ML administrator may look like the following code. Note that the following policy locks down Studio domain creation to VPC only. Although this is a best practice for controlling network access, you need to remove the LockDownStudioDomainCreateToVPC statement if your implementation doesn’t use a VPC-based Studio domain.

{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Sid": "LockDownStudioDomainCreateToVPC",
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:CreateDomain"
            ],
            "Resource":
            [
                "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:domain/*"
            ],
            "Condition":
            {
                "StringEquals":
                {
                    "sagemaker:AppNetworkAccessType": "VpcOnly"
                }
            }
        },
        {
            "Sid": "StudioUserProfilePerm",
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:CreateUserProfile"
            ],
            "Resource":
            [
                "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:user-profile/*"
            ]
        },
        {
            "Sid": "AllowFileSystemPermissions",
            "Effect": "Allow",
            "Action":
            [
                "elasticfilesystem:CreateFileSystem"
            ],
            "Resource": "arn:aws:elasticfilesystem:<REGION>:<ACCOUNT-ID>:file-system/*"
        },
        {
            "Sid": "KMSPermissionsForSageMaker",
            "Effect": "Allow",
            "Action":
            [
                "kms:CreateGrant",
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:Encrypt",
                "kms:GenerateDataKey",
                "kms:RetireGrant",
                "kms:ReEncryptTo",
                "kms:ListGrants",
                "kms:RevokeGrant",
                "kms:GenerateDataKeyWithoutPlainText"
            ],
            "Resource":
            [
                "arn:aws:kms:<REGION>:<ACCOUNT-ID>:key/<KMS-KEY-ID>"
            ]
        },
        {
            "Sid": "AmazonSageMakerPresignedUrlPolicy",
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:CreatePresignedDomainUrl"
            ],
            "Resource":
            [
                "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:user-profile/*"
            ]
        },
        {
            "Sid": "AllowRolePerm",
            "Effect": "Allow",
            "Action":
            [
                "iam:PassRole",
                "iam:GetRole"
            ],
            "Resource":
            [
                "arn:aws:iam::<ACCOUNT-ID>:role/*"
            ]
        },
        {
            "Sid": "ListExecutionRoles",
            "Effect": "Allow",
            "Action":
            [
                "iam:ListRoles"
            ],
            "Resource":
            [
                "arn:aws:iam::<ACCOUNT-ID>:role/*"
            ]
        },
        {
            "Sid": "SageMakerApiListDomain",
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:ListDomains"
            ],
            "Resource": "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:domain/*"
        },
        {
            "Sid": "VpcConfigurationForCreateForms",
            "Effect": "Allow",
            "Action":
            [
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeSecurityGroups"
            ],
            "Resource": "*"
        },
        {
            "Sid": "KmsKeysForCreateForms",
            "Effect": "Allow",
            "Action":
            [
                "kms:DescribeKey",
                "kms:ListAliases"
            ],
            "Resource":
            [
                "arn:aws:kms:<REGION>:<ACCOUNT-ID>:key/*"
            ]
        },
        {
            "Sid": "KmsKeysForCreateForms2",
            "Effect": "Allow",
            "Action":
            [
                "kms:ListAliases"
            ],
            "Resource":
            [
                "*"
            ]
        },
        {
            "Sid": "StudioReadAccess",
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:ListDomains",
                "sagemaker:ListApps",
                "sagemaker:DescribeDomain",
                "sagemaker:DescribeUserProfile",
                "sagemaker:ListUserProfiles",
                "sagemaker:EnableSagemakerServicecatalogPortfolio",
                "sagemaker:GetSagemakerServicecatalogPortfolioStatus"
            ],
            "Resource": "*"
        },
        {
            "Sid": "SageMakerProjectsSC",
            "Effect": "Allow",
            "Action":
            [
                "servicecatalog:AcceptPortfolioShare",
                "servicecatalog:ListAcceptedPortfolioShares",
                "servicecatalog:Describe*",
                "servicecatalog:List*",
                "servicecatalog:ScanProvisionedProducts",
                "servicecatalog:SearchProducts",
                "servicecatalog:SearchProvisionedProducts",
                "cloudformation:GetTemplateSummary",
                "servicecatalog:ProvisionProduct",
                "cloudformation:ListStackResources",
                "servicecatalog:AssociatePrincipalWithPortfolio"
            ],
            "Resource": "*"
        },
        {
            "Action":
            [
                "s3:CreateBucket",
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:GetBucketCors",
                "s3:PutBucketCors",
                "s3:GetBucketAcl",
                "s3:PutObjectAcl"
            ],
            "Effect": "Allow",
            "Resource":
            [
                "arn:aws:s3:::<S3-BUCKET-NAME>",
                "arn:aws:s3:::<S3-BUCKET-NAME>/*"
            ]
        }
    ]
}

Create a service role for passing to jobs and endpoints

When creating roles for your ML practitioners to perform activities in SageMaker, they need to pass permissions to a service role that has access to manage the underlying infrastructure. This service role can be reused, and doesn't need to be created for every use case. In this section, you create a service role and then reference it when you create your other personas via PassRole. If you already have an appropriate service role, you can use it instead of creating another one.

  1. On the SageMaker console, choose Getting Started in the navigation bar.
  2. Under Configure role, choose Create a role.

  3. For Role name suffix, give your role a name, which becomes the suffix of the IAM role name created for you. For this post, we enter SageMaker-demoComputeRole.
  4. Choose SageMaker Compute Role as your persona.
  5. Optionally, configure the networking and encryption settings to use your desired resources.
  6. Choose Next.

    In the Configure ML activities section, you can see that the ML activity for Access Required AWS Services is already preselected for the SageMaker Compute Role persona.
    Because the Access Required AWS Services ML activity is selected, further options appear.
  7. Enter the appropriate S3 bucket ARNs and Amazon ECR ARNs that this service role will be able to access.
    You can add multiple values by choosing Add in each section.
  8. After you have filled in the required values, choose Next.
  9. In the Add additional policies & tags section, choose any other policies your service role might need.
  10. Choose Next.
  11. In the Review role section, verify that your configuration is correct, then choose Submit.
    The last thing you need to do for the service role is note down the role ARN so you can use it later in your data scientist persona role creation process.
  12. To view the role in IAM, choose Go to Role in the success banner or alternatively search for the name you gave your service role persona on the IAM console.
  13. On the IAM console, note the role’s ARN in the ARN section.

You enter this ARN later when creating your other persona-based roles.
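
If you prefer to retrieve the ARN programmatically instead of from the console, a short boto3 call like the following works; the role name is a placeholder for the exact name Role Manager created in your account.

import boto3

iam = boto3.client("iam")

# Placeholder: the exact IAM role name created for your SageMaker compute persona
service_role_arn = iam.get_role(RoleName="SageMaker-demoComputeRole")["Role"]["Arn"]
print(service_role_arn)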

Create an execution role for data scientists

Now that you have created the base service roles for your other personas to use, you can create your role for data scientists.

  1. On the SageMaker console, choose Getting Started in the navigation bar.
  2. Under Configure role, choose Create a role.
  3. For Role name suffix, give your role a name, for example, SageMaker-dataScientistRole.
    Note that this resulting name needs to be unique across your existing roles, or persona creation will fail.
  4. Optionally, add a description.
  5. Choose a base persona template to give your persona a baseline set of ML activities. In this example, we choose Data Scientist.
  6. Optionally, in the Network setup section, specify the specific VPC subnets and security groups that the persona can access for resources that support them.
  7. In the Encryption setup, you can optionally choose one or more data encryption and volume encryption keys for services that support encryption at rest.
  8. After you have completed customizing your persona, choose Next.

    In the Configure ML activities section, one or more ML activities are pre-selected based on your baseline persona template.
  9. In this section, you can add or remove additional ML activities to tailor this role to your specific use case.

    Certain ML activities require additional information to complete the role setup. For example, selecting the S3 Bucket Access ML activity requires you to specify a list of S3 buckets to grant access to. Other ML activities may require a PassRoles entry, which allows the persona to pass its permissions to a service role that performs actions on its behalf. In our example, the Manage ML Jobs ML activity requires a PassRoles entry.
  10. Enter the role ARN for the service role you created earlier.
    You can add multiple entries by choosing Add, which creates an array of the specified values in the resulting role.
  11. After you have selected all the appropriate ML activities and supplied the necessary values, choose Next.
  12. In the Add additional policies section, choose any other policies your execution role might need. You can also add tags to your execution role.
  13. Choose Next.
  14. In the Review Role section, verify that the persona configuration details are accurate, then choose Submit.

View and add final customizations to your new role

After submitting your persona, you can go to the IAM console and see the resulting role and policies that were created for you, as well as make further modifications. To get to the new role in IAM, choose Go to role in the success banner.

On the IAM console, you can view your newly created role along with the attached policies that map to the ML activities you selected in Role Manager. You can change the existing policies here by selecting a policy and editing its document. The role can also be recreated via Infrastructure as Code (IaC) by copying the contents of the policy documents into your existing solution.
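If you want to reuse the generated policies in your IaC tooling, you can export the attached policy documents programmatically. The following is a minimal sketch using boto3; the role name is a placeholder, and pagination and inline policies are omitted for brevity:

import json
import boto3

iam = boto3.client("iam")
role_name = "<EXECUTION-ROLE-NAME>"  # placeholder for the role created by Role Manager

# Print the document of each managed policy attached to the role
for policy in iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]:
    policy_arn = policy["PolicyArn"]
    default_version = iam.get_policy(PolicyArn=policy_arn)["Policy"]["DefaultVersionId"]
    document = iam.get_policy_version(
        PolicyArn=policy_arn, VersionId=default_version
    )["PolicyVersion"]["Document"]
    print(policy["PolicyName"])
    print(json.dumps(document, indent=4))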

Link the new role to a user

In order for your users to access Studio, they need to be associated with the user execution role you created (in this example, based on the data scientist persona). The method of associating the user with the role varies based on the authentication method you set up for your Studio domain, either IAM or IAM Identity Center. You can find the authentication method under the Domain section in the Studio Control Panel, as shown in the following screenshots.

Depending on your authentication method, proceed to the appropriate subsection.

Access Studio via IAM

Note that if you’re using the IAM Identity Center integration with Studio, the IAM role in this section isn’t necessary. Proceed to the next section.

SageMaker Role Manager creates execution roles for access to AWS services. To allow your data scientists to assume their given persona via the console, they require a console role to get to the Studio environment.

The following example role gives the necessary permissions to allow a data scientist to access the console and assume their persona’s role inside of Studio:

{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Sid": "DescribeCurrentDomain",
            "Effect": "Allow",
            "Action": "sagemaker:DescribeDomain",
            "Resource": "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:domain/<STUDIO-DOMAIN-ID>"
        },
        {
            "Sid": "RemoveErrorMessagesFromConsole",
            "Effect": "Allow",
            "Action":
            [
                "servicecatalog:ListAcceptedPortfolioShares",
                "sagemaker:GetSagemakerServicecatalogPortfolioStatus",
                "sagemaker:ListModels",
                "sagemaker:ListTrainingJobs",
                "servicecatalog:ListPrincipalsForPortfolio",
                "sagemaker:ListNotebookInstances",
                "sagemaker:ListEndpoints"
            ],
            "Resource": "*"
        },
        {
            "Sid": "RequiredForAccess",
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:ListDomains",
                "sagemaker:ListUserProfiles"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CreatePresignedURLForAccessToDomain",
            "Effect": "Allow",
            "Action": "sagemaker:CreatePresignedDomainUrl",
            "Resource": "arn:aws:sagemaker:<REGION>:<ACCOUNT-ID>:user-profile/<STUDIO-DOMAIN-ID>/<PERSONA_NAME>"
        }
    ]
}

The statement labeled RemoveErrorMessagesFromConsole can be removed without affecting the ability to get into Studio, but removing it will result in API error messages in the console UI.

Sometimes administrators give ML practitioners access to the console to debug issues with their Studio environment. In this scenario, you want to grant additional permissions to view Amazon CloudWatch and AWS CloudTrail logs.

The following code is an example of a read-only CloudWatch Logs access policy:

{
"Version": "2012-10-17",
    "Statement": [
        {
        "Action": [
                "logs:Describe*",
                "logs:Get*",
                "logs:List*",
                "logs:StartQuery",
                "logs:StopQuery",
                "logs:TestMetricFilter",
                "logs:FilterLogEvents"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

For additional information on CloudWatch Logs policies, refer to Customer managed policy examples.

The following code is an example read-only CloudTrail access policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudtrail:Get*",
                "cloudtrail:Describe*",
                "cloudtrail:List*",
                "cloudtrail:LookupEvents"
            ],
            "Resource": "*"
        }
    ]
}

For more details and example policies, refer to Identity and Access Management for AWS CloudTrail.

  1. In the Studio Control Panel, choose Add User to create your new data scientist user.
  2. For Name, give your user a name.
  3. For Default execution role, choose the persona role that you created earlier.
  4. Choose Next.
  5. Choose the appropriate JupyterLab version, and whether to enable Amazon SageMaker JumpStart and SageMaker project templates.
  6. Choose Next.
  7. This post assumes you’re not using RStudio, so choose Next again to skip RStudio configuration.
  8. Choose whether to enable Amazon SageMaker Canvas support, and additionally whether to allow for time series forecasting in Canvas.
  9. Choose Submit.
    You can now see your new data science user in the Studio Control Panel.
  10. To test this user, on the Launch app menu, choose Studio.
    This redirects you to the Studio console as the selected user with their persona’s permissions.
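If you prefer to automate user creation rather than use the console, the same association can be made with the CreateUserProfile API. The following is a minimal sketch using boto3; the domain ID, user profile name, and execution role ARN are placeholders for the values from your environment:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_user_profile(
    DomainId="<STUDIO-DOMAIN-ID>",
    UserProfileName="data-scientist-1",  # hypothetical user profile name
    UserSettings={
        # Execution role created earlier from the Data Scientist persona
        "ExecutionRole": "arn:aws:iam::<ACCOUNT-ID>:role/<EXECUTION-ROLE-NAME>"
    },
)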

Access Studio via IAM Identity Center

Assigning IAM Identity Center users to execution roles requires them to first exist in the IAM Identity Center directory. If they don’t exist, contact your identity administrator or refer to Manage identities in IAM Identity Center for instructions.

Note that in order to use the IAM Identity Center authentication method, its directory and your Studio domain must be in the same AWS Region.

  1. To assign IAM Identity Center users to your Studio domain, choose Assign users and groups in the Studio Control Panel.
  2. Select your data scientist user, then choose Assign users and groups.
  3. After the user has been added to the Studio Control panel, choose the user to open the user details screen.
  4. On the User details page, choose Edit.
  5. On the Edit user profile page, under General settings, change the Default execution role to match the user execution role you created for your data scientists.
  6. Choose Next.
  7. Choose Next through the rest of the settings pages, then choose Submit to save your changes.

Now, when your data scientist logs into the IAM Identity Center portal, they will see a tile for this Studio domain. Choosing that tile logs them in to Studio with the user execution role you assigned to them.
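If you prefer to script this change, the same update can be made with the UpdateUserProfile API once the IAM Identity Center user has been assigned to the domain. The following is a minimal sketch using boto3 with placeholder values:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.update_user_profile(
    DomainId="<STUDIO-DOMAIN-ID>",
    UserProfileName="<IDENTITY-CENTER-USER-PROFILE-NAME>",
    UserSettings={
        "ExecutionRole": "arn:aws:iam::<ACCOUNT-ID>:role/<EXECUTION-ROLE-NAME>"
    },
)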

Test your new persona

After you’re logged in to Studio, you can use the following example notebook to validate the permissions that you granted to your data science user.

You can observe that the data scientist user can only perform the actions in the notebook that their role has been permitted. For example:

  • The user is blocked from running jobs without a VPC or AWS KMS configuration, if the role was customized to require them
  • The user can only access Amazon S3 resources if the corresponding ML activity was included in the role
  • The user can only deploy endpoints if the corresponding ML activity was included in the role
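As a quick illustration of the kind of checks the example notebook performs, the following sketch calls SageMaker APIs from within Studio and reports whether the persona's role permits them. This is a simplified, assumed pattern, not the example notebook itself:

import boto3
from botocore.exceptions import ClientError

sagemaker = boto3.client("sagemaker")

def check(description, api_call):
    # Run an API call and report whether the current execution role allows it
    try:
        api_call()
        print(f"ALLOWED: {description}")
    except ClientError as error:
        if error.response["Error"]["Code"] in ("AccessDeniedException", "AccessDenied"):
            print(f"DENIED:  {description}")
        else:
            raise

check("List training jobs", lambda: sagemaker.list_training_jobs(MaxResults=1))
check("List SageMaker endpoints", lambda: sagemaker.list_endpoints(MaxResults=1))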

Clean up

To clean up the resources you created in this walkthrough, complete the following steps:

  1. Remove the mapping of your new role to your users:
    1. If using Studio with IAM, delete any new Studio users you created.
    2. If using Studio with IAM Identity Center, detach the created execution role from your Studio users.
  2. On the IAM console, find your user execution role and delete it.
  3. On the IAM console, find your service role and delete it.
  4. If you created a new role for an ML administrator:
    1. Log out of your account as the ML administrator role, and back in as another administrator that has IAM permissions.
    2. Delete the ML administrator role that you created.

Conclusion

Until recently, in order to build out SageMaker roles with customized permissions, you had to start from scratch. With the new SageMaker Role Manager, you can use the combination of personas, pre-built ML activities, and custom policies to quickly generate customized roles in minutes. This allows your ML practitioners to start working in SageMaker faster.

To learn more about how to use SageMaker Role Manager, refer to the SageMaker Role Manager Developer Guide.


About the authors

Giuseppe Zappia is a Senior Solutions Architect at AWS, with over 20 years of experience in full stack software development, distributed systems design, and cloud architecture. In his spare time, he enjoys playing video games, programming, watching sports, and building things.

Ram Vittal is a Principal ML Solutions Architect at AWS. He has over 20 years of experience architecting and building distributed, hybrid, and cloud applications. He is passionate about building secure and scalable AI/ML and big data solutions to help enterprise customers with their cloud adoption and optimization journey to improve their business outcomes. In his spare time, he enjoys riding motorcycles, playing tennis, and photography.

Arvind Sowmyan is a Senior Software Development Engineer on the SageMaker Model Governance team where he specializes in building scalable webservices with a focus on enterprise security. Prior to this, he worked on the Training Jobs platform where he was a part of the SageMaker launch team. In his spare time, he enjoys illustrating comics, exploring virtual reality and tinkering with large language models.

Ozan Eken is a Senior Product Manager at Amazon Web Services. He is passionate about building governance products in Machine Learning for enterprise customers. Outside of work, he likes exploring different outdoor activities and watching soccer.


Build an agronomic data platform with Amazon SageMaker geospatial capabilities

Build an agronomic data platform with Amazon SageMaker geospatial capabilities

The world is at increasing risk of global food shortage as a consequence of geopolitical conflict, supply chain disruptions, and climate change. Simultaneously, there’s an increase in overall demand from population growth and shifting diets that focus on nutrient- and protein-rich food. To meet the excess demand, farmers need to maximize crop yield and effectively manage operations at scale, using precision farming technology to stay ahead.

Historically, farmers have relied on inherited knowledge, trial and error, and non-prescriptive agronomic advice to make decisions. Key decisions include what crops to plant, how much fertilizer to apply, how to control pests, and when to harvest. However, with an increasing demand for food and the need to maximize harvest yield, farmers need more information in addition to inherited knowledge. Innovative technologies like remote sensing, IoT, and robotics have the potential to help farmers move past legacy decision-making. Data-driven decisions fueled by near-real-time insights can enable farmers to close the gap on increased food demand.

Although farmers have traditionally collected data manually from their operations by recording equipment and yield data or taking notes of field observations, builders of agronomic data platforms on AWS help farmers and their trusted agronomic advisors use that data at scale. On small fields and operations, a farmer can more easily see the entire field and look for issues affecting the crop. However, frequently scouting every field on a large farm is not feasible, and successful risk mitigation requires an integrated agronomic data platform that can bring insights at scale. These platforms help farmers make sense of their data by integrating information from multiple sources for use in visualization and analytics applications. Geospatial data, including satellite imagery, soil data, weather, and topography data, are layered together with data collected by agricultural equipment during planting, nutrient application, and harvest operations. By unlocking insights through enhanced geospatial data analytics, advanced data visualizations, and automation of workflows via AWS technology, farmers can identify specific areas of their fields and crops that are experiencing an issue and take action to protect their crops and operations. These timely insights help farmers better work with their trusted agronomists to produce more, reduce their environmental footprint, improve their profitability, and keep their land productive for generations to come.

In this post, we look at how you can integrate the predictions generated by Amazon SageMaker geospatial capabilities into the user interface of an agronomic data platform. Furthermore, we discuss how software development teams are adding advanced machine learning (ML)-driven insights, including remote sensing algorithms, cloud masking (automatically detecting clouds within satellite imagery), and automated image processing pipelines, to their agronomic data platforms. Together, these additions help agronomists, software developers, ML engineers, data scientists, and remote sensing teams provide scalable, valuable decision-making support systems to farmers. This post also provides an example end-to-end notebook and GitHub repository that demonstrates SageMaker geospatial capabilities, including ML-based farm field segmentation and pre-trained geospatial models for agriculture.

Adding geospatial insights and predictions into agronomic data platforms

Established mathematical and agronomic models combined with satellite imagery enable visualization of the health and status of a crop by satellite image, pixel by pixel, over time. However, these established models require access to satellite imagery that is not obstructed by clouds or other atmospheric interference that reduces the quality of the image. Without identifying and removing clouds from each processed image, predictions and insights will have significant inaccuracies and agronomic data platforms will lose the trust of the farmer. Because agronomic data platform providers commonly serve customers comprising thousands of farm fields across varying geographies, agronomic data platforms require computer vision and an automated system to analyze, identify, and filter out clouds or other atmospheric issues within each satellite image before further processing or providing analytics to customers.

Developing, testing, and improving ML computer vision models that detect clouds and atmospheric issues in satellite imagery presents challenges for builders of agronomic data platforms. First, building data pipelines to ingest satellite imagery requires time, software development resources, and IT infrastructure. Each satellite imagery provider can differ greatly from each other. Satellites frequently collect imagery at different spatial resolutions; resolutions can range from many meters per pixel to very high-resolution imagery measured in centimeters per pixel. Additionally, each satellite may collect imagery with different multi-spectral bands. Some bands have been thoroughly tested and show strong correlation with plant development and health indicators, and other bands can be irrelevant for agriculture. Satellite constellations revisit the same spot on earth at different rates. Small constellations may revisit a field every week or more, and larger constellations may revisit the same area multiple times per day. These differences in satellite images and frequencies also lead to differences in API capabilities and features. Combined, these differences mean agronomic data platforms may need to maintain multiple data pipelines with complex ingestion methodologies.

Second, after the imagery is ingested and made available to remote sensing teams, data scientists, and agronomists, these teams must engage in a time-consuming process of accessing, processing, and labeling each region within each image as cloudy. With thousands of fields spread across varying geographies, and multiple satellite images per field, the labeling process can take a significant amount of time and must be continually trained to account for business expansion, new customer fields, or new sources of imagery.

Integrated access to Sentinel satellite imagery and data for ML

By using SageMaker geospatial capabilities for remote sensing ML model development, and by consuming satellite imagery from the public Amazon Simple Storage Service (Amazon S3) bucket conveniently available through AWS Data Exchange, builders of agronomic data platforms on AWS can achieve their goals faster and more easily. The bucket always has the most up-to-date satellite imagery from Sentinel-1 and Sentinel-2, because the Open Data Exchange and the Amazon Sustainability Data Initiative provide automated, built-in access to this imagery.

The following diagram illustrates this architecture.


SageMaker geospatial capabilities include built-in pre-trained deep neural network models such as land use classification and cloud masking, with an integrated catalog of geospatial data sources including satellite imagery, maps, and location data from AWS and third parties. With an integrated geospatial data catalog, SageMaker geospatial customers have easier access to satellite imagery and other geospatial datasets that remove the burden of developing complex data ingestion pipelines. This integrated data catalog can accelerate your own model building and the processing and enrichment of large-scale geospatial datasets with purpose-built operations such as time statistics, resampling, mosaicing, and reverse geocoding. The ability to easily ingest imagery from Amazon S3 and use SageMaker geospatial pre-trained ML models that automatically identify clouds and score each Sentinel-2 satellite image removes the need to engage remote sensing, agronomy, and data science teams to ingest, process, and manually label thousands of satellite images with cloudy regions.

SageMaker geospatial capabilities support the ability to define an area of interest (AOI) and a time of interest (TOI), search within the Open Data Exchange S3 bucket archive for images with a geospatial intersect that meets the request, and return true color images, Normalized Difference Vegetation Index (NDVI), cloud detection and scores, and land cover. NDVI is a common index used with satellite imagery to understand the health of crops by visualizing measurements of the amount of chlorophyll and photosynthetic activity via a newly processed and color-coded image.

Users of SageMaker geospatial capabilities can use the pre-built NDVI index or develop their own. SageMaker geospatial capabilities make it easier for data scientists and ML engineers to build, train, and deploy ML models faster and at scale using geospatial data and with less effort than before.
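As a reference for what an NDVI computation involves, the following is a minimal sketch using NumPy on already-loaded Sentinel-2 band arrays (band B04 is red, band B08 is near-infrared). Reading the bands from imagery files, scaling, and cloud masking are assumed to happen elsewhere:

import numpy as np

def compute_ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Compute NDVI = (NIR - Red) / (NIR + Red) for two same-shaped band arrays."""
    red = red.astype("float32")
    nir = nir.astype("float32")
    denominator = nir + red
    # Mark pixels where both bands are 0 (for example, nodata) as NaN
    return np.divide(
        nir - red,
        denominator,
        out=np.full_like(denominator, np.nan),
        where=denominator != 0,
    )

# Example with synthetic band values standing in for real Sentinel-2 pixels
red_band = np.random.randint(0, 3000, size=(256, 256))
nir_band = np.random.randint(0, 6000, size=(256, 256))
ndvi = compute_ndvi(red_band, nir_band)
print(f"NDVI range: {np.nanmin(ndvi):.2f} to {np.nanmax(ndvi):.2f}")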

Farmers and agronomists need fast access to insights in the field and at home

Promptly delivering processed imagery and insights to farmers and stakeholders is important for agribusinesses and decision-making in the field. Identifying areas of poor crop health across each field during critical windows of time allows the farmer to mitigate risks by applying fertilizers, herbicides, and pesticides where needed, and even to identify areas of potential crop insurance claims. It is common for agronomic data platforms to comprise a suite of applications, including web applications and mobile applications. These applications provide intuitive user interfaces that help farmers and their trusted stakeholders securely review each of their fields and images while at home, in the office, or standing in the field itself. These web and mobile applications, however, need to consume and quickly display processed imagery and agronomic insights via APIs.

Amazon API Gateway makes it easy for developers to create, publish, maintain, monitor, and secure RESTful and WebSocket APIs at scale. With API Gateway, API access and authorization is integrated with AWS Identity and Access Management (IAM), and native support for OIDC, OAuth2, and Amazon Cognito is also offered. Amazon Cognito is a cost-effective customer identity and access management (CIAM) service supporting a secure identity store with federation options that can scale to millions of users.

Raw, unprocessed satellite imagery can be very large, in some instances hundreds of megabytes or even gigabytes per image. Because many agricultural areas of the world have poor or no cellular connectivity, it’s important to process and serve imagery and insights in smaller formats and in ways that limit required bandwidth. Therefore, by using AWS Lambda to deploy a tile server, smaller sized GeoTIFFs, JPEGs, or other imagery formats can be returned based on the current map view being displayed to a user, as opposed to much larger file sizes and types that decrease performance. By combining a tile server deployed through Lambda functions with API Gateway to manage requests for web and mobile applications, farmers and their trusted stakeholders can consume imagery and geospatial data from one or hundreds of fields at once, with reduced latency, and achieve an optimal user experience.

SageMaker geospatial capabilities can be accessed via an intuitive user interface that enables you to gain easy access to a rich catalog of geospatial data, transform and enrich data, train or use purpose-built models, deploy models for predictions, and visualize and explore data on integrated maps and satellite images. To read more about the SageMaker geospatial user experience, refer to How Xarvio accelerated pipelines of spatial data for digital farming with Amazon SageMaker geospatial capabilities.

Agronomic data platforms provide several layers of data and insights at scale

The following example user interface demonstrates how a builder of agronomic data platforms may integrate insights delivered by SageMaker geospatial capabilities.


This example user interface depicts common geospatial data overlays consumed by farmers and agricultural stakeholders. Here, the consumer has selected three separate data overlays. The first is the underlying Sentinel-2 natural color satellite image, taken in October 2020 and made available via the integrated SageMaker geospatial data catalog. This image was filtered using the SageMaker geospatial pre-trained model that identifies cloud cover. The second data overlay is a set of field boundaries, depicted with a white outline. A field boundary is commonly a polygon of latitude and longitude coordinates that reflects the natural topography of a farm field, or an operational boundary differentiating between crop plans. The third data overlay is processed imagery in the form of the Normalized Difference Vegetation Index (NDVI). The NDVI imagery is overlaid on the respective field boundary, and an NDVI color classification chart is depicted on the left side of the page.

The following image depicts the results using a SageMaker pre-trained model that identifies cloud cover.


In this image, the model identifies clouds within the satellite image and applies a yellow mask over each cloud within the image. By removing masked pixels (clouds) from further image processing, downstream analytics and products have improved accuracy and provide value to farmers and their trusted advisors.

In areas of poor cellular coverage, reducing latency improves the user experience

To address the need for low latency when evaluating geospatial data and remote sensing imagery, you can use Amazon ElastiCache to cache processed images retrieved from tile requests made via Lambda. By storing the requested imagery into a cache memory, latency is further reduced and there is no need to re-process imagery requests. This can improve application performance and reduce pressure on databases. Because Amazon ElastiCache supports many configuration options for caching strategies, cross-region replication, and auto scaling, agronomic data platform providers can scale up quickly based upon application needs, and continue to achieve cost efficiency by paying for only what is needed.
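The following is a minimal cache-aside sketch, assuming a Redis-compatible ElastiCache endpoint and a hypothetical render_tile function standing in for the Lambda-backed tile rendering; the endpoint, key format, and TTL are illustrative assumptions:

import redis

# Assumed ElastiCache (Redis) endpoint; replace with your cluster's address
cache = redis.Redis(host="<ELASTICACHE-ENDPOINT>", port=6379)
TILE_TTL_SECONDS = 3600  # illustrative expiry for cached tiles

def render_tile(z: int, x: int, y: int) -> bytes:
    # Placeholder for the expensive tile rendering step; returns dummy bytes here
    return f"tile-{z}-{x}-{y}".encode()

def get_tile(z: int, x: int, y: int) -> bytes:
    key = f"tile:{z}:{x}:{y}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cache hit: serve without re-processing
    tile_bytes = render_tile(z, x, y)      # cache miss: render and store the tile
    cache.setex(key, TILE_TTL_SECONDS, tile_bytes)
    return tile_bytes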

Conclusion

This post focused on geospatial data processing, implementing ML-enabled remote sensing insights, and ways to streamline and simplify the development and enhancement of agronomic data platforms on AWS. It illustrated several methods and services that builders of agronomic data platforms on AWS services can use to achieve their goals, including SageMaker, Lambda, Amazon S3, Open Data Exchange, and ElastiCache.

To follow an end-to-end example notebook that demonstrates SageMaker geospatial capabilities, access the example notebook available in the following GitHub repository. You can review how to identify agricultural fields through ML segmentation models, or explore the preexisting SageMaker geospatial models and the bring your own model (BYOM) functionality on geospatial tasks such as land use and land cover classification. The end-to-end example notebook is discussed in detail in the companion post How Xarvio accelerated pipelines of spatial data for digital farming with Amazon SageMaker Geospatial.

Please contact us to learn more about how the agricultural industry is solving important problems related to global food supply, traceability, and sustainability initiatives by using the AWS Cloud.


About the authors

Will Conrad is the Head of Solutions for the Agriculture Industry at AWS. He is passionate about helping customers use technology to improve the livelihoods of farmers, the environmental impact of agriculture, and the consumer experience for people who eat food. In his spare time, he fixes things, plays golf, and takes orders from his four children.

Bishesh Adhikari is a Machine Learning Prototyping Architect at the AWS Prototyping team. He works with AWS customers to build solutions on various AI & Machine Learning use-cases to accelerate their journey to production. In his free time, he enjoys hiking, travelling, and spending time with family and friends.

Priyanka Mahankali is a Guidance Solutions Architect at AWS for more than 5 years building cross-industry solutions including technology for global agriculture customers. She is passionate about bringing cutting-edge use cases to the forefront and helping customers build strategic solutions on AWS.

Ron Osborne is AWS Global Technology Lead for Agriculture – WWSO and a Senior Solution Architect. Ron is focused on helping AWS agribusiness customers and partners develop and deploy secure, scalable, resilient, elastic, and cost-effective solutions. Ron is a cosmology enthusiast, an established innovator within ag-tech, and is passionate about positioning customers and partners for business transformation and sustainable success.


Separate lines of business or teams with multiple Amazon SageMaker domains

Separate lines of business or teams with multiple Amazon SageMaker domains

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) that enables data scientists and developers to perform every step of the ML workflow, from preparing data to building, training, tuning, and deploying models.

To access SageMaker Studio, Amazon SageMaker Canvas, or other Amazon ML environments like RStudio on Amazon SageMaker, you must first provision a SageMaker domain. A SageMaker domain includes an associated Amazon Elastic File System (Amazon EFS) volume; a list of authorized users; and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations.

Administrators can now provision multiple SageMaker domains in order to separate different lines of business or teams within a single AWS account. This creates a logical separation between the users, files storage, and configuration settings for various groups in your organization. As an example, your organization may want to separate your financial line of business from the sustainability research division, as shown in the following multi-domain console.


Creating multiple SageMaker domains also allows you to granularly set domain-level configurations such as VPC configurations in order to permit public internet access for some groups’ research, while enforcing that traffic goes through a specified VPC for business units with greater restriction.

Automated tagging

In addition to separating users, file storage, and domain configurations, administrators can also separate SageMaker resources that are created within their domain. By default, SageMaker now automatically tags new SageMaker resources such as training jobs, processing jobs, experiments, pipelines, and model registry entries with their respective sagemaker:domain-arn. SageMaker also tags the resource with the sagemaker:user-profile-arn or sagemaker:space-arn to designate the resource creation at an even more granular level.

Cost allocation

Administrators can use automated tagging to easily monitor costs associated with their line of business, teams, individual users, or individual business problems by using tools such as AWS Budgets and AWS Cost Explorer. As an example, an administrator can attach a cost allocation tag for the sagemaker:domain-arn tag.


This allows them to utilize Cost Explorer to visualize the notebook spend for a given domain.


Domain-level resource isolation

Administrators can attach AWS Identity and Access Management (IAM) policies that ensure a domain’s user can only create and open SageMaker resources that are originating from their respective domain. The following code is an example of such a policy:

{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Sid": "CreateRequireDomainTag",
            "Effect": "Allow",
            "Action":
            [
                "SageMaker:Create*",
                "SageMaker:Update*"
            ],
            "Resource": "*",
            "Condition":
            {
                "ForAllValues:StringEquals":
                {
                    "aws:TagKeys":
                    [
                        "sagemaker:domain-arn"
                    ]
                }
            }
        },
        {
            "Sid": "ResourceAccessRequireDomainTag",
            "Effect": "Allow",
            "Action":
            [
                "SageMaker:Update*",
                "SageMaker:Delete*",
                "SageMaker:Describe*"
            ],
            "Resource": "*",
            "Condition":
            {
                "StringEquals":
                {
                    "aws:ResourceTag/sagemaker:domain-arn": "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:domain/<DOMAIN_ID>"
                }
            }
        }
    ]
}

For more information, see Multiple domains overview.
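If you create resources outside of Studio (for example, from a script), you can attach the sagemaker:domain-arn tag yourself so the resource satisfies the ResourceAccessRequireDomainTag condition for later updates, describes, and deletes. The following is a minimal sketch using boto3 with a placeholder domain ARN and a hypothetical experiment name:

import boto3

sagemaker = boto3.client("sagemaker")
domain_arn = "arn:aws:sagemaker:<REGION>:<ACCOUNT_ID>:domain/<DOMAIN_ID>"

# Tag the resource with its domain ARN at creation time so the
# domain-scoped IAM conditions shown above apply to it
sagemaker.create_experiment(
    ExperimentName="churn-experiment",  # hypothetical experiment name
    Tags=[{"Key": "sagemaker:domain-arn", "Value": domain_arn}],
)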

Backfilling existing resources with domain tags

Since the launch of the multi-domain capability, new resources are automatically tagged with aws:ResourceTag/sagemaker:domain-arn. However, if you want to update existing resources to facilitate resource isolation, administrators can use the AddTags SageMaker API call in a script. The following example shows how to tag all existing experiments to a domain:

REGION=<REGION>
domain_arn="arn:aws:sagemaker:$REGION:<ACCOUNT_ID>:domain/<DOMAIN_ID>"

experiments=$(aws --region $REGION \
    sagemaker list-experiments)
for row in $(echo "${experiments}" | jq -r '.ExperimentSummaries[] | @base64'); do
    _jq() {
        echo ${row} | base64 --decode | jq -r ${1}
    }

    exp_name=$(_jq '.ExperimentName')
    exp_arn=$(_jq '.ExperimentArn')

    echo "Tagging resource name: $exp_name and arn: $exp_arn with sagemaker:domain-arn=$domain_arn"
    aws sagemaker \
        add-tags \
        --resource-arn "$exp_arn" \
        --tags "Key=sagemaker:domain-arn,Value=$domain_arn" \
        --region $REGION
    echo "Tagging done for: $exp_name"
    sleep 1
done

You can verify that any individual resource was correctly tagged with the following code sample:

aws sagemaker \
    list-tags \
    --resource-arn <SAGEMAKER-RESOURCE-ARN> \
    --region <REGION>

Solution overview

In this section, we outline how you can set up multiple SageMaker domains in your own AWS account. You can either use the AWS Command Line Interface (AWS CLI) or the SageMaker console. Refer to Onboard to Amazon SageMaker Domain for the most up-to-date instructions on creating a domain.

Create a domain using the AWS CLI

No API changes are required compared to the previous aws sagemaker create-domain CLI call, but it now supports --default-space-settings if you intend to use shared spaces in SageMaker Studio. For more information, see shared spaces in Amazon SageMaker Studio.

Create a new domain with your specified configurations using aws sagemaker create-domain, and then you’re ready to populate it with users.
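For reference, the following is a minimal sketch of the equivalent API call using boto3; the domain name, network values, and execution roles are placeholders, and --default-space-settings maps to the DefaultSpaceSettings parameter:

import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_domain(
    DomainName="sustainability-research",  # hypothetical domain name
    AuthMode="IAM",
    VpcId="<VPC-ID>",
    SubnetIds=["<SUBNET-ID-1>", "<SUBNET-ID-2>"],
    DefaultUserSettings={
        "ExecutionRole": "arn:aws:iam::<ACCOUNT-ID>:role/<EXECUTION-ROLE-NAME>"
    },
    # Only needed if you plan to use shared spaces in SageMaker Studio
    DefaultSpaceSettings={
        "ExecutionRole": "arn:aws:iam::<ACCOUNT-ID>:role/<EXECUTION-ROLE-NAME>"
    },
)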

Create a domain using the SageMaker console

On the updated SageMaker console, you can administer your domains via the new option called SageMaker Domains in the navigation pane.

Here you’ll be presented with the options to open existing domains, or create a new one using the graphical interface.


Conclusion

Utilizing multiple SageMaker domains provides flexibility to meet your organizational needs. Whether you need to isolate users and their business groups, or you want to run separate domains due to configuration differences, we encourage you to stand up multiple SageMaker domains within a single AWS account!


About the Authors

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor/maintainer and is the special interest group lead for TensorFlow Add-ons.

Arkaprava De is a Senior Software Engineer at AWS. He has been at Amazon for over 7 years and is currently working on improving the Amazon SageMaker Studio IDE experience. You can find him on LinkedIn.

Kunal Jha is a Senior Product Manager at AWS. He is focused on building Amazon SageMaker Studio as the IDE of choice for all ML development steps. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can find him on LinkedIn.

Han Zhang is a Senior Software Engineer at Amazon Web Services. She is part of the launch team for Amazon SageMaker Notebooks and Amazon SageMaker Studio, and has been focusing on building secure machine learning environments for customers. In her spare time, she enjoys hiking and skiing in the Pacific Northwest.


Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs

Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs

Amazon SageMaker Studio provides a fully managed solution for data scientists to interactively build, train, and deploy machine learning (ML) models. In addition to the interactive ML experience, data workers also seek solutions to run notebooks as ephemeral jobs without the need to refactor code as Python modules or learn DevOps tools and best practices to automate their deployment infrastructure. Some common use cases for doing this include:

  • Regularly running model inference to generate reports
  • Scaling up a feature engineering step after having tested in Studio against a subset of data on a small instance
  • Retraining and deploying models on some cadence
  • Analyzing your team’s Amazon SageMaker usage on a regular cadence

Previously, when data scientists wanted to take the code they built interactively on notebooks and run them as batch jobs, they were faced with a steep learning curve using Amazon SageMaker Pipelines, AWS Lambda, Amazon EventBridge, or other solutions that are difficult to set up, use, and manage.

With SageMaker notebook jobs, you can now run your notebooks as is or in a parameterized fashion with just a few simple clicks from the SageMaker Studio or SageMaker Studio Lab interface. You can run these notebooks on a schedule or immediately. There’s no need for the end-user to modify their existing notebook code. When the job is complete, you can view the populated notebook cells, including any visualizations!

In this post, we share how to operationalize your SageMaker Studio notebooks as scheduled notebook jobs.

Solution overview

The following diagram illustrates our solution architecture. We utilize the pre-installed SageMaker extension to run notebooks as a job immediately or on a schedule.

In the following sections, we walk through the steps to create a notebook, parameterize cells, customize additional options, and schedule your job. We also include a sample use case.

Prerequisites

To use SageMaker notebook jobs, you need to be running a JupyterLab 3 JupyterServer app within Studio. For more information on how to upgrade to JupyterLab 3, refer to View and update the JupyterLab version of an app from the console. Be sure to Shut down and Update SageMaker Studio in order to pick up the latest updates.

To define job definitions that run notebooks on a schedule, you may need to add additional permissions to your SageMaker execution role.

First, add a trust relationship to your SageMaker execution role that allows events.amazonaws.com to assume your role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "events.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Additionally, you may need to create and attach an inline policy to your execution role. The following policy is supplementary to the very permissive AmazonSageMakerFullAccess policy. For a complete and minimal set of permissions, see Install Policies and Permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "events:TagResource",
                "events:DeleteRule",
                "events:PutTargets",
                "events:DescribeRule",
                "events:PutRule",
                "events:RemoveTargets",
                "events:DisableRule",
                "events:EnableRule"
            ],
            "Resource": "*",
            "Condition": {
              "StringEquals": {
                "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
              }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringLike": {
                    "iam:PassedToService": "events.amazonaws.com"
                }
            }
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "sagemaker:ListTags",
            "Resource": "arn:aws:sagemaker:*:*:user-profile/*/*"
        }
    ]
}

Create a notebook job

To operationalize your notebook as a SageMaker notebook job, choose the Create a notebook job icon.

Alternatively, you can choose (right-click) your notebook on the file system and choose Create Notebook Job.

In the Create job section, simply choose the right instance type for your scheduled job based on your workload: standard instances, compute optimized instances, or accelerated computing instances that contain GPUs. You can choose any of the instances available for SageMaker training jobs. For the complete list of instances available, refer to Amazon SageMaker Pricing.

When a job is complete, you can view the output notebook file with its populated cells, as well as the underlying logs from the job runs.

Parameterize cells

When moving a notebook to a production workflow, it’s important to be able to reuse the same notebook with different sets of parameters for modularity. For example, you may want to parameterize the dataset location or the hyperparameters of your model so that you can reuse the same notebook for many distinct model trainings. SageMaker notebook jobs support this through cell tags. Simply choose the double gear icon in the right pane and choose Add Tag. Then label the tag as parameters.

By default, the notebook job run uses the parameter values specified in the notebook, but alternatively, you can modify these as a configuration for your notebook job.
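For example, a cell tagged parameters might contain nothing more than default values that the rest of the notebook reads. The following assumed sketch reuses the number_rf_estimators variable that appears in the churn example later in this post; the dataset location is a hypothetical parameter:

# Cell tagged "parameters": defaults used when no overrides are supplied.
# The notebook job configuration can override these values per run.
number_rf_estimators = 100                        # trees for the RandomForestClassifier
dataset_s3_uri = "s3://<BUCKET>/churn/latest/"    # hypothetical data location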

Configure additional options

When creating a notebook job, you can expand the Additional options section in order to customize your job definition. Studio will automatically detect the image or kernel you’re using in your notebook and pre-select it for you. Ensure that you have validated this selection.

You can also specify environment variables or startup scripts to customize your notebook run environment. For the full list of configurations, see Additional Options.

Schedule your job

To schedule your job, choose Run on a schedule and set an appropriate interval and time. Then you can choose the Notebook Jobs tab that is visible after choosing the home icon. After the notebook is loaded, choose the Notebook Job Definitions tab to pause or remove your schedule.

Example use case

For our example, we showcase an end-to-end ML workflow that prepares data from a ground truth source, trains a refreshed model from that time period, and then runs inference on the most recent data to generate actionable insights. In practice, you might run a complete end-to-end workflow, or simply operationalize one step of your workflow. You can schedule an AWS Glue interactive session for daily data preparation, or run a batch inference job that generates graphical results directly in your output notebook.

The full notebook for this example can be found in our SageMaker Examples GitHub repository. The use case assumes that we’re a telecommunications company that is looking to schedule a notebook that predicts probable customer churn based on a model trained with the most recent data we have available.

To start, we gather the most recently available customer data and perform some preprocessing on it:

import pandas as pd
from synthetic_data import generate_data

# Generate synthetic customer records: labeled data from the prior two weeks
# for training, and today's unlabeled data for inference
previous_two_weeks_data = generate_data(5000, label_known=True)
todays_data = generate_data(300, label_known=False)

# process_data is a preprocessing helper defined in the full example notebook
processed_prior_data = process_data(previous_two_weeks_data, label_known=True)
processed_todays_data = process_data(todays_data, label_known=False)

We train our refreshed model on this updated training data in order to make accurate predictions on todays_data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay

y = np.ravel(processed_prior_data[["Churn"]])
x = processed_prior_data.drop(["Churn"], axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

clf = RandomForestClassifier(n_estimators=int(number_rf_estimators), criterion="gini")
clf.fit(x_train, y_train)

Because we’re going to schedule this notebook as a daily report, we want to capture how well our refreshed model performed on our validation set so that we can be confident in its future predictions. The results in the following screenshot are from our scheduled inference report.

Lastly, we capture the predicted results for today’s data in a database so that actions can be taken based on the results of this model.
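A minimal sketch of that final step might look like the following, assuming processed_todays_data has the same feature columns the model was trained on; writing a CSV to Amazon S3 stands in for whichever database your downstream actions read from, and the bucket and key are hypothetical:

import io
import boto3

# Score today's customers with the refreshed model
todays_predictions = processed_todays_data.copy()
todays_predictions["churn_probability"] = clf.predict_proba(processed_todays_data)[:, 1]

# Persist the scored records so downstream systems can act on them
s3 = boto3.client("s3")
csv_buffer = io.StringIO()
todays_predictions.to_csv(csv_buffer, index=False)
s3.put_object(
    Bucket="<BUCKET>",
    Key="churn-predictions/daily-report.csv",
    Body=csv_buffer.getvalue(),
)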

After you understand the notebook, feel free to run it as an ephemeral job using the Run now option described earlier, or test out the scheduling functionality.

Clean up

If you followed along with our example, be sure to pause or delete your notebook job’s schedule to avoid incurring ongoing charges.

Conclusion

Bringing notebooks to production with SageMaker notebook jobs vastly simplifies the undifferentiated heavy lifting required by data workers. Whether you’re scheduling end-to-end ML workflows or a piece of the puzzle, we encourage you to put some notebooks in production using SageMaker Studio or SageMaker Studio Lab! To learn more, see Notebook-based Workflows.


About the authors

Sean Morgan is a Senior ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor/maintainer and is the special interest group lead for TensorFlow Add-ons.

Sumedha Swamy is a Principal Product Manager at Amazon Web Services. He leads SageMaker Studio team to build it into the IDE of choice for interactive data science and data engineering workflows. He has spent the past 15 years building customer-obsessed consumer and enterprise products using Machine Learning. In his free time he likes photographing the amazing geology of the American Southwest.

Edward Sun is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solution and simplifying the customer experience to integrate SageMaker Studio with popular technologies in data engineering and ML ecosystem. In his spare time, Edward is big fan of camping, hiking and fishing and enjoys the time spending with his family.


Talking to Robots in Real Time

Talking to Robots in Real Time

A grand vision in robot learning, going back to the SHRDLU experiments in the late 1960s, is that of helpful robots that inhabit human spaces and follow a wide variety of natural language commands. Over the last few years, there have been significant advances in the application of machine learning (ML) for instruction following, both in simulation and in real-world systems. Recent PaLM-SayCan work has produced robots that leverage language models to plan long-horizon behaviors and reason about abstract goals. Code as Policies has shown that code-generating language models combined with pre-trained perception systems can produce language-conditioned policies for zero-shot robot manipulation. Despite this progress, an important missing property of current “language in, actions out” robot learning systems is real-time interaction with humans.

Ideally, robots of the future would react in real time to any relevant task a user could describe in natural language. Particularly in open human environments, it may be important for end users to customize robot behavior as it is happening, offering quick corrections (“stop, move your arm up a bit”) or specifying constraints (“nudge that slowly to the right”). Furthermore, real-time language could make it easier for people and robots to collaborate on complex, long-horizon tasks, with people iteratively and interactively guiding robot manipulation with occasional language feedback.

The challenges of open-vocabulary language following. To be successfully guided through a long horizon task like “put all the blocks in a vertical line”, a robot must respond precisely to a wide variety of commands, including small corrective behaviors like “nudge the red circle right a bit”.

However, getting robots to follow open vocabulary language poses a significant challenge from an ML perspective. This is a setting with an inherently large number of tasks, including many small corrective behaviors. Existing multitask learning setups make use of curated imitation learning datasets or complex reinforcement learning (RL) reward functions to drive the learning of each task, and this significant per-task effort is difficult to scale beyond a small predefined set. Thus, a critical open question in the open vocabulary setting is: how can we scale the collection of robot data to include not dozens, but hundreds of thousands of behaviors in an environment, and how can we connect all these behaviors to the natural language an end user might actually provide?

In Interactive Language, we present a large scale imitation learning framework for producing real-time, open vocabulary language-conditionable robots. After training with our approach, we find that an individual policy is capable of addressing over 87,000 unique instructions (an order of magnitude larger than prior works), with an estimated average success rate of 93.5%. We are also excited to announce the release of Language-Table, the largest available language-annotated robot dataset, which we hope will drive further research focused on real-time language-controllable robots.

Guiding robots with real time language.

Real Time Language-Controllable Robots

Key to our approach is a scalable recipe for creating large, diverse language-conditioned robot demonstration datasets. Unlike prior setups that define all the skills up front and then collect curated demonstrations for each skill, we continuously collect data across multiple robots without scene resets or any low-level skill segmentation. All data, including failure data (e.g., knocking blocks off a table), goes through a hindsight language relabeling process to be paired with text. Here, annotators watch long robot videos to identify as many behaviors as possible, marking when each began and ended, and use freeform natural language to describe each segment. Importantly, in contrast to prior instruction following setups, all skills used for training emerge bottom up from the data itself rather than being determined upfront by researchers.

Our learning approach and architecture are intentionally straightforward. Our robot policy is a cross-attention transformer, mapping 5hz video and text to 5hz robot actions, using a standard supervised learning behavioral cloning objective with no auxiliary losses. At test time, new spoken commands can be sent to the policy (via speech-to-text) at any time up to 5hz.

Interactive Language: an imitation learning system for producing real time language-controllable robots.
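For readers who want a concrete picture of the training objective, the following is a minimal conceptual sketch in PyTorch of a language-conditioned behavioral cloning step, with cross-attention between a visual token and text tokens and a discretized action head. It uses toy dimensions and random tensors; it is not the authors' implementation, which relies on pretrained encoders, video context, and a much larger transformer.

import torch
import torch.nn as nn

# Toy dimensions; the real system uses pretrained encoders and a larger model
VOCAB, D_MODEL, N_ACTION_BINS = 1000, 128, 64

class LanguageConditionedPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, D_MODEL)
        self.image_proj = nn.Linear(3 * 64 * 64, D_MODEL)   # stand-in for a vision encoder
        self.cross_attn = nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True)
        self.action_head = nn.Linear(D_MODEL, 2 * N_ACTION_BINS)  # x/y deltas as discrete bins

    def forward(self, image, text_tokens):
        img_token = self.image_proj(image.flatten(1)).unsqueeze(1)   # (B, 1, D)
        text = self.text_embed(text_tokens)                          # (B, T, D)
        fused, _ = self.cross_attn(query=img_token, key=text, value=text)
        return self.action_head(fused.squeeze(1)).view(-1, 2, N_ACTION_BINS)

policy = LanguageConditionedPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One behavioral cloning step on a random toy batch of image/command/action triples
images = torch.rand(8, 3, 64, 64)
commands = torch.randint(0, VOCAB, (8, 12))
expert_actions = torch.randint(0, N_ACTION_BINS, (8, 2))   # discretized x/y action bins

logits = policy(images, commands)                          # (B, 2, N_ACTION_BINS)
loss = loss_fn(logits.view(-1, N_ACTION_BINS), expert_actions.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()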

Open Source Release: Language-Table Dataset and Benchmark

This annotation process allowed us to collect the Language-Table dataset, which contains over 440k real and 180k simulated demonstrations of the robot performing a language command, along with the sequence of actions the robot took during the demonstration. This is the largest language-conditioned robot demonstration dataset of its kind, by an order of magnitude. Language-Table comes with a simulated imitation learning benchmark that we use to perform model selection, which can be used to evaluate new instruction following architectures or approaches.

Dataset                          # Trajectories (k)    # Unique (k)
Episodic Demonstrations
BC-Z                             25                    0.1
SayCan                           68                    0.5
Playhouse                        1,097                 779
Hindsight Language Labeling
BLOCKS                           30                    n/a
LangLFP                          10                    n/a
LOREL                            6                     1.7
CALVIN                           20                    0.4
Language-Table (real + sim)      623 (442+181)         206 (127+79)

We compare Language-Table to existing robot datasets, highlighting proportions of simulated (red) or real (blue) robot data, the number of trajectories collected, and the number of unique language describable tasks.

Learned Real Time Language Behaviors

Examples of short horizon instructions the robot is capable of following, sampled randomly from the full set of over 87,000.

Short-Horizon Instruction Success
(87,000 more…)
push the blue triangle to the top left corner    80.0%
separate the red star and red circle 100.0%
nudge the yellow heart a bit right 80.0%
place the red star above the blue cube 90.0%
point your arm at the blue triangle 100.0%
push the group of blocks left a bit 100.0%
Average over 87k, CI 95% 93.5% +- 3.42%

95% Confidence interval (CI) on the average success of an individual Interactive Language policy over 87,000 unique natural language instructions.

We find that interesting new capabilities arise when robots are able to follow real time language. We show that users can walk robots through complex long-horizon sequences using only natural language to solve for goals that require multiple minutes of precise, coordinated control (e.g., “make a smiley face out of the blocks with green eyes” or “place all the blocks in a vertical line”). Because the robot is trained to follow open vocabulary language, we see it can react to a diverse set of verbal corrections (e.g., “nudge the red star slightly right”) that might otherwise be difficult to enumerate up front.

Examples of long horizon goals reached under real time human language guidance.

Finally, we see that real time language allows for new modes of robot data collection. For example, a single human operator can control four robots simultaneously using only spoken language. This has the potential to scale up the collection of robot data in the future without requiring undivided human attention for each robot.

One operator controlling multiple robots at once with spoken language.

Conclusion

While currently limited to a tabletop with a fixed set of objects, Interactive Language shows initial evidence that large scale imitation learning can indeed produce real time interactable robots that follow freeform end user commands. We open source Language-Table, the largest language conditioned real-world robot demonstration dataset of its kind and an associated simulated benchmark, to spur progress in real time language control of physical robots. We believe the utility of this dataset may not only be limited to robot control, but may provide an interesting starting point for studying language- and action-conditioned video prediction, robot video-conditioned language modeling, or a host of other interesting active questions in the broader ML context. See our paper and GitHub page to learn more.

Acknowledgements

We would like to thank everyone who supported this research. This includes robot teleoperators: Alex Luong, Armando Reyes, Elio Prado, Eric Tran, Gavin Gonzalez, Jodexty Therlonge, Joel Magpantay, Rochelle Dela Cruz, Samuel Wan, Sarah Nguyen, Scott Lehrer, Norine Rosales, Tran Pham, Kyle Gajadhar, Reece Mungal, and Nikauleene Andrews; robot hardware support and teleoperation coordination: Sean Snyder, Spencer Goodrich, Cameron Burns, Jorge Aldaco, Jonathan Vela; data operations and infrastructure: Muqthar Mohammad, Mitta Kumar, Arnab Bose, Wayne Gramlich; and the many who helped provide language labeling of the datasets. We would also like to thank Pierre Sermanet, Debidatta Dwibedi, Michael Ryoo, Brian Ichter and Vincent Vanhoucke for their invaluable advice and support.

Read More

How xarvio Digital Farming Solutions accelerates its development with Amazon SageMaker geospatial capabilities

This is a guest post co-written by Julian Blau, Data Scientist at xarvio Digital Farming Solutions, BASF Digital Farming GmbH, and Antonio Rodriguez, AI/ML Specialist Solutions Architect at AWS.

xarvio Digital Farming Solutions is a brand from BASF Digital Farming GmbH, which is part of BASF Agricultural Solutions division. xarvio Digital Farming Solutions offers precision digital farming products to help farmers optimize crop production. Available globally, xarvio products use machine learning (ML), image recognition technology, and advanced crop and disease models, in combination with data from satellites and weather station devices, to deliver accurate and timely agronomic recommendations to manage the needs of individual fields. xarvio products are tailored to local farming conditions, can monitor growth stages, and recognize diseases and pests. They increase efficiency, save time, reduce risks, and provide higher reliability for planning and decision-making—all while contributing to sustainable agriculture.

We work with different geospatial data, including satellite imagery of the areas where our users’ fields are located, for some of our use cases. Therefore, we use and process hundreds of large image files daily. Initially, we had to invest a lot of manual work and effort to ingest, process, and analyze this data using third-party tools, open-source libraries, or general-purpose cloud services. In some instances, this could take up to 2 months for us to build the pipelines for each specific project. Now, by utilizing the geospatial capabilities of Amazon SageMaker, we have reduced this time to just 1–2 weeks.

This time-saving is the result of automating our geospatial data pipelines to deliver use cases more efficiently, along with using built-in reusable components to speed up and improve similar projects in other geographical areas, while applying the same proven steps to other use cases based on similar data.

In this post, we go through an example use case to describe some of the techniques we commonly use, and show how implementing these using SageMaker geospatial functionalities in combination with other SageMaker features delivers measurable benefits. We also include code examples so you can adapt these to your own specific use cases.

Overview of solution

A typical remote sensing project for developing new solutions requires a step-by-step analysis of imagery taken by optical satellites such as Sentinel or Landsat, in combination with other data, including weather forecasts or specific field properties. The satellite images provide us with valuable information used in our digital farming solutions to help our users accomplish various tasks:

  • Detecting diseases early in their fields
  • Planning the right nutrition and treatments to be applied
  • Getting insights on weather and water for planning irrigation
  • Predicting crop yield
  • Performing other crop management tasks

To achieve these goals, our analyses typically require preprocessing of the satellite images with different techniques that are common in the geospatial domain.

To demonstrate the capabilities of SageMaker geospatial, we experimented with identifying agricultural fields through ML segmentation models. Additionally, we explored the preexisting SageMaker geospatial models and the bring your own model (BYOM) functionality on geospatial tasks such as land use and land cover classification or crop classification, which often require panoptic or semantic segmentation techniques as additional steps in the process.

In the following sections, we go through some examples of how to perform these steps with SageMaker geospatial capabilities. You can also follow these in the end-to-end example notebook available in the following GitHub repository.

As previously mentioned, we selected the land cover classification use case, which consists of identifying the type of physical coverage present in a given geographical area of the earth’s surface, organized into a set of classes including vegetation, water, or snow. This high-resolution classification allows us to detect the location of the fields and their surroundings with high accuracy, and can later be chained with other analyses such as change detection in crop classification.

Client setup

First, let’s assume we have users with crops being cultivated in a given geographical area that we can identify within a polygon of geospatial coordinates. For this post, we define an example area over Germany. We can also define a given time range, for example in the first months of 2022. See the following code:

### Coordinates for the polygon of your area of interest...
coordinates = [
    [9.181602157004177, 53.14038825707946],
    [9.181602157004177, 52.30629767547948],
    [10.587520893823973, 52.30629767547948],
    [10.587520893823973, 53.14038825707946],
    [9.181602157004177, 53.14038825707946],
]
### Time-range of interest...
time_start = "2022-01-01T12:00:00Z"
time_end = "2022-05-01T12:00:00Z"

In our example, we work with the SageMaker geospatial SDK through programmatic or code interaction, because we’re interested in building code pipelines that can be automated with the different steps required in our process. Note that you could also work with a UI through the graphical extensions provided with SageMaker geospatial in Amazon SageMaker Studio if you prefer that approach, as shown in the following screenshots. To access the geospatial Studio UI, open the SageMaker Studio Launcher and choose Manage Geospatial resources. You can find more details in Get Started with Amazon SageMaker Geospatial Capabilities.

Geospatial UI launcher

Geospatial UI main

Geospatial UI list of jobs

Here you can graphically create, monitor, and visualize the results of the Earth Observation jobs (EOJs) that you run with SageMaker geospatial features.

Back to our example, the first step for interacting with the SageMaker geospatial SDK is to set up the client. We can do this by establishing a session with the botocore library:

import botocore.session

region = 'us-east-1'  # Replace with your AWS Region
session = botocore.session.get_session()
gsClient = session.create_client(
    service_name='sagemaker-geospatial',
    region_name=region)

From this point on, we can use the client for running any EOJs of interest.

Obtaining data

For this use case, we start by collecting satellite imagery for our given geographical area. Depending on the location of interest, there might be more or less frequent coverage by the available satellites, which have their imagery organized in what is usually referred to as raster collections.

With the geospatial capabilities of SageMaker, you have direct access to high-quality data sources for obtaining the geospatial data directly, including those from AWS Data Exchange and the Registry of Open Data on AWS, among others. We can run the following command to list the raster collections already provided by SageMaker:

list_raster_data_collections_resp = gsClient.list_raster_data_collections()

This returns the details for the different raster collections available, including the Landsat C2L2 Surface Reflectance (SR), the Landsat C2L2 Surface Temperature (ST), and the Sentinel 2A & 2B. Conveniently, the Level 2A imagery is already optimized into Cloud-Optimized GeoTIFFs (COGs). See the following excerpt of the response:

…
{'Name': 'Sentinel 2 L2A COGs',
  'Arn': 'arn:aws:sagemaker-geospatial:us-west-2:378778860802:raster-data-collection/public/nmqj48dcu3g7ayw8',
  'Type': 'PUBLIC',
  'Description': 'Sentinel-2a and Sentinel-2b imagery, processed to Level 2A (Surface Reflectance) and converted to Cloud-Optimized GeoTIFFs'
…

Let’s take this last one for our example, by setting our data_collection_arn parameter to the Sentinel 2 L2A COGs’ collection ARN.
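
For reference, setting this parameter could look like the following minimal snippet, using the collection ARN returned in the listing above:

data_collection_arn = (
    'arn:aws:sagemaker-geospatial:us-west-2:378778860802:'
    'raster-data-collection/public/nmqj48dcu3g7ayw8'
)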

We can also search the available imagery for a given geographical location by passing the coordinates of a polygon we defined as our area of interest (AOI). This allows you to visualize the image tiles available that cover the polygon you submit for the specified AOI, including the Amazon Simple Storage Service (Amazon S3) URIs for these images. Note that satellite imagery is typically provided in different bands according to the wavelength of the observation; we discuss this more later in the post.

response = gsClient.search_raster_data_collection(**eoj_input_config, Arn=data_collection_arn)

The preceding code returns the S3 URIs for the different image tiles available, which you can visualize directly with any library compatible with GeoTIFFs, such as rasterio. For example, let’s visualize two of the True Color Image (TCI) tiles, as sketched after the following response excerpt.

…
'visual': {'Href': 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/32/U/NC/2022/3/S2A_32UNC_20220325_0_L2A/TCI.tif'},
…
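
As a minimal sketch (not part of the original post), one of these tiles could be rendered with rasterio and matplotlib as follows; the URL is the Href returned in the response above:

import rasterio
from rasterio.plot import show
import matplotlib.pyplot as plt

tci_url = ('https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/'
           '32/U/NC/2022/3/S2A_32UNC_20220325_0_L2A/TCI.tif')

# Open the Cloud-Optimized GeoTIFF over HTTPS and plot it
with rasterio.open(tci_url) as src:
    fig, ax = plt.subplots(figsize=(8, 8))
    show(src, ax=ax, title='True Color Image (TCI) tile')
    plt.show()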

True Color Image 1   True Color Image 2

Processing techniques

Some of the most common preprocessing techniques that we apply include cloud removal, geo mosaic, temporal statistics, band math, and stacking. All of these processes can now be done directly through EOJs in SageMaker, without the need for manual coding or complex and expensive third-party tools. This makes building our data processing pipelines 50% faster. With SageMaker geospatial capabilities, we can run these processes over different input types. For example:

  • Directly run a query for any of the raster collections included with the service through the RasterDataCollectionQuery parameter
  • Pass imagery stored in Amazon S3 as an input through the DataSourceConfig parameter
  • Simply chain the results of a previous EOJ through the PreviousEarthObservationJobArn parameter

This flexibility allows you to build any kind of processing pipeline you need.
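
For instance, a minimal sketch (an assumption about the request shape, where previous_eoj_arn is a hypothetical ARN returned by an earlier job) of chaining a previous EOJ as input could look like this:

# Chain the output of a previous EOJ as the input of a new one
chained_input_config = {
    'PreviousEarthObservationJobArn': previous_eoj_arn
}

response = gsClient.start_earth_observation_job(
    Name='chained-job',
    ExecutionRoleArn=role,
    InputConfig=chained_input_config,
    JobConfig=eoj_config,  # any supported job configuration, such as the examples that follow
)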

The following diagram illustrates the processes we cover in our example.

Geospatial Processing tasks

In our example, we use a raster data collection query as input, for which we pass the coordinates of our AOI and time range of interest. We also specify a percentage of maximum cloud coverage of 2%, because we want clear and noise-free observations of our geographical area. See the following code:

eoj_input_config = {
    "RasterDataCollectionQuery": {
        "RasterDataCollectionArn": data_collection_arn,
        "AreaOfInterest": {
            "AreaOfInterestGeometry": {"PolygonGeometry": {"Coordinates": [coordinates]}}
        },
        "TimeRangeFilter": {"StartTime": time_start, "EndTime": time_end},
        "PropertyFilters": {
            "Properties": [
                {"Property": {"EoCloudCover": {"LowerBound": 0, "UpperBound": 2}}}
            ]
        },
    }
}

For more information on supported query syntax, refer to Create an Earth Observation Job.

Cloud gap removal

Satellite observations are often less useful due to high cloud coverage. Cloud gap filling or cloud removal is the process of replacing the cloudy pixels from the images, which can be done with different methods to prepare the data for further processing steps.

With SageMaker geospatial capabilities, we can achieve this by specifying a CloudRemovalConfig parameter in the configuration of our job.

eoj_config =  {
    'CloudRemovalConfig': {
        'AlgorithmName': 'INTERPOLATION',
        'InterpolationValue': '-9999'
    }
}

Note that we’re using an interpolation algorithm with a fixed value in our example, but other configurations are supported, as explained in the Create an Earth Observation Job documentation. Interpolation estimates a replacement value for the cloudy pixels based on the surrounding pixels.

We can now run our EOJ with our input and job configurations:

response = gsClient.start_earth_observation_job(
    Name =  'cloudremovaljob',
    ExecutionRoleArn = role,
    InputConfig = eoj_input_config,
    JobConfig = eoj_config,
)

This job takes a few minutes to complete depending on the input area and processing parameters.
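
While waiting, a minimal sketch (an assumption based on the sagemaker-geospatial client, not code from the original post) for polling the job status could look like this:

import time

cr_eoj_arn = response['Arn']  # ARN of the cloud removal EOJ started above

while True:
    status = gsClient.get_earth_observation_job(Arn=cr_eoj_arn)['Status']
    print(f'EOJ status: {status}')
    if status in ('COMPLETED', 'FAILED'):
        break
    time.sleep(30)  # check again in 30 seconds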

When it’s complete, the results of the EOJ are stored in a service-owned location, from where we can either export the results to Amazon S3, or chain these as input for another EOJ. In our example, we export the results to Amazon S3 by running the following code:

response = gsClient.export_earth_observation_job(
    Arn = cr_eoj_arn,
    ExecutionRoleArn = role,
    OutputConfig = {
        'S3Data': {
            'S3Uri': f's3://{bucket}/{prefix}/cloud_removal/',
            'KmsKeyId': ''
        }
    }
)

Now we’re able to visualize the resulting imagery stored in our specified Amazon S3 location for the individual spectral bands. For example, let’s inspect two of the blue band images returned.

Alternatively, you can also check the results of the EOJ graphically by using the geospatial extensions available in Studio, as shown in the following screenshots.

Cloud Removal UI 1   Cloud Removal UI 2

Temporal statistics

Because satellites continuously orbit Earth, the images for a given geographical area of interest are taken at specific time frames with a specific temporal frequency, such as daily, every 5 days, or every 2 weeks, depending on the satellite. The temporal statistics process enables us to combine different observations taken at different times to produce an aggregated view, such as a yearly mean, or the mean of all observations in a specific time range, for the given area.

With SageMaker geospatial capabilities, we can do this by setting the TemporalStatisticsConfig parameter. In our example, we obtain the yearly mean aggregation for the Near Infrared (NIR) band, because this band can reveal vegetation density differences below the top of the canopies:

eoj_config =  {
    'TemporalStatisticsConfig': {
        'GroupBy': 'YEARLY',
        'Statistics': ['MEAN'],
        'TargetBands': ['nir']
    }
}

After a few minutes running an EOJ with this config, we can export the results to Amazon S3 to obtain imagery like the following examples, in which we can observe the different vegetation densities represented with different color intensities. Note the EOJ can produce multiple images as tiles, depending on the satellite data available for the time range and coordinates specified.

Temporal Statistics 1   Temporal Statistics 2

Band math

Earth observation satellites are designed to detect light in different wavelengths, some of which are invisible to the human eye. Each range contains specific bands of the light spectrum at different wavelengths which, combined arithmetically, can produce images with rich information about characteristics of the field, such as vegetation health, temperature, or the presence of clouds, among many others. This is done in a process commonly called band math or band arithmetic.

With SageMaker geospatial capabilities, we can run this by setting the BandMathConfig parameter. For example, let’s obtain the moisture index images by running the following code:

eoj_config =  {
    'BandMathConfig': {
        'CustomIndices': {
            'Operations': [
                {
                    'Name': 'moisture',
                    'Equation': '(B8A - B11) / (B8A + B11)'
                }
            ]
        }
    }
}

After a few minutes running an EOJ with this config, we can export the results and obtain images, such as the following two examples.

Moisture index 1   Moisture index 2   Moisture index legend

Stacking

Similar to band math, the process of combining bands together to produce composite images from the original bands is called stacking. For example, we could stack the red, blue, and green light bands of a satellite image to produce the true color image of the AOI.

With SageMaker geospatial capabilities, we can do this by setting the StackConfig parameter. Let’s stack the RGB bands as per the previous example with the following command:

eoj_config =  {
    'StackConfig': {
        'OutputResolution': {
            'Predefined': 'HIGHEST'
        },
        'TargetBands': ['red', 'green', 'blue']
    }
}

After a few minutes running an EOJ with this config, we can export the results and obtain images.

Stacking TCI 1   Stacking TCI 2

Semantic segmentation models

As part of our work, we commonly use ML models to run inferences over the preprocessed imagery, such as detecting cloudy areas or classifying the type of land in each area of the images.

With SageMaker geospatial capabilities, you can do this by relying on the built-in segmentation models.

For our example, let’s use the land cover segmentation model by specifying the LandCoverSegmentationConfig parameter. This runs inferences on the input by using the built-in model, without the need to train or host any infrastructure in SageMaker:

response = gsClient.start_earth_observation_job(
    Name =  'landcovermodeljob',
    ExecutionRoleArn = role,
    InputConfig = eoj_input_config,
    JobConfig = {
        'LandCoverSegmentationConfig': {},
    },
)

After a few minutes running a job with this config, we can export the results and obtain images.

Land Cover 1   Land Cover 2   Land Cover 3   Land Cover 4

In the preceding examples, each pixel in the images corresponds to a land type class, as shown in the following legend.

Land Cover legend

This allows us to directly identify the specific types of areas in the scene such as vegetation or water, providing valuable insights for additional analyses.
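
As a simple follow-up, the following minimal sketch (not part of the original post; the file name is hypothetical) computes the share of the scene covered by each class from an exported segmentation mask:

import numpy as np
import rasterio

# 'landcover_mask.tif' is a hypothetical local copy of one of the exported segmentation masks
with rasterio.open('landcover_mask.tif') as src:
    mask = src.read(1)  # each pixel value is a land cover class ID

classes, counts = np.unique(mask, return_counts=True)
for cls, cnt in zip(classes, counts):
    print(f'class {cls}: {cnt / mask.size:.1%} of the scene')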

Bring your own model with SageMaker

If the state-of-the-art geospatial models provided with SageMaker aren’t enough for our use case, we can also chain the results of any of the preprocessing steps shown so far with any custom model onboarded to SageMaker for inference, as explained in this SageMaker Script Mode example. We can do this with any of the inference modes supported in SageMaker, including synchronous with real-time SageMaker endpoints, asynchronous with SageMaker asynchronous endpoints, batch or offline with SageMaker batch transform, and serverless with SageMaker serverless inference. You can check further details about these modes in the Deploy Models for Inference documentation. The following diagram illustrates the workflow at a high level.

Inference flow options

For our example, let’s assume we have onboarded two models for performing a land cover classification and crop type classification.

We just have to point towards our trained model artifact, in our example a PyTorch model, similar to the following code:

import sagemaker.async_inference  # needed for the AsyncInferenceConfig used below
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    name=model_name, ### Set a model name
    model_data=MODEL_S3_PATH, ### Location of the custom model in S3
    role=role,
    entry_point='inference.py', ### Your inference entry-point script
    source_dir='code', ### Folder with any dependencies
    image_uri=image_uri, ### URI for your AWS DLC or custom container
    env={
        'TS_MAX_REQUEST_SIZE': '100000000',
        'TS_MAX_RESPONSE_SIZE': '100000000',
        'TS_DEFAULT_RESPONSE_TIMEOUT': '1000',
    }, ### Optional – Set environment variables for max size and timeout
)

predictor = model.deploy(
    initial_instance_count = 1, ### Your number of instances
    instance_type = 'ml.g4dn.8xlarge', ### Your instance type
    async_inference_config=sagemaker.async_inference.AsyncInferenceConfig(
        output_path=f"s3://{bucket}/{prefix}/output",
        max_concurrent_invocations_per_instance=2,
    ), ### Optional – Async config if using SageMaker Async Endpoints
)

predictor.predict(data) ### Your images for inference

This allows you to obtain the resulting images after inference, depending on the model you’re using.

In our example, when running a custom land cover segmentation, the model produces images similar to the following, where we compare the input and prediction images with its corresponding legend.

Land Cover Segmentation 1   Land Cover Segmentation 2   Land Cover Segmentation legend

The following is another example of a crop classification model, where we show a comparison of the original image vs. the resulting panoptic and semantic segmentation results, with their corresponding legend.

Crop Classification

Automating geospatial pipelines

Finally, we can also automate the previous steps by building geospatial data processing and inference pipelines with Amazon SageMaker Pipelines. We simply chain each required preprocessing step through the use of Lambda Steps and Callback Steps in Pipelines. For example, you could add a final inference step using a Transform Step, or run an EOJ with one of the built-in semantic segmentation models in SageMaker geospatial through another combination of Lambda Steps and Callback Steps.

Note we’re using Lambda Steps and Callback Steps in Pipelines because the EOJs are asynchronous, so this type of step allows us to monitor the run of the processing job and resume the pipeline when it’s complete through messages in an Amazon Simple Queue Service (Amazon SQS) queue.
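
The following minimal sketch (an illustration of the pattern, not the exact pipeline from our notebook) shows how a Lambda step that starts an EOJ can be paired with a Callback step that waits for a completion message in SQS; the Lambda function ARN, queue URL, and parameter names are hypothetical:

from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep, LambdaOutput, LambdaOutputTypeEnum
from sagemaker.workflow.callback_step import CallbackStep, CallbackOutput, CallbackOutputTypeEnum
from sagemaker.workflow.pipeline import Pipeline

# Lambda step that calls start_earth_observation_job and returns the EOJ ARN
start_eoj_step = LambdaStep(
    name='StartCloudRemovalEOJ',
    lambda_func=Lambda(function_arn='arn:aws:lambda:<region>:<account>:function:start-eoj'),  # hypothetical
    inputs={'eoj_name': 'cloudremovaljob'},
    outputs=[LambdaOutput(output_name='eoj_arn', output_type=LambdaOutputTypeEnum.String)],
)

# Callback step that pauses the pipeline until a completion message arrives in the SQS queue
wait_eoj_step = CallbackStep(
    name='WaitForEOJ',
    sqs_queue_url='https://sqs.<region>.amazonaws.com/<account>/eoj-callbacks',  # hypothetical
    inputs={'eoj_arn': start_eoj_step.properties.Outputs['eoj_arn']},
    outputs=[CallbackOutput(output_name='eoj_status', output_type=CallbackOutputTypeEnum.String)],
)

pipeline = Pipeline(name='GeospatialPreprocessing', steps=[start_eoj_step, wait_eoj_step])

With a Callback Step, Pipelines places a message containing a callback token in the queue; an external process monitoring the EOJ (for example, a Lambda function) reports success or failure for that token when the job completes, and the pipeline resumes.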

Geospatial Pipeline

You can check the notebook in the GitHub repository for a detailed example of this code.

Now we can visualize the diagram of our geospatial pipeline through Studio and monitor the runs in Pipelines, as shown in the following screenshot.

Geospatial Pipeline UI

Conclusion

In this post, we presented a summary of the processes we implemented with SageMaker geospatial capabilities for building geospatial data pipelines for our advanced products from xarvio Digital Farming Solutions. Using SageMaker geospatial increased the efficiency of our geospatial work by more than 50%, through the use of pre-built APIs that accelerate and simplify our preprocessing and modeling steps for ML.

As a next step, we’re onboarding more models from our catalog to SageMaker to continue the automation of our solution pipelines, and will continue utilizing more geospatial features of SageMaker as the service evolves.

We encourage you to try SageMaker geospatial capabilities by adapting the end-to-end example notebook provided in this post, and learning more about the service in What is Amazon SageMaker Geospatial Capabilities?.


About the Authors

Julian BlauJulian Blau is a Data Scientist at BASF Digital Farming GmbH, located in Cologne, Germany. He develops digital solutions for agriculture, addressing the needs of BASF’s global customer base by using geospatial data and machine learning. Outside work, he enjoys traveling and being outdoors with friends and family.

Antonio RodriguezAntonio Rodriguez is an Artificial Intelligence and Machine Learning Specialist Solutions Architect in Amazon Web Services, based out of Spain. He helps companies of all sizes solve their challenges through innovation, and creates new business opportunities with AWS Cloud and AI/ML services. Apart from work, he loves to spend time with his family and play sports with his friends.

Read More

Unfolding the Universe using TensorFlow

A guest post by Roberta Duarte, IAG/USP

Astronomy is the science of trying to answer the Universe’s biggest mysteries. How did the Universe begin? How will it end? What is a black hole? What are galaxies and how did they form? Is life a common piece in the Universe’s puzzle? There are so many questions without answers. Machine learning can be a vital tool to answer those questions and help us unfold the Universe.

Astronomy is one of the oldest sciences. The reason is simple: we just have to look at the sky and start questioning what we are seeing. It is what astronomers have been doing for centuries. Galileo discovered a series of celestial objects after he observed the sky through the lenses of his new invention: the telescope. A few years later, Isaac Newton used Galileo’s contributions to find the Law of Universal Gravitation. With Newton’s results, we could better understand not only how the Sun affects Earth and other planets but also why we are bound to Earth’s surface. Centuries later, Edwin Hubble found that galaxies are moving away from us and that farther galaxies are moving faster than closer ones. Hubble’s findings showed that the Universe is expanding and that its expansion is accelerating. These are a few examples of how studying the sky can give us some answers about the Universe.

What all of them have in common is that they record data obtained from observations. The data can be a star’s luminosity, planets’ positions, or even galaxies’ distances. With technology improving the observations, more data is available to help us understand the Universe around us. Recently, the most advanced telescope, James Webb Space Telescope (JWST), was launched to study the early Universe in infrared. JWST is expected to transmit 57.2 gigabytes per day of data containing information about early galaxies, exoplanets, and the Universe’s structure.

While this is excellent news for astronomers, it also comes with a high cost. A high computational cost. In 2020, Nature published an article about big data and how Astronomy is now in an era of big data. JWST is one example of how those powerful telescopes are producing huge amounts of data every day. Vera Rubin Observatory is expected to collect 20 terabytes per night. Large Arrays collect petabytes of data every year, and next-generation Large Arrays will collect hundreds of petabytes per year. In 2019, several Astro White Papers were published with the goals and obstacles predicted for the Astronomy field in the 2020s. They outlined how Astronomy needs to change in order to be prepared for the huge volume of data expected during the 2020s. New methods are required, since traditional ones cannot cope with data volumes of this scale. Problems show up in storage, software, and processing.

The storage problem may have a solution in cloud computing, e.g., GCP, as noted by Nature. However, processing does not have a simple solution. The methods used to process and analyze the data need to change. It is important to note that Astronomy is a science based on finding patterns. Stars with the same redshift – an estimate of a star’s distance from us obtained by measuring the shift of its light toward longer wavelengths (lower frequencies) – and similar composition can be considered candidates for the same population. Galaxies with the same morphology and activity, or with a spectrum originating in the nucleus, usually host black holes with similar behavior. We can even calculate the Universe’s expansion rate by studying the pattern in the spectra of different Type Ia supernovae. And what is the best tool we have to learn patterns in a lot of data? Machine learning.

Machine learning is a tool that Astronomy can use to deal with the computational problems cited above. The data-driven approach offered by machine learning techniques may deliver analyses and results faster than traditional methods such as numerical simulations or MCMC – a statistical method of sampling from a probability distribution. In the past few years, we have seen an interesting increase in the interaction between Astronomy and machine learning. To quantify this, occurrences of the keyword machine learning in Astronomy papers increased fourfold from 2015 to 2020, while deep learning grew roughly threefold each year. More specifically, machine learning has been widely used to classify celestial objects and to predict spectra from given properties. Today, we see a large range of applications, from discovering exoplanets and simulating the Universe’s cosmic web to searching for gravitational waves.

Since machine learning offers a data-driven approach, it can accelerate scientific research in the field. An interesting example is the research around black holes. Black holes have been a hot topic for the past few years, with amazing results and pictures from the Event Horizon Telescope (EHT). To understand a black hole, we need the help of computational tools. A black hole is a region of spacetime so extremely curved that nothing, not even light, can escape. When matter gets trapped in its gravitational field, it forms a disk called an accretion disk. The accretion disk dynamics are chaotic and turbulent, and to understand the accretion disk physics, we need to simulate complex fluid equations.
A common method to solve this and gain insight into black hole physics is to use numerical simulations. The environment around a black hole can be described with a set of conservation equations – usually mass conservation, energy conservation, and angular momentum conservation. This set of equations can be solved with numerical methods that iteratively update each parameter at each time step. The result is a set of dumps – or frames – with information about density, pressure, velocity field, and magnetic field for each (x, y, t) in the 2D case or (x, y, z, t) in the 3D case. However, numerical simulations are very time-consuming: a simple hydrodynamical treatment around a black hole can take up to 7 days running on 400 CPU cores.
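
As an illustration (a generic form, not necessarily the exact system solved in this work), the mass conservation part of such a set of equations is the continuity equation,

\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0

where \rho is the gas density and \mathbf{v} is the velocity field; the energy and angular momentum conservation equations complete the system that the solver advances in time.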

If you start adding complexity, such as electromagnetism equations to understand the magnetic fields around a black hole and general relativity equations to realistically explain the space-time there, the time can increase significantly. We are slowly reaching a barrier in black hole physics due to computational limitations where it is becoming harder and harder to realistically simulate a black hole.

Black hole research

That is where my advisor, Rodrigo Nemmen, and I started to think about a new method to accelerate black hole physics; in other words, a new method that could accelerate the numerical simulations we needed to study these extreme objects. From the beginning, machine learning seemed like the most promising approach for us. We had the data to feed into a machine learning algorithm, and there were successful cases in the literature of simulating fluids with machine learning, but never around a black hole. It was worth giving it a shot. We began a collaboration with João Navarro from Nvidia Brazil, and then we started solving the problem. We carefully chose an existing architecture to base our own scheme on. Since we wanted a data-driven approach, we decided to go with supervised learning; more specifically, we decided to use deep learning, drawing on the strong performance of convolutional neural networks.

How we built it

Everything was built using TensorFlow and Keras. We started with TensorFlow 1, since it was the version available at the time. Back then, Keras had not yet been added to TensorFlow, but funnily enough, during that period I attended the TensorFlow Roadshow 2019 in São Paulo, Brazil. It was at that event that I found out about TensorFlow and Keras joining forces in TensorFlow 2 to create a powerful framework. I even took a picture of the announcement. It was also the first time I heard about the strategy scope implemented in TensorFlow 2; I did not know back then that I would be using that same function today.
It took weeks to process the data and figure out the best way to prepare it before we could feed it to ConvNets. The data described the density of a fluid around a black hole. In our case, we got the data from sub-fed black holes, in other words, black holes with low accretion rates. Back in 2019, the simulations we used were the longest of their kind – 2D profiles using a hydrodynamical treatment. The process we went through is described in Duarte et al. 2022. We trained our ConvNet with 2D spatial + 1D temporal dimensions. A cluster with two GPUs (NVIDIA G100 and NVIDIA P6000) was our main hardware for training the neural network.
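
To make the setup concrete, the following is a minimal TensorFlow/Keras sketch (an illustration under assumed shapes, not our actual architecture) of a ConvNet that maps a short history of 2D density frames to the next frame, in the spirit of the frame-to-frame prediction described above:

import tensorflow as tf

HISTORY, HEIGHT, WIDTH = 4, 128, 128  # hypothetical: 4 past frames of a 128x128 density grid

model = tf.keras.Sequential([
    tf.keras.Input(shape=(HEIGHT, WIDTH, HISTORY)),   # past density frames stacked as channels
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 3, padding="same"),     # predicted next density frame
])
model.compile(optimizer="adam", loss="mse")
model.summary()
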
After a few hours of training, our model was ready to simulate black holes. First, we tested its capacity by checking how well the model could reproduce the remainder of a simulation it had already seen. The video shows the target and the prediction for what we call the direct case: we feed a simulation frame to the model as input and analyze how well it predicts the next step.
But we also wanted to see how much of the physics the model could learn by only looking at some simulations, so we tested the model’s capacity to simulate a never-seen system. During the training process, we hid one simulation from the model. After training, we input the initial conditions and a single frame to test how the model would perform while simulating on its own. The results are great news: the model can simulate a system by learning the physics only from other ones. And the news gets better: it delivers a 32,000x speed-up compared to traditional methods.

Just out of curiosity, we tested a direct prediction from a system where the accretion flow around the black hole has high variability. It’s a really beautiful result to see how the model could follow the turbulent behavior of the accretion flow.

If you are interested in more details and results, they are available at Duarte et al. 2022.

This work demonstrates the power of using deep learning techniques in Astronomy to speed up scientific research. All the work was done using only TensorFlow tools to preprocess, train and predict. How great is that?

Conclusion

As we discussed in this post, AI is already an essential part of Astronomy, and we can expect its role to keep growing. We have already seen that Astronomy can achieve big wins with the help of AI. It is a field with a lot of data and patterns, perfect for building and testing AI tools on real-world data. There will come a day when AI is discovering and unfolding the Universe, and hopefully that day comes soon!

Read More