Build a cross-account MLOps workflow using the Amazon SageMaker model registry

A well-designed CI/CD pipeline is essential to scale any software development workflow effectively. When designing production CI/CD pipelines, AWS recommends using multiple accounts to isolate resources, contain security threats, and simplify billing, and data science pipelines are no different. At AWS, we’re continuing to innovate to simplify the MLOps workflow.

In this post, we discuss some of the newer cross-account features of Amazon SageMaker that allow you to better share and manage model groups and model versions. For an example account structure that follows organizational unit best practices to host models using SageMaker endpoints across accounts, refer to MLOps Workload Orchestrator.

Solution overview

The following diagram illustrates our shared model registry architecture.

Architecture diagram reflecting the cross account MLOps process

The following steps correspond to the numbered flow in the preceding diagram:

  1. A data scientist registers a model from the data science account into the shared services SageMaker model registry in a PendingManualApproval state. The model artifact is created in the shared services account Amazon Simple Storage Service (Amazon S3) bucket.
  2. Upon registration of a new model version, someone with the authority to approve the model based on its metrics approves or rejects it (a programmatic approval sketch follows this list).
  3. After the model is approved, the CI/CD pipeline in the deployment account is triggered to deploy the updated model to the QA account and update the stage to QA.
  4. Upon passing the testing process, you can either choose to have a manual approval step within your CI/CD process or have your CI/CD pipeline directly deploy the model to production and update the stage to Prod.
  5. The production environment references the approved model and code, perhaps doing an A/B test in production. In case of an audit or any issue with the model, you can use Amazon SageMaker ML Lineage Tracking. It creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. With the tracking information, you can reproduce the workflow steps, track the model and dataset lineage, and establish model governance and audit standards.
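
For reference, the approval in step 2 can be performed on the SageMaker console or programmatically. The following is a minimal sketch using Boto3; the model package ARN placeholders and the approval description are illustrative, not part of the original walkthrough.

import boto3

sm_client = boto3.client("sagemaker")

# Approve a specific model package version after reviewing its metrics
# (the ARN below is a placeholder for the version registered in the shared services account)
sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:<region>:<shared_services_account_id>:model-package/<model_package_group_name>/<version_number>",
    ModelApprovalStatus="Approved",
    ApprovalDescription="Metrics reviewed; approved for deployment",
)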

Throughout the whole process, the shared model registry retains the older model versions. This allows the team to roll back changes, or even host production variants.

Prerequisites

Make sure you have the following prerequisites:

  • A provisioned multi-account structure – For instructions, see Best Practices for Organizational Units with AWS Organizations. For the purposes of this post, we use the following accounts:
    • Data science account – An account where data scientists have access to the training data and create the models.
    • Shared services account – A central account for storing the model artifacts (as shown in the architecture diagram) to be accessed across the different workload accounts.
    • Deployment account – An account responsible for deploying changes to the various accounts.
    • Workload accounts – These are commonly QA and prod environments where software engineers are able to build applications to consume the ML model.
  • A deployment account with appropriate permissions – For more information about best practices with a multi-account OU structure, refer to Deployments OU. This account is responsible for pointing the workload accounts to the desired model in the shared services account’s model registry.

Define cross-account policies

In following the principle of least privilege, first we need to add cross-account resource policies to the shared services resources to grant access from the other accounts.

Because the model artifacts are stored in the shared services account’s S3 bucket, the data science account needs Amazon S3 read/write access to push trained models to Amazon S3. The following code illustrates this policy, but don’t add it to the shared services account yet:

#Data Science account's policy to access the Shared Services S3 bucket
{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AddPerm",
        "Effect": "Allow",
        "Principal": {
            "AWS": "arn:aws:iam::<data_science_account_id>:root"
        },
        "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetObject",
            "s3:GetObjectVersion"
        ],
        "Resource": "arn:aws:s3:::<shared_bucket>/*"
    }]
}

The deployment account only needs to be granted read access to the S3 bucket, so that it can use the model artifacts to deploy to SageMaker endpoints. We also need to attach the following policy to the shared services S3 bucket:

#Deployment account's policy to access the Shared Services S3 bucket
{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AddPerm",
        "Effect": "Allow",
        "Principal": {
            "AWS": "arn:aws:iam::<deployment_account_id>:root"
        },
        "Action": [
            "s3:GetObject",
            "s3:GetObjectVersion"
        ],
        "Resource": "arn:aws:s3:::<shared_bucket>/*"
    }]
}

We combine both policies to get the following final policy. Create this policy in the shared services account after replacing the appropriate account IDs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AddPerm",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<data_science_account_id>:root"
      },
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<shared_bucket>/*"
    },
    {
      "Sid": "AddPermDeployment",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<deployment_account_id>:root"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<shared_bucket>/*"
    }
  ]
}

To be able to deploy a model created in a different account, the user must have a role that has access to SageMaker actions, such as a role with the AmazonSageMakerFullAccess managed policy. Refer to Deploy a Model Version from a Different Account for additional details.

We need to define the model group that contains the model versions we want to deploy. Also, we want to grant permissions to the data science account. This can be accomplished in the following steps. We refer to the accounts as follows:

  • shared_services_account_id – The account that hosts the model registry and where we want the model versions to reside
  • data_science_account_id – The account where we will be training and therefore creating the actual model artifact
  • deployment_account_id – The account where we want to host the endpoint for this model

First, we need to ensure the model package group exists. You can use the Boto3 API as shown in the following example, or you can use the AWS Management Console to create the model package group. Refer to Create Model Package Group for more details. This assumes you have Boto3 installed.

import boto3

model_package_group_name = "cross-account-example-model"
sm_client = boto3.Session().client("sagemaker")

create_model_package_group_response = sm_client.create_model_package_group(
    ModelPackageGroupName=model_package_group_name,
    ModelPackageGroupDescription="Cross account model package group",
    Tags=[
          {
              'Key': 'Name',
              'Value': 'cross_account_test'
          },
      ]

)

print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))

For the permissions for this model package group, you can create a JSON document resembling the following code. Replace the actual account IDs and model package group name with your own values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AddPermModelPackageGroupCrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<data_science_account_id>:root"
      },
      "Action": [
        "sagemaker:DescribeModelPackageGroup"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<shared_services_account_id>:model-package-group/<model_package_group_name>"
    },
    {
      "Sid": "AddPermModelPackageVersionCrossAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<data_science_account_id>:root"
      },
      "Action": [
        "sagemaker:DescribeModelPackage",
        "sagemaker:ListModelPackages",
        "sagemaker:UpdateModelPackage",
        "sagemaker:CreateModelPackage",
        "sagemaker:CreateModel"
      ],
      "Resource": "arn:aws:sagemaker:<region>:<shared_services_account_id>:model-package/<model_package_group_name>/*"
    }
  ]
}

Finally, apply the policy to the model package group. You can’t associate this policy with the package group via the console; you need SDK or AWS Command Line Interface (AWS CLI) access. For example, the following code uses Boto3:

import json

import boto3

# Convert the policy from a JSON dict to a string
model_package_group_policy = dict(<put-above-json-policy-after-substitute>)
model_package_group_policy = json.dumps(model_package_group_policy)

# Set the new policy on the model package group
sm_client = boto3.Session().client("sagemaker")
response = sm_client.put_model_package_group_policy(
    ModelPackageGroupName = model_package_group_name,
    ResourcePolicy = model_package_group_policy)
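
Optionally, you can read the policy back to confirm it was attached. The following short sketch reuses the same sm_client:

# Verify that the resource policy is now attached to the model package group
attached_policy = sm_client.get_model_package_group_policy(
    ModelPackageGroupName=model_package_group_name
)
print(attached_policy["ResourcePolicy"])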

We also need a custom AWS Key Management Service (AWS KMS) key to encrypt the model while storing it in Amazon S3. This needs to be done using the data science account. On the AWS KMS console, navigate to the Define key usage permissions page. In the Other AWS accounts section, choose Add another AWS account. Enter the AWS account number for the deployment account. You use this KMS key for the SageMaker training job. If you don’t specify a KMS key for the training job, SageMaker defaults to an Amazon S3 server-side encryption key. A default Amazon S3 server-side encryption key can’t be shared with or used by another AWS account.
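
The console flow just described edits the key's usage permissions. If you prefer to script this step, a KMS grant is one programmatic way to give the deployment account use of the key; the following is a hedged sketch with placeholder values, not part of the original walkthrough.

import boto3

kms_client = boto3.client("kms")

# Grant the deployment account permission to use the data science account's key
# (<kms_key_id> and <deployment_account_id> are placeholders)
kms_client.create_grant(
    KeyId="<kms_key_id>",
    GranteePrincipal="arn:aws:iam::<deployment_account_id>:root",
    Operations=["Decrypt", "DescribeKey", "GenerateDataKey"],
)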

The policy and permissions follow this pattern:

  • The Amazon S3 policy specified in the shared services account gives permissions to the data science account and deployment account
  • The KMS key policy specified in the data science account gives permissions to the deployment account

We need to ensure that the shared services account and deployment account have access to the Docker images that were used for training the model. These images are generally hosted in AWS accounts, and your account admin can help you get access, if you don’t have access already. For this post, we don’t create any custom Docker images after training the model and therefore we don’t need any specific Amazon ECR policies for the images.

In the workload accounts (QA or prod), we need to create two AWS Identity and Access Management (IAM) policies similar to the following. These are inline policies, which means that they’re embedded in an IAM identity. This gives these accounts access to the model registry.

The first inline policy allows a role to access the Amazon S3 resource in the shared services account that contains the model artifact. Provide the name of the S3 bucket and your model:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<bucket-name>/sagemaker/<cross-account-example-model>/output/model.tar.gz"
        }
    ]
}

The second inline policy allows a role, which we create later, to use the KMS key created in the data science account. Specify the account ID for the data science account and the KMS key ID:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:us-east-1:<data_science_account_id>:key/{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}"
            ]
        }
    ]
}

Finally, we need to create an IAM role for SageMaker. This role has the AmazonSageMakerFullAccess policy attached. We then attach these two inline policies to the role we created. If you’re using an existing SageMaker execution role, attach these two policies to that role. For instructions, refer to Creating roles and attaching policies (console).
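
If you prefer to script this role setup instead of using the console, the following sketch shows the general shape with Boto3. The role name and inline policy names are hypothetical, and s3_inline_policy and kms_inline_policy stand for the two JSON documents shown above.

import json

import boto3

iam = boto3.client("iam")

# Trust policy that lets SageMaker assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="cross-account-sagemaker-execution-role",  # hypothetical name
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the managed policy and the two inline policies shown earlier
iam.attach_role_policy(
    RoleName="cross-account-sagemaker-execution-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
)
iam.put_role_policy(
    RoleName="cross-account-sagemaker-execution-role",
    PolicyName="shared-model-artifact-s3-access",
    PolicyDocument=json.dumps(s3_inline_policy),   # first inline policy above
)
iam.put_role_policy(
    RoleName="cross-account-sagemaker-execution-role",
    PolicyName="shared-model-kms-access",
    PolicyDocument=json.dumps(kms_inline_policy),  # second inline policy above
)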

Now that we have defined the policies of each account, let’s use an example to see it in action.

Build and train a model using a SageMaker pipeline

We first create a SageMaker pipeline in the data science account for carrying out data processing, model training, and evaluation. We use the California housing dataset obtained from the StatLib library. In the following code snippet, we use a custom preprocessing script preprocess.py to perform some simple feature transformation such as feature scaling, which can be generated using the following notebook. This script also splits the dataset into training and test datasets.

We create a SKLearnProcessor object to run this preprocessing script. In the SageMaker pipeline, we create a processing step (ProcessingStep) to run the processing code using SKLearnProcessor. This processing code runs when the SageMaker pipeline runs. The code creating the SKLearnProcessor and ProcessingStep is shown in the following example. Note that all the code in this section runs in the data science account.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

# Useful SageMaker variables - Create a Pipeline session which will lazily initialize resources
session = PipelineSession()

framework_version = "0.23-1"

# Create SKlearn processor object,
# The object contains information about what instance type to use, the IAM role to use etc.
# A managed processor comes with a preconfigured container, so only specifying version is required.
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    role=role,
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="tf2-california-housing-processing-job",
    sagemaker_session=session
)

# Use the sklearn_processor in a SageMaker pipelines ProcessingStep
step_preprocess_data = ProcessingStep(
    name="Preprocess-California-Housing-Data",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="preprocess.py",
)

We need a custom KMS key to encrypt the model while storing it in Amazon S3. See the following code:

kms_client = boto3.client('kms')
response = kms_client.describe_key(
    KeyId='alias/sagemaker/outkey',
)
key_id = response['KeyMetadata']['KeyId']
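
If the alias/sagemaker/outkey alias doesn’t exist yet in the data science account, you can create the key and alias first. The following is a minimal sketch (key policy, rotation settings, and tags omitted for brevity):

# Create a customer managed key and point the alias at it
create_key_response = kms_client.create_key(
    Description="KMS key for encrypting SageMaker model artifacts"
)
kms_client.create_alias(
    AliasName="alias/sagemaker/outkey",
    TargetKeyId=create_key_response["KeyMetadata"]["KeyId"],
)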

To train the model, we create a TensorFlow estimator object. We pass it the KMS key ID along with our training script train.py, training instance type, and count. We also create a TrainingStep to be added to our pipeline, and add the TensorFlow estimator to it. See the following code:

from sagemaker.inputs import TrainingInput
from sagemaker.tensorflow import TensorFlow
from sagemaker.workflow.steps import TrainingStep

model_path = f"s3://{bucket}/{prefix}/model/"

hyperparameters = {"epochs": training_epochs}
tensorflow_version = "2.4.1"
python_version = "py37"

tf2_estimator = TensorFlow(
    source_dir="code",
    entry_point="train.py",
    instance_type=training_instance_type,
    instance_count=1,
    framework_version=tensorflow_version,
    role=role,
    base_job_name="tf2-california-housing-train",
    output_path=model_path,
    output_kms_key=key_id,
    hyperparameters=hyperparameters,
    py_version=python_version,
    sagemaker_session=session
)

# Use the tf2_estimator in a SageMaker Pipelines TrainingStep.
# NOTE how the input to the training job directly references the output of the previous step.
step_train_model = TrainingStep(
    name="Train-California-Housing-Model",
    estimator=tf2_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "test": TrainingInput(
            s3_data=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

In addition to training, we need to carry out model evaluation, for which we use mean squared error (MSE) as the metric in this example. The earlier notebook also generates evaluate.py, which we use to evaluate our model using MSE. We also create a ProcessingStep to run the model evaluation script using a SKLearnProcessor object. The following code creates this step:

from sagemaker.workflow.properties import PropertyFile

# Create SKLearnProcessor object.
# The object contains information about what container to use, what instance type etc.
evaluate_model_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="tf2-california-housing-evaluate",
    role=role,
    sagemaker_session=session
)

# Create a PropertyFile
# A PropertyFile is used to be able to reference outputs from a processing step, for instance to use in a condition step.
# For more information, visit https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-propertyfile.html
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

# Use the evaluate_model_processor in a SageMaker pipelines ProcessingStep.
step_evaluate_model = ProcessingStep(
    name="Evaluate-California-Housing-Model",
    processor=evaluate_model_processor,
    inputs=[
        ProcessingInput(
            source=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_preprocess_data.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="evaluate.py",
    property_files=[evaluation_report],
)

After model evaluation, we also need a step to register our model with the model registry, if the model performance meets the requirements. This is shown in the following code using a ModelStep that wraps the register step arguments. Here we need to specify the model package group that we declared in the shared services account. Replace the Region, account, and model package group with your values. The model name used here is modeltest, but you can use any name of your choice.

from sagemaker.model import Model
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.model_step import ModelStep

# Create ModelMetrics object using the evaluation report from the evaluation step
# A ModelMetrics object contains metrics captured from a model.
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri=evaluation_s3_uri,
        content_type="application/json",
    )
)

# Create a RegisterModel step, which registers the model with SageMaker Model Registry.
model = Model(
    image_uri=tf2_estimator.training_image_uri(),
    model_data=step_train_model.properties.ModelArtifacts.S3ModelArtifacts,
    source_dir=tf2_estimator.source_dir,
    entry_point=tf2_estimator.entry_point,
    role=role_arn,
    sagemaker_session=session
)

model_registry_args = model.register(
    content_types=['text/csv'],
    response_types=['application/json'],
    inference_instances=['ml.t2.medium', 'ml.m5.xlarge'],
    transform_instances=['ml.m5.xlarge'],
    model_package_group_name=model_package_group_name,
    approval_status='PendingManualApproval',
    model_metrics=model_metrics
)

step_register_model = ModelStep(
    name='RegisterModel',
    step_args=model_registry_args
)

We also need to create the model so that it can be deployed (using the other account). To create the model, we use a ModelStep with the model.create() step arguments, as shown in the following code:

from sagemaker.workflow.model_step import ModelStep

step_create_model = ModelStep(
    name="Create-California-Housing-Model",
    step_args=model.create(instance_type="ml.m5.large", accelerator_type="ml.eia1.medium"),
)

Adding conditions to the pipeline is done with a ConditionStep. In this case, we only want to register the new model version with the model registry if the new model meets an accuracy condition. See the following code:

from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import (
    ConditionStep,
    JsonGet,
)

# Create accuracy condition to ensure the model meets performance requirements.
# Models with a test accuracy lower than the condition will not be registered with the model registry.
cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step=step_evaluate_model,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value",
    ),
    right=accuracy_mse_threshold,
)

# Create a SageMaker Pipelines ConditionStep, using the preceding condition.
# Enter the steps to perform if the condition returns True / False.
step_cond = ConditionStep(
    name="MSE-Lower-Than-Threshold-Condition",
    conditions=[cond_lte],
    if_steps=[step_register_model, step_create_model],
    else_steps=[step_higher_mse_send_email_lambda],
)

Finally, we want to orchestrate all the pipeline steps so that the pipeline can be initialized:

from sagemaker.workflow.pipeline import Pipeline

# Create a SageMaker Pipeline.
# Each parameter for the pipeline must be set as a parameter explicitly when the pipeline is created.
# Also pass in each of the preceding steps.
# Note that the order of execution is determined from each step's dependencies on other steps,
# not on the order they are passed in.
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        training_instance_type,
        input_data,
        training_epochs,
        accuracy_mse_threshold,
        endpoint_instance_type,
    ],
    steps=[step_preprocess_data, step_train_model, step_evaluate_model, step_cond],
)
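
With the pipeline object defined, you can create or update the pipeline definition in SageMaker and start a run. The following is a short sketch that reuses the execution role from earlier in this section:

# Register (or update) the pipeline definition and start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()  # optionally block until the run finishes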

Deploy a model version from a different account

Now that the model has been registered in the shared services account, we need to deploy it into our workload accounts using the CI/CD pipeline in the deployment account. We already configured the role and the policy in an earlier step. We use the model package ARN to deploy the model from the model registry. The following code runs in the deployment account and is used to deploy approved models to QA and prod:

import boto3
import sagemaker
from sagemaker import ModelPackage

sess = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=sess)

# Model package version ARN from the shared services account's model registry
model_package_arn = 'arn:aws:sagemaker:<region>:<shared_services_account>:model-package/modeltest/<version_number>'
model = ModelPackage(role=role,
                     model_package_arn=model_package_arn,
                     sagemaker_session=sagemaker_session)
model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
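
Rather than hard-coding the version number, the CI/CD pipeline can also look up the latest approved version in the shared model package group. The following sketch assumes the resource policy on the model package group also grants the deployment account the sagemaker:ListModelPackages action:

import boto3

sm_client = boto3.client("sagemaker")

# Find the most recently approved version in the shared model package group
response = sm_client.list_model_packages(
    ModelPackageGroupName="arn:aws:sagemaker:<region>:<shared_services_account>:model-package-group/cross-account-example-model",
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)
latest_approved_arn = response["ModelPackageSummaryList"][0]["ModelPackageArn"]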

Conclusion

In this post, we demonstrated how to set up the policies needed for a multi-account setup for ML based on the principle of least privilege. Then we showed the process of building and training the models in the data science account. Finally, we used the CI/CD pipeline in the deployment account to deploy the latest version of approved models to QA and production accounts. Additionally, you can view the deployment history of models and build triggers in AWS CodeBuild.

You can scale the concepts in this post to host models in Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Kubernetes Service (Amazon EKS), as well as build out a batch inference pipeline.

To learn more about having separate accounts that build ML models in AWS, see Best Practices for Organizational Units with AWS Organizations and Safely update models in production.


About the Authors

Sandeep Verma is a Sr. Prototyping Architect with AWS. He enjoys diving deep into customer challenges and building prototypes for customers to accelerate innovation. He has a background in AI/ML, is a founder of New Knowledge, and is generally passionate about tech. In his free time, he loves traveling and skiing with his family.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges on AWS. She spends most of her time diving deep and teaching customers on AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge and has created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.

Saumitra Vikram is a Software Developer on the Amazon SageMaker team and is based in Chennai, India. Outside of work, he loves spending time running, trekking and motor bike riding through the Himalayas.

Sreedevi Srinivasan is an engineering leader in AWS SageMaker. She is passionate and excited about enabling ML as a platform that is set to transform everyday lives. She currently focuses on SageMaker Feature Store. In her free time, she likes to spend time with her family.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from The University of Texas at Austin and a MS in Computer Science from Georgia Institute of Technology. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization and related domains. He has over 16 years of work experience and is also an adjunct faculty member at The University of Texas at Dallas, where he teaches a graduate course on Applied Machine Learning. Based in Dallas, Texas, he and his family love to travel and make long road trips.

Read More

Enabling hybrid ML workflows on Amazon EKS and Amazon SageMaker with one-click Kubeflow on AWS deployment

Today, many AWS customers are building enterprise-ready machine learning (ML) platforms on Amazon Elastic Kubernetes Service (Amazon EKS) using Kubeflow on AWS (an AWS-specific distribution of Kubeflow) across many use cases, including computer vision, natural language understanding, speech translation, and financial modeling.

With the latest release of open-source Kubeflow v1.6.1, the Kubeflow community continues to support this large-scale adoption of Kubeflow for enterprise use cases. The latest release includes many exciting new features like support for Kubernetes v1.22, a combined Python SDK for PyTorch, MXNet, MPI, and XGBoost in Kubeflow’s distributed Training Operator, new ClusterServingRuntime and ServingRuntime CRDs for model serving, and many more.

AWS contributions to Kubeflow with the recent launch of Kubeflow on AWS 1.6.1 support all upstream open-source Kubeflow features and include many new integrations with the highly optimized, cloud-native, enterprise-ready AWS services that will help you build highly reliable, secure, portable, and scalable ML systems.

In this post, we discuss new Kubeflow on AWS v1.6.1 features and highlight three important integrations that have been bundled on one platform to offer you:

  • An infrastructure as code (IaC) one-click solution that automates the end-to-end installation of Kubeflow, including EKS cluster creation
  • Support for distributed training on Amazon SageMaker using Amazon SageMaker Operators for Kubernetes (ACK) and SageMaker components for Kubeflow Pipelines, and locally on Kubernetes using Kubeflow Training Operators. Many customers are using this capability to build hybrid machine learning architectures where they use Kubernetes compute for the experimentation phase and SageMaker to run production-scale workloads.
  • Enhanced monitoring and observability for ML workloads including Amazon EKS, Kubeflow metrics, and application logs using Prometheus, Grafana, and Amazon CloudWatch integrations

The use case in this post focuses specifically on the SageMaker integration with Kubeflow on AWS, which can be added to your existing Kubernetes workflows to enable you to build hybrid machine learning architectures.

Kubeflow on AWS

Kubeflow on AWS 1.6.1 provides a clear path to use Kubeflow, with the addition of the following AWS services on top of existing capabilities:

  • SageMaker Integration with Kubeflow to run hybrid ML workflows using SageMaker Operators for Kubernetes (ACK) and SageMaker Components for Kubeflow Pipelines.
  • Automated deployment options have been improved and simplified using Kustomize scripts and Helm charts.
  • Added support for infrastructure as code (IaC) one-click deployment for Kubeflow on AWS using Terraform for all the available deployment options. This script automates the creation of the required AWS resources.
  • Support for AWS PrivateLink for Amazon S3 enabling non-commercial Region users to connect to their respective S3 endpoints.
  • Added integration with Amazon Managed Service for Prometheus (AMP) and Amazon Managed Grafana to monitor metrics with Kubeflow on AWS.
  • Updated Kubeflow notebook server containers with the latest deep learning container images based on TensorFlow 2.10.0 and PyTorch 1.12.1.
  • Integration with AWS DLCs to run distributed training and inference workloads.

The following architecture diagram is a quick snapshot of all the service integrations (including the ones already mentioned) that are available for Kubeflow control and data plane components in Kubeflow on AWS. The Kubeflow control plane is installed on top of Amazon EKS, which is a managed container service used to run and scale Kubernetes applications in the cloud. These AWS service integrations allow you to decouple critical parts of the Kubeflow control plane from Kubernetes, providing a secure, scalable, resilient, and cost-optimized design. For more details on the value that these service integrations add over open-source Kubeflow, refer to Build and deploy a scalable machine learning system on Kubernetes with Kubeflow on AWS.

Let’s discuss in more detail how the key Kubeflow on AWS 1.6.1 features can help your organization.

Kubeflow on AWS feature details

With the Kubeflow 1.6.1 release, we tried to provide better tools for different kinds of customers that make it easy to get started with Kubeflow no matter which options you choose. These tools provide a good starting point and can be modified to fit your exact needs.

Deployment options

We provide different deployment options for different customer use cases. Here you get to choose which AWS services you want to integrate your Kubeflow deployment with. If you decide to change deployment options later, we recommend that you do a fresh installation for the new deployment.

If you want to deploy Kubeflow with minimal changes, consider the vanilla deployment option. All available deployment options can be installed using Kustomize, Helm, or Terraform.

We also have different add-on deployments that can be installed on top of any of these deployment options.

Installation options

After you have decided which deployment option best suits your needs, you can choose how you want to install these deployments. In an effort to serve experts and newcomers alike, we have different levels of automation and configuration.

Option 1: Terraform (IaC)

This creates an EKS cluster and all the related AWS infrastructure resources, and then deploys Kubeflow all in one command using Terraform. Internally, this uses EKS blueprints and Helm charts.

This option has the following advantages:

  • It provides flexibility to enterprises to deploy Amazon EKS and Kubeflow with one command without having to worry about specific Kubeflow component configurations. This can immensely speed up technology evaluation, prototyping, and the product development lifecycle, while providing the flexibility to use Terraform modules and modify them to meet any project-specific needs.
  • Many organizations that have Terraform at the center of their cloud strategy can now use the Kubeflow on AWS Terraform solution to meet their cloud goals.

Option 2: Kustomize or Helm charts

This option allows you to deploy Kubeflow in a two-step process:

  1. Create AWS resources like Amazon EKS, Amazon RDS, Amazon S3, and Amazon Cognito, either through the automated scripts included in the AWS distribution or manually following a step-by-step guide.
  2. Install Kubeflow deployments either using Helm charts or Kustomize.

This option has the following advantages:

  • The main goal of this installation option is to provide Kubeflow-related Kubernetes configurations. Therefore, you can choose to create or bring in existing EKS clusters or any of the related AWS resources like Amazon RDS, Amazon S3, and Amazon Cognito, and configure and manage it to work with Kubeflow on AWS.
  • It’s easier to move from an open-source Kustomize Kubeflow manifest to AWS Kubeflow distribution.

The following diagram illustrates the architectures of both options.

Integration with SageMaker

SageMaker is a fully managed service designed and optimized specifically for managing ML workflows. It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference.

Many AWS customers who have portability requirements or on-premises standard restrictions use Amazon EKS to set up repeatable ML pipelines running training and inference workloads. However, this requires developers to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, and comply with appropriate security and regulatory requirements. These customers therefore want to use SageMaker for cost-optimized and managed infrastructure for model training and deployments and continue using Kubernetes for orchestration and ML pipelines to retain standardization and portability.

To address this need, AWS allows you to train, tune, and deploy models in SageMaker from Amazon EKS by using the following two options:

  • Amazon SageMaker ACK Operators for Kubernetes, which are based on the AWS Controllers for Kubernetes (ACK) framework. ACK is the AWS strategy that brings in standardization for building Kubernetes custom controllers that allow Kubernetes users to provision AWS resources like databases or message queues simply by using the Kubernetes API. SageMaker ACK Operators make it easier for ML developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy ML models in SageMaker without signing in to the SageMaker console.
  • The SageMaker Components for Kubeflow Pipelines, which allow you to integrate SageMaker with the portability and orchestration of Kubeflow Pipelines. With the SageMaker components, each job in the pipeline workflow runs on SageMaker instead of the local Kubernetes cluster. This allows you to create and monitor native SageMaker training, tuning, endpoint deployment, and batch transform jobs from your Kubeflow Pipelines hence allowing you to move complete compute including data processing and training jobs from the Kubernetes cluster to SageMaker’s machine learning-optimized managed service.

Starting with Kubeflow on AWS v1.6.1, all of the available Kubeflow deployment options bring together both Amazon SageMaker integration options by default on one platform. That means you can now submit SageMaker jobs either from a Kubeflow notebook server itself, by submitting the custom SageMaker resource through the SageMaker ACK operators, or from a Kubeflow pipeline step using SageMaker components.

There are two versions of SageMaker components: Boto3 (AWS SDK for Python) based version 1 components, and SageMaker Operator for Kubernetes (ACK) based version 2 components. The new SageMaker components version 2 support the latest SageMaker training APIs, and we will continue to add more SageMaker features to this version of the component. You do, however, have the flexibility to combine SageMaker components version 2 for training and version 1 for other SageMaker features like hyperparameter tuning, processing jobs, hosting, and many more.

Integration with Prometheus and Grafana

Prometheus is an open-source metrics aggregation tool that you can configure to run on Kubernetes clusters. When running on Kubernetes clusters, a main Prometheus server periodically scrapes pod endpoints.

Kubeflow components, such as Kubeflow Pipelines (KFP) and Notebook, emit Prometheus metrics to allow monitoring component resources such as the number of running experiments or notebook count.

These metrics can be aggregated by a Prometheus server running in the Kubernetes cluster and queried using Prometheus Query Language (PromQL). For more details on the features that Prometheus supports, check out the Prometheus documentation.

The Kubeflow on AWS distribution provides support for the integration with the following AWS managed services:

  1. Amazon Managed Service for Prometheus (AMP), a Prometheus-compatible monitoring service for container infrastructure and application metrics that makes it easy for customers to securely monitor container environments at scale. Using AMP, you can visualize, analyze, and alarm on your metrics, logs, and traces collected from multiple data sources in your observability system, including AWS, third-party ISVs, and other resources across your IT portfolio.
  2. Amazon Managed Grafana, a fully managed and secure data visualization service based on the open source Grafana project, that enables customers to instantly query, correlate, and visualize operational metrics, logs, and traces for their applications from multiple data sources. Amazon Managed Grafana offloads the operational management of Grafana by automatically scaling compute and database infrastructure as usage demands increase, with automated version updates and security patching.

The Kubeflow on AWS distribution provides support for the integration of Amazon Managed Service for Prometheus and Amazon Managed Grafana to facilitate the ingestion and visualization of Prometheus metrics securely at scale.

The following metrics are ingested and can be visualized:

  • Metrics emitted from Kubeflow components such as Kubeflow Pipelines and the Notebook server
  • Kubeflow control plane metrics

To configure Amazon Managed Service for Prometheus and Amazon Managed Grafana for your Kubeflow cluster, refer to Use Prometheus, Amazon Managed Service for Prometheus, and Amazon Managed Grafana to monitor metrics with Kubeflow on AWS.

Solution overview

In this use case, we use the Kubeflow vanilla deployment using Terraform installation option. When installation is complete, we log in to the Kubeflow dashboard. From the dashboard, we spin up a Kubeflow Jupyter notebook server to build a Kubeflow pipeline that uses SageMaker to run distributed training for an image classification model and a SageMaker endpoint for model deployment.

Prerequisites

Make sure you meet the following prerequisites:

  • You have an AWS account.
  • Make sure you’re in the us-west-2 Region to run this example.
  • Use Google Chrome for interacting with the AWS Management Console and Kubeflow.
  • Make sure your account has SageMaker Training resource type limit for ml.p3.2xlarge increased to 2 using the Service Quotas console.
  • Optionally, you can use AWS Cloud9, a cloud-based integrated development environment (IDE) that enables completing all the work from your web browser. For setup instructions, refer to Setup Cloud9 IDE. Select Ubuntu Server 18.04 as the platform in the AWS Cloud9 settings. Then, from your AWS Cloud9 environment, choose the plus sign and open a new terminal.

You also configure an AWS Command Line Interface (AWS CLI) profile. To do so, you need an access key ID and secret access key of an AWS Identity and Access Management (IAM) user account with administrative privileges (attach the existing managed policy) and programmatic access. See the following code:

aws configure --profile=kubeflow
AWS Access Key ID [None]: <enter access key id>
AWS Secret Access Key [None]: <enter secret access key>
Default region name [None]: us-west-2
Default output format [None]: json

# (In Cloud9, select “Cancel” and “Permanently disable” when the AWS managed temporary credentials dialog pops up)
export AWS_PROFILE=kubeflow

Verify the permissions that Cloud9 will use to call AWS resources:

aws sts get-caller-identity

Verify from the following output that you see the ARN of the admin user that you configured in the AWS CLI profile. In this example, it is kubeflow-user:

{
    "UserId": "*******",
    "Account": "********",
    "Arn": "arn:aws:iam::*******:user/kubeflow-user"
}

Install Amazon EKS and Kubeflow on AWS

To install Amazon EKS and Kubeflow on AWS, complete the following steps:

  1. Set up your environment for deploying Kubeflow on AWS:
    #Clone the awslabs/kubeflow-manifests and the kubeflow/manifests repositories and check out the release branches of your choosing
    export KUBEFLOW_RELEASE_VERSION=v1.6.1
    export AWS_RELEASE_VERSION=v1.6.1-aws-b1.0.0
    git clone https://github.com/awslabs/kubeflow-manifests.git && cd kubeflow-manifests
    git checkout ${AWS_RELEASE_VERSION}
    git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream
    
    export MANIFEST_DIR=$PWD

    #Install the necessary tools with the following command:
    make install-tools
    source ~/.bash_profile

  2. Deploy the vanilla version of Kubeflow on AWS and related AWS resources like EKS using Terraform. Note that the EBS volumes used in the EKS node group are not encrypted by default:
    #Define the following environment variables
    
    #Region to create the cluster in
    export CLUSTER_REGION=us-west-2
    #Name of the cluster to create
    export CLUSTER_NAME=<enter-cluster-name>

    cd deployments/vanilla/terraform
    
    #Save the variables to a .tfvars file
    cat <<EOF > sample.auto.tfvars
    cluster_name="${CLUSTER_NAME}"
    cluster_region="${CLUSTER_REGION}"
    EOF
    
    #Run the following one-click command to deploy terraform to install EKS infrastructure and Kubeflow
    make deploy

Set up the Kubeflow permissions

  1. Add permissions to the notebook pod and pipeline component pod to make SageMaker, S3, and IAM API calls using the kubeflow_iam_permissions.sh script.
    export NAMESPACE=kubeflow-user-example-com
    
    wget https://raw.githubusercontent.com/aws-samples/eks-kubeflow-cloudformation-quick-start/9e46662d97e1be7edb0be7fc31166e545655636a/utils/kubeflow_iam_permissions.sh
    chmod +x kubeflow_iam_permissions.sh
    ./kubeflow_iam_permissions.sh $NAMESPACE $CLUSTER_NAME $CLUSTER_REGION

  2. Create a SageMaker execution role that enables the SageMaker training job to access the training dataset in Amazon S3, using the sagemaker_role.sh script.
    wget https://raw.githubusercontent.com/aws-samples/eks-kubeflow-cloudformation-quick-start/9e46662d97e1be7edb0be7fc31166e545655636a/utils/sagemaker_role.sh
    chmod +x sagemaker_role.sh
    ./sagemaker_role.sh

Access the Kubeflow dashboard

To access the Kubeflow dashboard, complete the following steps:

  1. You can run the Kubeflow dashboard locally in the Cloud9 environment without exposing your URLs to the public internet by running the following commands.
    # Configure Kubecontext
    $(terraform output -raw configure_kubectl)
    
    cd ${MANIFEST_DIR}
    make port-forward

  2. Choose Preview Running Application.
  3. Choose the icon in the corner of the Kubeflow dashboard to open it as a separate tab in Chrome.
  4. Enter the default credentials (user@example.com/12341234) to log in to the Kubeflow dashboard.

Set up the Kubeflow on AWS environment

Once you’re logged in to the Kubeflow dashboard, ensure you have the right namespace (kubeflow-user-example-com) chosen. Complete the following steps to set up your Kubeflow on AWS environment:

  1. On the Kubeflow dashboard, choose Notebooks in the navigation pane.
  2. Choose New Notebook.
  3. For Name, enter aws-nb.
  4. For Jupyter Docker Image, choose the image jupyter-pytorch:1.12.0-cpu-py38-ubuntu20.04-ec2-2022-09-20 (the latest available jupyter-pytorch DLC image).
  5. For CPU, enter 1.
  6. For Memory, enter 5.
  7. For GPUs, leave as None.
  8. Don’t make any changes to the Workspace and Data Volumes sections.
  9. Select Allow access to Kubeflow Pipelines in the Configurations section and choose Launch.
  10. Verify that your notebook is created successfully (it may take a couple of minutes).
  11. Choose Connect to log in to JupyterLab.
  12. Clone the repo by entering https://github.com/aws-samples/eks-kubeflow-cloudformation-quick-start.git in the Clone a repo field.
  13. Choose Clone.

Run a distributed training example

After you set up the Jupyter notebook, you can run the entire demo using the following high-level steps from the folder eks-kubeflow-cloudformation-quick-start/workshop/pytorch-distributed-training in the cloned repository:

  1. Run the PyTorch Distributed Data Parallel (DDP) training script – Refer to the PyTorch DDP training script cifar10-distributed-gpu-final.py, which includes a sample convolutional neural network and logic to distribute training on a multi-node CPU and GPU cluster.
  2. Create a Kubeflow pipeline – Run the notebook STEP1.0_create_pipeline_k8s_sagemaker.ipynb to create a pipeline that runs training and deploys a model on SageMaker. Make sure you install the SageMaker library as part of the first notebook cell and restart the kernel before you run the rest of the notebook cells.
  3. Invoke a SageMaker endpoint – Run the notebook STEP1.1_invoke_sagemaker_endpoint.ipynb to invoke and test the SageMaker model inference endpoint created in the previous notebook.

In the subsequent sections, we discuss each of these steps in detail.

Run the PyTorch DDP training script

As part of the distributed training, we train a classification model created by a simple convolutional neural network that operates on the CIFAR10 dataset. The training script cifar10-distributed-gpu-final.py contains only open-source libraries and is compatible with running on both Kubernetes and SageMaker training clusters, on either GPU devices or CPU instances. Let’s look at a few important aspects of the training script before we run our notebook examples.

We use the torch.distributed module, which contains PyTorch support and communication primitives for multi-process parallelism across nodes in the cluster:

...
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision
from torchvision import datasets, transforms
...

We create a simple image classification model using a combination of convolutional, max pooling, and linear layers, to which a ReLU activation function is applied in the forward pass of the model training:

# Define models
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

If the training cluster has GPUs, the script runs the training on CUDA devices and the device variable holds the default CUDA device:

device = "cuda" if torch.cuda.is_available() else "cpu"
...

Before you run distributed training using PyTorch DistributedDataParallel across multiple nodes, you need to initialize the distributed environment by calling init_process_group. This is initialized on each machine of the training cluster.

dist.init_process_group(backend=args.backend, rank=host_rank, world_size=world_size)
...
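
How host_rank and world_size are derived depends on the launcher. One common pattern when running on SageMaker is to read them from the standard SageMaker training environment variables, as in the following sketch (the actual cifar10-distributed-gpu-final.py script may compute them differently):

import json
import os

# SM_HOSTS is a JSON list of all hosts in the training cluster;
# SM_CURRENT_HOST is the name of the current host
hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
current_host = os.environ.get("SM_CURRENT_HOST", "algo-1")
world_size = len(hosts)
host_rank = hosts.index(current_host)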

We instantiate the classifier model and copy over the model to the target device. If distributed training is enabled to run on multiple nodes, the DistributedDataParallel class is used as a wrapper object around the model object, which allows synchronous distributed training across multiple machines. The input data is split on the batch dimension and a replica of the model is placed on each machine and each device. See the following code:

model = Net().to(device)

if is_distributed:
    model = torch.nn.parallel.DistributedDataParallel(model)

...

Create a Kubeflow pipeline

The notebook uses the Kubeflow Pipelines SDK and its provided set of Python packages to specify and run the ML workflow pipelines. As part of this SDK, we use the domain-specific language (DSL) package decorator dsl.pipeline, which decorates the Python functions to return a pipeline.

The Kubeflow pipeline uses SageMaker component V2 for submitting training to SageMaker using SageMaker ACK Operators. SageMaker model creation and model deployment use SageMaker component V1, which are Boto3-based SageMaker components. We use a combination of both components in this example to demonstrate the flexibility you have in choice.

  1. Load the SageMaker components using the following code:
    # Loads SageMaker training components v2 for Kubeflow pipeline from the URL
    sagemaker_train_ack_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/d4aaa03035f221351ebe72fbd74fcfccaf25bb66/components/aws/sagemaker/TrainingJob/component.yaml')
    
    # Loads SageMaker components v1 for Kubeflow pipeline from the URL
    sagemaker_model_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/cb36f87b727df0578f4c1e3fe9c24a30bb59e5a2/components/aws/sagemaker/model/component.yaml')
    sagemaker_deploy_op = components.load_component_from_url('https://raw.githubusercontent.com/kubeflow/pipelines/cb36f87b727df0578f4c1e3fe9c24a30bb59e5a2/components/aws/sagemaker/deploy/component.yaml')

    In the following code, we create the Kubeflow pipeline where we run SageMaker distributed training using two ml.p3.2xlarge instances:

    # Create Kubeflow Pipeline using Amazon SageMaker Service
    @dsl.pipeline(name="PyTorch Training pipeline", description="Sample training job test")
    def pytorch_cnn_pipeline(region=target_region,
                             train_image=aws_dlc_sagemaker_train_image,
                             serving_image=aws_dlc_sagemaker_serving_image,
                             learning_rate='0.01',
                             pytorch_backend='gloo',
                             training_job_name=pytorch_distributed_jobname,
                             instance_type='ml.p3.2xlarge',
                             instance_count='2',
                             network_isolation='False',
                             traffic_encryption='False',
                             ):

        # Step to run training on SageMaker using SageMaker Components V2 for Pipeline.
        training = sagemaker_train_ack_op(
            region=region,
            algorithm_specification=(f'{{ '
                                     f'"trainingImage": "{train_image}",'
                                     '"trainingInputMode": "File"'
                                     f'}}'),
            training_job_name=training_job_name,
            hyper_parameters=(f'{{ '
                              f'"backend": "{pytorch_backend}",'
                              '"batch-size": "64",'
                              '"epochs": "10",'
                              f'"lr": "{learning_rate}",'
                              '"model-type": "custom",'
                              '"sagemaker_container_log_level": "20",'
                              '"sagemaker_program": "cifar10-distributed-gpu-final.py",'
                              f'"sagemaker_region": "{region}",'
                              f'"sagemaker_submit_directory": "{source_s3}"'
                              f'}}'),
            resource_config=(f'{{ '
                             f'"instanceType": "{instance_type}",'
                             f'"instanceCount": {instance_count},'
                             '"volumeSizeInGB": 50'
                             f'}}'),
            input_data_config=training_input(datasets),
            output_data_config=training_output(bucket_name),
            enable_network_isolation=network_isolation,
            enable_inter_container_traffic_encryption=traffic_encryption,
            role_arn=role,
            stopping_condition={"maxRuntimeInSeconds": 3600}
        )

        model_artifact_url = get_s3_model_artifact_op(
            training.outputs["model_artifacts"]
        ).output

        # This step creates SageMaker Model which refers to model artifacts and inference script to deserialize the input image
        create_model = sagemaker_model_op(
            region=region,
            model_name=training_job_name,
            image=serving_image,
            model_artifact_url=model_artifact_url,
            network_isolation=network_isolation,
            environment=(f'{{ '
                         '"SAGEMAKER_CONTAINER_LOG_LEVEL": "20",'
                         '"SAGEMAKER_PROGRAM": "inference.py",'
                         f'"SAGEMAKER_REGION": "{region}",'
                         f'"SAGEMAKER_SUBMIT_DIRECTORY": "{model_artifact_url}"'
                         f'}}'),
            role=role
        )

        # This step creates SageMaker Endpoint which will be called to run inference
        prediction = sagemaker_deploy_op(
            region=region,
            model_name_1=create_model.output,
            instance_type_1='ml.c5.xlarge'
        )

        # Disable pipeline cache
        training.execution_options.caching_strategy.max_cache_staleness = "P0D"

    After the pipeline is defined, you can compile the pipeline to an Argo YAML specification using the Kubeflow Pipelines SDK’s kfp.compiler package. You can run this pipeline using the Kubeflow Pipelines SDK client, which calls the Pipelines service endpoint and passes in appropriate authentication headers right from the notebook. See the following code:

    # DSL Compiler that compiles pipeline functions into workflow yaml.
    kfp.compiler.Compiler().compile(pytorch_cnn_pipeline, "pytorch_cnn_pipeline.yaml")
    
    # Connect to Kubeflow Pipelines using the Kubeflow Pipelines SDK client
    client = kfp.Client()
    
    experiment = client.create_experiment(name="ml_workflow")
    
    # Run a specified pipeline
    my_run = client.run_pipeline(experiment.id, "pytorch_cnn_pipeline", "pytorch_cnn_pipeline.yaml")
    
    # Please click “Run details” link generated below this cell to view your pipeline. You can click every pipeline step to see logs.

  2. Choose the Run details link under the last cell to view the Kubeflow pipeline. The following screenshot shows our pipeline details for the SageMaker training and deployment component.
  3. Choose the training job step and on the Logs tab, choose the CloudWatch logs link to access the SageMaker logs.
    The following screenshot shows the CloudWatch logs for each of the two ml.p3.2xlarge instances.
  4. Choose any of the groups to see the logs.
  5. Capture the SageMaker endpoint by choosing the Sagemaker – Deploy Model step and copying the endpoint_name output artifact value.

Invoke a SageMaker endpoint

The notebook STEP1.1_invoke_sagemaker_endpoint.ipynb invokes the SageMaker inference endpoint created in the previous step. Ensure you update the endpoint name:

# Invoke SageMaker Endpoint. * Ensure you update the endpoint name
# You can grab the SageMaker endpoint name by either 1) going to the pipeline visualization in the Kubeflow console and clicking the deployment component, or 2) going to the SageMaker console, finding the endpoint in the list of endpoints, and then substituting its name into EndpointName='...' in this cell.

endpointName='<update-endpoint-here>'

response = client.invoke_endpoint(EndpointName=endpointName,
                                  ContentType='application/x-image',
                                  Body=payload)

pred = json.loads(response['Body'].read().decode())

output_vector_list = pred['score']

# Get output vector of 10 classes
output_vector = output_vector_list[0]

# Find the class with the highest probability
max_score = output_vector[0]
index = 0
for i in range(1, len(output_vector)):
    if output_vector[i] > max_score:
        max_score = output_vector[i]
        index = i

print(f'Index of the maximum value is : {index}')

labels = ['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck']

print(labels[index])

Clean up

To clean up your resources, complete the following steps:

  1. Run the following commands in AWS Cloud9 to delete the AWS resources:
    cd ${MANIFEST_DIR}/deployments/vanilla/terraform
    make delete

  2. Delete the IAM role sagemakerrole using the following AWS CLI commands:
    aws iam detach-role-policy --role-name sagemakerrole --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
    aws iam detach-role-policy --role-name sagemakerrole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
    aws iam delete-role --role-name sagemakerrole

  3. Delete the SageMaker endpoint using the following AWS CLI command:
    aws sagemaker delete-endpoint --endpoint-name <endpoint-name> --region us-west-2

Summary

In this post, we highlighted the value that Kubeflow on AWS 1.6.1 provides through native AWS-managed service integrations to address the needs of enterprise-level AI and ML use cases. You can choose from several deployment options to install Kubeflow on AWS with various service integrations using Terraform, Kustomize, or Helm. The use case in this post demonstrated a Kubeflow integration with SageMaker that uses a SageMaker-managed training cluster to run distributed training for an image classification model and a SageMaker endpoint for model deployment.

We have also made available a sample pipeline that uses the latest SageMaker components; you can run it directly from the Kubeflow dashboard. The pipeline takes the Amazon S3 data location and the SageMaker execution IAM role as its inputs.

To get started with Kubeflow on AWS, refer to the available AWS-integrated deployment options in Kubeflow on AWS. You can follow the AWS Labs repository to track all AWS contributions to Kubeflow. You can also find us on the Kubeflow #AWS Slack Channel; your feedback there will help us prioritize the next features to contribute to the Kubeflow project.


About the authors

Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Kartik Kalamadi is a Software Development Engineer at Amazon AI. He currently focuses on machine learning Kubernetes open-source projects such as Kubeflow and the AWS SageMaker Controller for k8s. In his spare time, he likes playing PC games and fiddling with VR using the Unity engine.

Rahul Kharse is a Software Development Engineer at Amazon Web Services. His work focuses on integrating AWS services with open source containerized ML Ops platforms to improve their scalability, reliability, and security. In addition to focusing on customer requests for features, Rahul also enjoys experimenting with the latest technological developments in the field.


Malware detection and classification with Amazon Rekognition


According to an article by Cybersecurity Ventures, the damage caused by ransomware (a type of malware that can block users from accessing their data unless they pay a ransom) increased 57-fold in 2021 compared to 2015. Furthermore, it's predicted to cost its victims $265 billion (USD) annually by 2031. At the time of writing, the financial toll of ransomware attacks would place just above the 50th position in a list of countries ranked by their GDP.

Given the threat posed by malware, several techniques have been developed to detect and contain malware attacks. The two most common techniques used today are signature- and behavior-based detection.

Signature-based detection establishes a unique identifier for a known malicious object so that the object can be identified in the future. It may be a unique pattern of code attached to a file, or it may be the hash of a known malware code. If a known pattern identifier (signature) is discovered while scanning new objects, then the object is flagged as malicious. Signature-based detection is fast and requires low compute power. However, it struggles against polymorphic malware types, which continuously change their form to evade detection.

Behavior-based detection judges suspicious objects based on their behavior. Artifacts that may be considered by anti-malware products are process interactions, DNS queries, and network connections from the object. This technique performs better at detecting polymorphic malware than signature-based detection, but it does have some downsides. To assess whether an object is malicious, it must run on the host and generate enough artifacts for the anti-malware product to detect it. This blind spot can let the malware infect the host and spread through the network.

Existing techniques are far from perfect. As a result, research continues with the aim of developing new techniques that improve our ability to combat malware. One novel technique that has emerged in recent years is image-based malware detection, which trains a deep learning network on known malware binaries converted into greyscale images. In this post, we showcase how to perform image-based malware detection with Amazon Rekognition Custom Labels.

Solution overview

To train a multi-class classification model and a malware detection model, we first prepare the training and test datasets, which contain different malware types such as flooder, adware, and spyware, as well as benign objects. We then convert the portable executable (PE) objects into greyscale images. Next, we train a model on the images with Amazon Rekognition.
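
The conversion from binary to image typically maps each byte of the PE file to one greyscale pixel and reshapes the byte stream into a two-dimensional array. The following Python sketch illustrates the general idea; the function name, fixed image width, and file paths are illustrative assumptions, not the exact code used by the solution's preprocessing jobs.

import numpy as np
from PIL import Image

def pe_to_greyscale(pe_path, image_path, width=256):
    """Map each byte of a PE binary to a greyscale pixel and save the result as an image.
    This is a generic illustration of the technique, not the solution's exact conversion code."""
    with open(pe_path, "rb") as f:
        data = np.frombuffer(f.read(), dtype=np.uint8)
    height = int(np.ceil(len(data) / width))
    # Pad the byte array so it fills the last row completely
    padded = np.zeros(width * height, dtype=np.uint8)
    padded[:len(data)] = data
    Image.fromarray(padded.reshape(height, width), mode="L").save(image_path)

# Example usage (paths are placeholders):
# pe_to_greyscale("sample.exe", "sample.png")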

Amazon Rekognition is a service that makes it simple to perform different types of visual analysis in your applications. Rekognition Image helps you build powerful applications to search, verify, and organize millions of images.

Amazon Rekognition Custom Labels builds off of Rekognition’s existing capabilities, which are already trained on tens of millions of images across many categories.

Amazon Rekognition Custom Labels is a fully managed service that lets users analyze millions of images and utilize them to solve many different machine learning (ML) problems, including image classification, face detection, and content moderation. Behind the scenes, Amazon Rekognition is based on deep learning technology. The service employs a convolutional neural network (CNN), which is pre-trained on a large labeled dataset. By being exposed to such ground truth data, the algorithm can learn to recognize patterns in images from many different domains and can be used across many industry use cases. Because AWS takes ownership of building and maintaining the model architecture and selecting an appropriate training method for the task at hand, users don't need to spend time managing the infrastructure required for training tasks.

Solution architecture

The following architecture diagram provides an overview of the solution.

Solution Architecture

The solution is built using AWS Batch, AWS Fargate, and Amazon Rekognition. AWS Batch lets you run hundreds of batch computing jobs on Fargate. Fargate is compatible with both Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Rekognition Custom Labels lets you use AutoML for computer vision to train custom models to detect malware and classify various malware categories. AWS Step Functions is used to orchestrate data preprocessing.

For this solution, we create the preprocessing resources via AWS CloudFormation. The CloudFormation stack template and the source code for the AWS Batch, Fargate, and Step Functions resources are available in a GitHub repository.

Dataset

To train the model in this example, we used the following public datasets to extract the malicious and benign portable executable (PE) objects:

  • The Sophos/ReversingLabs 20 million (SoREL-20M) dataset
  • The PE Malware Machine Learning Dataset

We encourage you to read carefully through the datasets documentation (Sophos/Reversing Labs README, PE Malware Machine Learning Dataset) to safely handle the malware objects. Based on your preference, you can also use other datasets as long as they provide malware and benign objects in the binary format.

Next, we’ll walk you through the following steps of the solution:

  • Preprocess objects and convert to images
  • Deploy preprocessing resources with CloudFormation
  • Choose the model
  • Train the model
  • Evaluate the model
  • Cost and performance

Preprocess objects and convert to images

We use Step Functions to orchestrate the object preprocessing workflow which includes the following steps:

  1. Take the meta.db SQLite database from the sorel-20m S3 bucket and convert it to a .csv file. This helps us load the .csv file in a Fargate container and refer to the metadata while processing the malware objects.
  2. Take the objects from the sorel-20m S3 bucket and create a list of objects in CSV format. By performing this step, we're creating a series of .csv files which can be processed in parallel, thereby reducing the time taken for the preprocessing.
  3. Convert the objects from the sorel-20m S3 bucket into images with an array of jobs (see the sketch after this list). AWS Batch array jobs share common parameters for converting the malware objects into images. They run as a collection of image conversion jobs that are distributed across multiple hosts and run concurrently.
  4. Pick a predetermined number of images for the model training with an array of jobs corresponding to the categories of malware.
  5. Similar to Step 2, we take the benign objects from the benign-160k S3 bucket and create a list of objects in csv format.
  6. Similar to Step 3, we convert the objects from the benign-160k S3 bucket into images with an array of jobs.
  7. Due to the Amazon Rekognition default quota for custom labels training (250K images), pick a predetermined number of benign images for the model training.
  8. As shown in the following image, the images are stored in an S3 bucket partitioned first by malware and benign folders, and then subsequently the malware is partitioned by malware types.
    Training S3 bucket
    Training dataset
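
As referenced in step 3, the following sketch shows how such an array job might be submitted with boto3. The job queue name, job definition name, and array size are placeholders rather than the resources created by the CloudFormation stack; each child job reads its index from the AWS_BATCH_JOB_ARRAY_INDEX environment variable and uses it to pick the .csv chunk to process.

import boto3

batch = boto3.client("batch")

# Submit an array job: every child job runs the same container command and reads
# the AWS_BATCH_JOB_ARRAY_INDEX environment variable to select its .csv chunk.
# The job queue, job definition, and array size below are placeholders.
response = batch.submit_job(
    jobName="malware-image-conversion",
    jobQueue="<fargate-job-queue>",
    jobDefinition="<image-conversion-job-definition>",
    arrayProperties={"size": 100},
)
print(response["jobId"])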

Deploy the preprocessing resources with CloudFormation

Prerequisites

The following prerequisites are required before continuing:

Resource deployment

The CloudFormation stack will create the following resources:

Parameters

  • STACK_NAME – CloudFormation stack name
  • AWS_REGION – AWS region where the solution will be deployed
  • AWS_PROFILE – Named profile that will apply to the AWS CLI command
  • ARTEFACT_S3_BUCKET – S3 bucket where the infrastructure code will be stored. (The bucket must be created in the same region where the solution lives).
  • AWS_ACCOUNT – AWS Account ID.

Use the following commands to deploy the resources

Make sure the Docker daemon is running on the machine. The deployments are done using bash scripts; in this case, we use the following command:

bash malware_detection_deployment_scripts/deploy.sh -s '<STACK_NAME>' -b 'malware-detection-<ACCOUNT_ID>-artifacts' -p <AWS_PROFILE> -r "<AWS_REGION>" -a <ACCOUNT_ID>

This builds and deploys the local artifacts that the CloudFormation template (e.g., cloudformation.yaml) is referencing.

Train the model

Since Amazon Rekognition takes care of model training for you, computer vision or highly specialized ML knowledge isn’t required. However, you will need to provide Amazon Rekognition with a bucket filled with appropriately labeled input images.

In this post, we’ll train two independent image classification models via the custom labels feature:

  1. Malware detection model (binary classification) – identify if the given object is malicious or benign
  2. Malware classification model (multi-class classification) – identify the malware family for a given malicious object

Model training walkthrough

The steps listed in the following walkthrough apply to both models. Therefore, you need to go through them twice in order to train both models.

  1. Sign in to the AWS Management Console and open the Amazon Rekognition console.
  2. In the left pane, choose Use Custom Labels. The Amazon Rekognition Custom Labels landing page is shown.
  3. From the Amazon Rekognition Custom Labels landing page, choose Get started.
  4. In the left pane, choose Projects.
  5. Choose Create Project.
  6. In Project name, enter a name for your project.
  7. Choose Create project to create your project.
  8. In the Projects page, choose the project to which you want to add a dataset. The details page for your project is displayed.
  9. Choose Create dataset. The Create dataset page is shown.
  10. In Starting configuration, choose Start with a single dataset to let Amazon Rekognition split the dataset into training and test sets. Note that you might end up with different test samples in each model training iteration, resulting in slightly different results and evaluation metrics.
  11. Choose Import images from Amazon S3 bucket.
  12. In S3 URI, enter the S3 bucket location and folder path. The same S3 bucket provided from the preprocessing step is used to create both datasets: Malware detection and Malware classification. The Malware detection dataset points to the root (i.e., s3://malware-detection-training-{account-id}-{region}/) of the S3 bucket, while the Malware classification dataset points to the malware folder (i.e., s3://malware-detection-training-{account-id}-{region}/malware) of the S3 bucket. Training data
  13. Choose Automatically attach labels to images based on the folder.
  14. Choose Create Datasets. The datasets page for your project opens.
  15. On the Train model page, choose Train model. The Amazon Resource Name (ARN) for your project should be in the Choose project edit box. If not, then enter the ARN for your project.
  16. In the Do you want to train your model? dialog box, choose Train model.
  17. After training completes, choose the model’s name. Training is finished when the model status is TRAINING_COMPLETED.
  18. In the Models section, choose the Use model tab to start using the model.

For more details, check the Amazon Rekognition Custom Labels Getting Started guide.
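
The console walkthrough is all you need, but steps 15-18 can also be scripted with the boto3 Rekognition API once the project and its dataset exist. The following is a hedged sketch; the project ARN, version name, and output bucket are placeholders.

import boto3

rekognition = boto3.client("rekognition")

# ARN of the project created in the console walkthrough above (placeholder)
project_arn = "<your-project-arn>"

# Start a training run using the dataset already attached to the project.
# The version name and output location are placeholders.
rekognition.create_project_version(
    ProjectArn=project_arn,
    VersionName="v1",
    OutputConfig={
        "S3Bucket": "<training-output-bucket>",
        "S3KeyPrefix": "malware-detection/output/",
    },
)

# Block until training finishes; this can take many hours on large datasets
waiter = rekognition.get_waiter("project_version_training_completed")
waiter.wait(ProjectArn=project_arn, VersionNames=["v1"])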

Evaluate the model

When model training is complete, you can access the evaluation metrics by choosing Check metrics on the model page. Amazon Rekognition provides the following metrics: F1 score, average precision, and overall recall, which are commonly used to evaluate the performance of classification models. These values are averaged over all labels.

In the Per label performance section, you can find the values of these metrics per label. Additionally, to get the values for True Positive, False Positive, and False Negative, choose View test results.

Malware detection model metrics

On the balanced dataset of 199,750 images with two labels (benign and malware), we received the following results:

  • F1 score – 0.980
  • Average precision – 0.980
  • Overall recall – 0.980

Malware detection model metrics

Malware classification model metrics

On the balanced dataset of 130,609 images with 11 labels (11 malware families), we received the following results:

  • F1 score – 0.921
  • Average precision – 0.938
  • Overall recall – 0.906

Malware classification model metrics

To assess whether the model is performing well, we recommend comparing its performance with other industry benchmarks that have been trained on the same (or at least a similar) dataset. Unfortunately, at the time of writing, there is no comparable body of research that solves this problem using the same technique and the same datasets. However, within the data science community, a model with an F1 score above 0.9 is considered to perform very well.

Cost and performance

Due to the serverless nature of the resources, the overall cost is influenced by the amount of time that each service is used. On the other hand, performance is impacted by the amount of data being processed and the training dataset size fed to Amazon Rekognition. For our cost and performance estimate exercise, we consider the following scenario:

  • 20 million objects are cataloged and processed from the sorel dataset.
  • 160,000 objects are cataloged and processed from the PE Malware Machine Learning Dataset.
  • Approximately 240,000 objects are written to the training S3 bucket: 160,000 malware objects and 80,000 benign objects.

Based on this scenario, the average cost to preprocess and deploy the models is $510.99 USD. You are additionally charged $4 USD for every hour that you use the model. You can find the detailed cost breakdown in the estimate generated via the AWS Pricing Calculator.
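
For reference, "using the model" means starting it and calling the custom labels inference API, both of which you can do with boto3. The following is a hedged sketch with placeholder ARNs, bucket, and image key; remember to stop the model when you're done so the hourly charge ends.

import boto3

rekognition = boto3.client("rekognition")

project_arn = "<your-project-arn>"          # placeholder
model_arn = "<your-project-version-arn>"    # placeholder, copied from the Use model tab

# Start the model; you are billed per inference unit-hour while it runs
rekognition.start_project_version(ProjectVersionArn=model_arn, MinInferenceUnits=1)
rekognition.get_waiter("project_version_running").wait(
    ProjectArn=project_arn, VersionNames=["<version-name>"]
)

# Classify one of the converted greyscale images stored in S3
result = rekognition.detect_custom_labels(
    ProjectVersionArn=model_arn,
    Image={"S3Object": {"Bucket": "<image-bucket>", "Name": "images/sample.png"}},
    MinConfidence=50,
)
for label in result["CustomLabels"]:
    print(label["Name"], label["Confidence"])

# Stop the model when done to avoid ongoing inference charges
rekognition.stop_project_version(ProjectVersionArn=model_arn)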

Performance-wise, these are the results from our measurement:

  • ~2 h for the preprocessing flow to complete
  • ~40 h for the malware detection model training to complete
  • ~40 h for the malware classification model training to complete

Clean-up

To avoid incurring future charges, stop and delete the Amazon Rekognition models, and delete the preprocessing resources via the destroy.sh script. The following parameters are required to run the script successfully:

  • STACK_NAME – The CloudFormation stack name
  • AWS_REGION – The Region where the solution is deployed
  • AWS_PROFILE – The named profile that applies to the AWS CLI command

Use the following command to run the ./malware_detection_deployment_scripts/destroy.sh script:

bash malware_detection_deployment_scripts/destroy.sh -s <STACK_NAME> -p <AWS_PROFILE> -r <AWS_REGION>

Conclusion

In this post, we demonstrated how to perform malware detection and classification using Amazon Rekognition. The solution follows a serverless pattern, leveraging managed services for data preprocessing, orchestration, and model deployment. We hope that this post helps you in your ongoing efforts to combat malware.

In a future post we’ll show a practical use case of malware detection by consuming the models deployed in this post.


About the authors

Edvin Hallvaxhiu is a Senior Global Security Architect with AWS Professional Services and is passionate about cybersecurity and automation. He helps customers build secure and compliant solutions in the cloud. Outside work, he likes traveling and sports.

Rahul Shaurya is a Principal Data Architect with AWS Professional Services. He helps and works closely with customers building data platforms and analytical applications on AWS. Outside of work, Rahul loves taking long walks with his dog Barney.

Bruno Dhefto is a Global Security Architect with AWS Professional Services. He is focused on helping customers build secure and reliable architectures in AWS. Outside of work, he is interested in the latest technology updates and traveling.

Nadim Majed is a data architect within AWS Professional Services. He works side by side with customers building their data platforms on AWS. Outside work, Nadim plays table tennis and loves watching football/soccer.


Get more control of your Amazon SageMaker Data Wrangler workloads with parameterized datasets and scheduled jobs


Data is transforming every field and every business. However, with data growing faster than most companies can keep track of, collecting data and getting value out of that data is a challenging thing to do. A modern data strategy can help you create better business outcomes with data. AWS provides the most complete set of services for the end-to-end data journey to help you unlock value from your data and turn it into insight.

Data scientists can spend up to 80% of their time preparing data for machine learning (ML) projects. This preparation process is largely undifferentiated and tedious work, and can involve multiple programming APIs and custom libraries. Amazon SageMaker Data Wrangler helps data scientists and data engineers simplify and accelerate tabular and time series data preparation and feature engineering through a visual interface. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, or even third-party solutions like Snowflake or Databricks, and process your data with over 300 built-in data transformations and a library of code snippets, so you can quickly normalize, transform, and combine features without writing any code. You can also bring your custom transformations in PySpark, SQL, or Pandas.

This post demonstrates how you can schedule your data preparation jobs to run automatically. We also explore the new Data Wrangler capability of parameterized datasets, which allows you to specify the files to be included in a data flow by means of parameterized URIs.

Solution overview

Data Wrangler now supports importing data using a parameterized URI. This allows for further flexibility because you can now import all datasets matching the parameters specified in the URI, which can be of type String, Number, Datetime, or Pattern. Additionally, you can now trigger your Data Wrangler transformation jobs on a schedule.

In this post, we create a sample flow with the Titanic dataset to show how you can start experimenting with these two new Data Wrangler features. To download the dataset, refer to Titanic – Machine Learning from Disaster.

Prerequisites

To get all the features described in this post, you need to be running the latest kernel version of Data Wrangler. For more information, refer to Update Data Wrangler. Additionally, you need to be running Amazon SageMaker Studio JupyterLab 3. To view the current version and update it, refer to JupyterLab Versioning.

File structure

For this demonstration, we follow a simple file structure that you must replicate in order to reproduce the steps outlined in this post.

  1. In Studio, create a new notebook.
  2. Run the following code snippet to create the folder structure that we use (make sure you’re in the desired folder in your file tree):
    !mkdir titanic_dataset
    !mkdir titanic_dataset/datetime_data
    !mkdir titanic_dataset/datetime_data/2021
    !mkdir titanic_dataset/datetime_data/2022
    
    !mkdir titanic_dataset/datetime_data/2021/01 titanic_dataset/datetime_data/2021/02 titanic_dataset/datetime_data/2021/03 
    !mkdir titanic_dataset/datetime_data/2021/04 titanic_dataset/datetime_data/2021/05 titanic_dataset/datetime_data/2021/06
    !mkdir titanic_dataset/datetime_data/2022/01 titanic_dataset/datetime_data/2022/02 titanic_dataset/datetime_data/2022/03 
    !mkdir titanic_dataset/datetime_data/2022/04 titanic_dataset/datetime_data/2022/05 titanic_dataset/datetime_data/2022/06
    
    !mkdir titanic_dataset/datetime_data/2021/01/01 titanic_dataset/datetime_data/2021/02/01 titanic_dataset/datetime_data/2021/03/01 
    !mkdir titanic_dataset/datetime_data/2021/04/01 titanic_dataset/datetime_data/2021/05/01 titanic_dataset/datetime_data/2021/06/01
    !mkdir titanic_dataset/datetime_data/2022/01/01 titanic_dataset/datetime_data/2022/02/01 titanic_dataset/datetime_data/2022/03/01 
    !mkdir titanic_dataset/datetime_data/2022/04/01 titanic_dataset/datetime_data/2022/05/01 titanic_dataset/datetime_data/2022/06/01
    
    !mkdir titanic_dataset/train_1 titanic_dataset/train_2 titanic_dataset/train_3 titanic_dataset/train_4 titanic_dataset/train_5
    !mkdir titanic_dataset/train titanic_dataset/test

  3. Copy the train.csv and test.csv files from the original Titanic dataset to the folders titanic_dataset/train and titanic_dataset/test, respectively.
  4. Run the following code snippet to populate the folders with the necessary files:
    import os
    import math
    import pandas as pd
    batch_size = 100
    
    #Get a list of all the leaf nodes in the folder structure
    leaf_nodes = []
    
    for root, dirs, files in os.walk('titanic_dataset'):
        if not dirs:
            if root != "titanic_dataset/test" and root != "titanic_dataset/train":
                leaf_nodes.append(root)
                
    titanic_df = pd.read_csv('titanic_dataset/train/train.csv')
    
    #Create the mini batch files
    for i in range(math.ceil(titanic_df.shape[0]/batch_size)):
        batch_df = titanic_df[i*batch_size:(i+1)*batch_size]
        
        #Place a copy of each mini batch in each one of the leaf folders
        for node in leaf_nodes:
            batch_df.to_csv(node+'/part_{}.csv'.format(i), index=False)

We split the train.csv file of the Titanic dataset into nine different files, named part_x, where x is the number of the part. Part 0 has the first 100 records, part 1 the next 100, and so on until part 8. Every node folder of the file tree contains a copy of the nine parts of the training data except for the train and test folders, which contain train.csv and test.csv.

Parameterized datasets

Data Wrangler users can now specify parameters for the datasets imported from Amazon S3. Dataset parameters are specified in the resource's URI, and their values can be changed dynamically, allowing for more flexibility in selecting the files that we want to import. Parameters can be of four data types:

  • Number – Can take the value of any integer
  • String – Can take the value of any text string
  • Pattern – Can take the value of any regular expression
  • Datetime – Can take the value of any of the supported date/time formats

In this section, we provide a walkthrough of this new feature. This is available only after you import your dataset to your current flow and only for datasets imported from Amazon S3.

  1. From your data flow, choose the plus (+) sign next to the import step and choose Edit dataset.
  2. The preferred (and easiest) method of creating new parameters is by highlighting a section of your URI and choosing Create custom parameter on the drop-down menu. You need to specify four things for each parameter you want to create:
    1. Name
    2. Type
    3. Default value
    4. Description


    Here we have created a String type parameter called filename_param with a default value of train.csv. Now you can see the parameter name enclosed in double brackets, replacing the portion of the URI that we previously highlighted. Because the defined value for this parameter was train.csv, we now see the file train.csv listed on the import table.

  3. When we try to create a transformation job, on the Configure job step, we now see a Parameters section, where we can see a list of all of our defined parameters.
  4. Choosing the parameter gives us the option to change the parameter’s value, in this case, changing the input dataset to be transformed according to the defined flow.
    Assuming we change the value of filename_param from train.csv to part_0.csv, the transformation job now takes part_0.csv (provided that a file with the name part_0.csv exists under the same folder) as its new input data.
  5. Additionally, if you attempt to export your flow to an Amazon S3 destination (via a Jupyter notebook), you now see a new cell containing the parameters that you defined.
    Note that the parameters take their default values, but you can change them by replacing their values in the parameter_overrides dictionary (while leaving the keys of the dictionary unchanged).

    Additionally, you can create new parameters from the Parameters UI.
  6. Open it up by choosing the parameters icon ({{}}) located next to the Go option; both of them are located next to the URI path value.
    A table opens with all the parameters that currently exist on your flow file (filename_param at this point).
  7. You can create new parameters for your flow by choosing Create Parameter.

    A pop-up window opens to let you create a new custom parameter.
  8. Here, we have created a new example_parameter as Number type with a default value of 0. This newly created parameter is now listed in the Parameters table. Hovering over the parameter displays the options Edit, Delete, and Insert.
  9. From within the Parameters UI, you can insert one of your parameters to the URI by selecting the desired parameter and choosing Insert.
    This adds the parameter to the end of your URI. You need to move it to the desired section within your URI.
  10. Change the parameter's default value, apply the change (from the modal), choose Go, and choose the refresh icon to update the preview list using the selected dataset based on the newly defined parameter's value. Let's now explore other parameter types. Assume we now have a dataset split into multiple parts, where each file has a part number.
  11. If we want to dynamically change the file number, we can define a Number parameter as shown in the following screenshot. Note that the selected file is the one that matches the number specified in the parameter.
    Now let’s demonstrate how to use a Pattern parameter. Suppose we want to import all the part_1.csv files in all of the folders under the titanic-dataset/ folder. Pattern parameters can take any valid regular expression; there are some regex patterns shown as examples.
  12. Create a Pattern parameter called any_pattern to match any folder or file under the titanic-dataset/ folder with default value .*. Notice that the wildcard is not a single * (asterisk) but also has a dot.
  13. Highlight the titanic-dataset/ part of the path and create a custom parameter. This time we choose the Pattern type. This pattern selects all the files called part_1.csv from any of the folders under titanic-dataset/.
    A parameter can be used more than once in a path. In the following example, we use our newly created parameter any_pattern twice in our URI to match any of the part files in any of the folders under titanic-dataset/.
    Finally, let’s create a Datetime parameter. Datetime parameters are useful when we’re dealing with paths that are partitioned by date and time, like those generated by Amazon Kinesis Data Firehose (see Dynamic Partitioning in Kinesis Data Firehose). For this demonstration, we use the data under the datetime-data folder.
  14. Select the portion of your path that is a date/time and create a custom parameter. Choose the Datetime parameter type.
    When choosing the Datetime data type, you need to fill in more details.
  15. First of all, you must provide a date format. You can choose any of the predefined date/time formats or create a custom one.
    For the predefined date/time formats, the legend provides an example of a date matching the selected format. For this demonstration, we choose the format yyyy/MM/dd.
  16. Next, specify a time zone for the date/time values.
    For example, the current date may be January 1, 2022, in one time zone, but may be January 2, 2022, in another time zone.
  17. Finally, you can select the time range, which lets you select the range of files that you want to include in your data flow.
    You can specify your time range in hours, days, weeks, months, or years. For this example, we want to get all the files from the last year.
  18. Provide a description of the parameter and choose Create.
    If you're using multiple datasets with different time zones, the time is not converted automatically; you need to preprocess each file or source to convert it to one time zone. The selected files are all the files under the folders corresponding to last year's data.
  19. Now if we create a data transformation job, we can see a list of all of our defined parameters, and we can override their default values so that our transformation jobs pick the specified files.

Schedule processing jobs

You can now schedule processing jobs to automate running the data transformation jobs and exporting your transformed data to either Amazon S3 or Amazon SageMaker Feature Store. You can schedule the jobs with the time and periodicity that suits your needs.

Scheduled processing jobs use Amazon EventBridge rules to schedule the job’s run. Therefore, as a prerequisite, you have to make sure that the AWS Identity and Access Management (IAM) role being used by Data Wrangler, namely the Amazon SageMaker execution role of the Studio instance, has permissions to create EventBridge rules.

Configure IAM

Proceed with the following updates on the IAM SageMaker execution role corresponding to the Studio instance where the Data Wrangler flow is running:

  1. Attach the AmazonEventBridgeFullAccess managed policy.
  2. Attach a policy to grant permission to create a processing job:
    {
    	"Version": "2012-10-17",
    	"Statement": [
    		{
    			"Effect": "Allow",
    			"Action": "sagemaker:StartPipelineExecution",
    			"Resource": "arn:aws:sagemaker:Region:AWS-account-id:pipeline/data-wrangler-*"
    		}
    	]
    }

  3. Grant EventBridge permission to assume the role by adding the following trust policy:
    {
    	"Effect": "Allow",
    	"Principal": {
    		"Service": "events.amazonaws.com"
    	},
    	"Action": "sts:AssumeRole"
    }

Alternatively, if you’re using a different role to run the processing job, apply the policies outlined in steps 2 and 3 to that role. For details about the IAM configuration, refer to Create a Schedule to Automatically Process New Data.

Create a schedule

To create a schedule, have your flow opened in the Data Wrangler flow editor.

  1. On the Data Flow tab, choose Create job.
  2. Configure the required fields and choose Next, 2. Configure job.
  3. Expand Associate Schedules.
  4. Choose Create new schedule.

    The Create new schedule dialog opens, where you define the details of the processing job schedule.
    The dialog offers great flexibility to help you define the schedule. You can have, for example, the processing job running at a specific time or every X hours, on specific days of the week.
    The periodicity can be granular to the level of minutes.
  5. Define the schedule name and periodicity, then choose Create to save the schedule.
  6. You have the option to start the processing job right away along with the scheduling, which takes care of future runs, or leave the job to run only according to the schedule.
  7. You can also define an additional schedule for the same processing job.
  8. To finish the schedule for the processing job, choose Create.
    You see a “Job scheduled successfully” message. Additionally, if you chose to leave the job to run only according to the schedule, you see a link to the EventBridge rule that you just created.

If you choose the schedule link, a new tab in the browser opens, showing the EventBridge rule. On this page, you can make further modifications to the rule and track its invocation history. To stop your scheduled processing job from running, delete the event rule that contains the schedule name.

The EventBridge rule shows a SageMaker pipeline as its target, which is triggered according to the defined schedule, and the processing job invoked as part of the pipeline.

To track the runs of the SageMaker pipeline, you can go back to Studio, choose the SageMaker resources icon, choose Pipelines, and choose the pipeline name you want to track. You can now see a table with all current and past runs and status of that pipeline.
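
If you prefer to track runs outside Studio, the same information is available through the SageMaker API; a minimal sketch follows. The pipeline name is a placeholder; the IAM policy shown earlier suggests scheduled Data Wrangler pipelines use a data-wrangler- prefix.

import boto3

sm = boto3.client("sagemaker")

# List the runs of the pipeline behind the scheduled Data Wrangler job
executions = sm.list_pipeline_executions(PipelineName="data-wrangler-<your-pipeline-name>")
for execution in executions["PipelineExecutionSummaries"]:
    print(execution["StartTime"], execution["PipelineExecutionStatus"])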

You can see more details by double-clicking a specific entry.

Clean up

When you’re not using Data Wrangler, it’s recommended to shut down the instance on which it runs to avoid incurring additional fees.

To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow. Data Wrangler automatically saves your data flow every 60 seconds.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use parameters to import your datasets using Data Wrangler flows and create data transformation jobs on them. Parameterized datasets allow for more flexibility on the datasets you use and allow you to reuse your flows. We also demonstrated how you can set up scheduled jobs to automate your data transformations and exports to either Amazon S3 or Feature Store, at the time and periodicity that suits your needs, directly from within Data Wrangler’s user interface.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.


About the authors

David Laredo is a Prototyping Architect for the Prototyping and Cloud Engineering team at Amazon Web Services, where he has helped develop multiple machine learning prototypes for AWS customers. He has been working in machine learning for the last 6 years, training and fine-tuning ML models and implementing end-to-end pipelines to productionize those models. His areas of interest are NLP, ML applications, and end-to-end ML.

Givanildo Alves is a Prototyping Architect with the Prototyping and Cloud Engineering team at Amazon Web Services, helping clients innovate and accelerate by showing the art of possible on AWS, having already implemented several prototypes around artificial intelligence. He has a long career in software engineering and previously worked as a Software Development Engineer at Amazon.com.br.

Adrian Fuentes is a Program Manager with the Prototyping and Cloud Engineering team at Amazon Web Services, innovating for customers in machine learning, IoT, and blockchain. He has over 15 years of experience managing and implementing projects and 1 year of tenure on AWS.


Detect multicollinearity, target leakage, and feature correlation with Amazon SageMaker Data Wrangler


In machine learning (ML), data quality has a direct impact on model quality. This is why data scientists and data engineers spend a significant amount of time perfecting training datasets. Nevertheless, no dataset is perfect: there are trade-offs to preprocessing techniques such as oversampling, normalization, and imputation. Also, mistakes and errors can creep in at various stages of the data analytics pipeline.

In this post, you will learn how to use built-in analysis types in Amazon SageMaker Data Wrangler to help you detect the three most common data quality issues: multicollinearity, target leakage, and feature correlation.

Data Wrangler is a feature of Amazon SageMaker Studio which provides an end-to-end solution for importing, preparing, transforming, featurizing, and analyzing data. The transformation recipes created by Data Wrangler can integrate easily into your ML workflows and help streamline data preprocessing as well as feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize the recipes.

Solution overview

To demonstrate Data Wrangler's functionality in this post, we use the popular Titanic dataset. The dataset describes the survival status of individual passengers on the Titanic and has 14 columns, including the target column. These features include pclass, name, survived, age, embarked, home.dest, room, ticket, boat, and sex. The column pclass refers to passenger class (1st, 2nd, 3rd) and is a proxy for socio-economic class. The column survived is the target column.

Prerequisites

To use Data Wrangler, you need an active Studio instance. To learn how to launch a new instance, see Onboard to Amazon SageMaker Domain.

Before you get started, download the Titanic dataset to an Amazon Simple Storage Service (Amazon S3) bucket.

Create a data flow

To access Data Wrangler in Studio, complete the following steps:

  1. Next to the user you want to use to launch Studio, choose Open Studio.
  2. When Studio opens, choose the plus sign on the New data flow card under ML tasks and components.

This creates a new directory in Studio with a .flow file inside, which contains your data flow. The .flow file automatically opens in Studio.

You can also create a new flow by choosing File, then New, and choosing Data Wrangler Flow.

  1. Optionally, rename the new directory and the .flow file.

When you create a new .flow file in Studio, you might see a carousel that introduces you to Data Wrangler. This may take a few minutes.

When the Data Wrangler instance is active, you can see the data flow screen as shown in the following screenshot.

  1. Choose Use sample dataset to load the titanic dataset.

Create a Quick Model analysis

There are two ways to get a sense of a new (previously unseen) dataset. One is to run the Data Quality and Insights Report. This report provides high-level statistics (number of features, number of rows, missing values, and so on) and surfaces high-priority warnings, if present, such as duplicate rows, target leakage, and anomalous samples.

Another way is to run Quick Model analysis directly. Complete the following steps:

  1. Choose the plus sign and choose Add analysis.

  1. For Analysis type, choose Quick Model.
  2. For Analysis name, enter a name.
  3. For Label, choose the target label from the list of your feature columns (Survived).
  4. Choose Save.

The following graph visualizes our findings.

Quick Model trains a random forest with 10 trees on 730 observations and measures prediction quality on the remaining 315 observations. The dataset is automatically sampled and split into training and validation sets (70:30). In this example, you can see that the model achieved an F1 score of 0.777 on the test set. This could be an indicator that the data you're exploring has the potential of being predictive.

At the same time, a few things stand out right away. The columns name and boat are the highest contributing signals towards your prediction. String columns like name can be both useful and not useful depending on the comprehensive information they carry about the person, like first, middle, and last names alongside the historical time periods and trends they belong to. This column can either be excluded or retained depending on the outcome of the contribution. In this case, a simple preview reveals that passenger names also include their titles (Mr, Dr, etc) which could potentially carry valuable information; therefore, we’re going to keep it. However, we do want to take a closer look at the boat column, which also seems to have a strong predictive power.
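
Before moving on, here is a rough outside-of-Data-Wrangler approximation of what the Quick Model analysis just did: train a small random forest on a 70:30 split and report F1 on the held-out set. This is only a sketch under simplifying assumptions (numeric columns only, placeholder file path); Data Wrangler handles feature encoding internally, so the exact score will differ.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # path is a placeholder

# Keep things simple: numeric columns only, missing values filled with 0
X = df.drop(columns=["survived"]).select_dtypes(include="number").fillna(0)
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
print("F1 score:", f1_score(y_test, model.predict(X_test)))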

Target leakage

First, let's start with the concept of leakage. Leakage can occur during different stages of the ML lifecycle. Target leakage happens when the training data contains features that won't be available at inference time, typically because their values are only known after the target event has occurred. For example, a deployed airbag is not a good predictor for a car crash, because in real life the airbag deploys after the fact.

One technique for identifying target leakage relies on computing an ROC value for each feature. The closer the value is to 1, the more likely the feature is highly predictive of the target on its own, and therefore the more likely it's a leaked target. Conversely, values at or below 0.5 (which is rare) suggest the feature contributes little towards the prediction by itself. Finally, values above 0.5 and below 1 indicate that the feature doesn't carry strong predictive power on its own but may contribute as part of a group, which is what we'd ideally like to see.
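
To build intuition for this analysis, the following sketch scores each column individually with a small classifier and reports its held-out ROC AUC. It is a simplified illustration, not Data Wrangler's exact implementation; the file path is a placeholder, and categorical columns are crudely integer-encoded.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # path is a placeholder
y = df["survived"]

for column in df.drop(columns=["survived"]).columns:
    # Encode the single feature numerically (categoricals become integer codes)
    x = df[column].astype("category").cat.codes.to_frame()
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    print(f"{column}: ROC AUC = {auc:.3f}")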

Let’s create a target leakage analysis on your dataset. This analysis together with a set of advanced analyses are offered as built-in analysis types in Data Wrangler. To create the analysis, choose Add Analysis and choose Target Leakage. This is similar to how you previously created a Quick Model analysis.

As you can see in the following figure, your most predictive feature boat is quite close in ROC value to 1, which makes it a possible suspect for target leakage.

If you read the description of the dataset, the boat column contains the lifeboat number in which the passenger managed to escape. Naturally, there is quite a close correlation with the survival label. The lifeboat number is known only after the fact—when the lifeboat was picked up and the survivors on it were identified. This is very similar to the airbag example. Therefore, the boat column is indeed a target leakage.

You can eliminate it from your dataset by applying the drop column transform in the Data Wrangler UI (choose Handle Columns, choose Drop, and indicate boat). Now if you rerun the analysis, you get the following.

Multicollinearity

Multicollinearity occurs when two or more features in a dataset are highly correlated with one another. Detecting the presence of multicollinearity in a dataset is important because multicollinearity can reduce predictive capabilities of an ML model. Multicollinearity can either already be present in raw data received from an upstream system, or it can be inadvertently introduced during feature engineering. For instance, the Titanic dataset contains two columns indicating the number of family members each passenger traveled with: number of siblings (sibsp) and number of parents (parch). Let’s say that somewhere in your feature engineering pipeline, you decided that it would make sense to introduce a simpler measure of each passenger’s family size by combining the two.

A very simple transformation step can help us achieve that, as shown in the following screenshot.
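
For reference, the same step can be expressed as a Data Wrangler custom transform using the Python (Pandas) option, where the current dataset is exposed as a DataFrame named df; a minimal sketch (column names assume the lowercase Titanic schema used in this post):

# Combine the sibling/spouse and parent/child counts into a single family size feature
df["family_size"] = df["sibsp"] + df["parch"]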

As a result, you now have a column called family_size, which reflects just that. If you don't drop the original two columns, you now have a very strong correlation between the sibsp and parch columns and the new family_size column. By creating another analysis and choosing Multicollinearity, you can now see the following.

In this case, you’re using the Variance Inflation Factor (VIF) approach to identify highly correlated features. VIF scores are calculated by solving a regression problem to predict one variable given the rest, and they can range between 1 and infinity. The higher the value is, the more dependent a feature is. Data Wrangler’s implementation of VIF analysis caps the scores at 50 and in general, a score of 5 means the feature is moderately correlated, whereas anything above 5 is considered highly correlated.

Your newly engineered feature is highly dependent on the original columns, which you can now simply drop by using another transformation by choosing Manage Columns, Drop Column.

An alternative approach to identify features that have less or more predictive power is to use the Lasso feature selection type of the multicollinearity analysis (for Problem type, choose Classification and for Label column, choose survived).

As outlined in the description, this analysis builds a linear classifier that provides a coefficient for each feature. The absolute value of this coefficient can also be interpreted as the importance score for the feature. As you can observe in your case, family_size carries no value in terms of feature importance due to its redundancy, unless you drop the original columns.

After dropping sibsp and parch, you get the following.

Data Wrangler also provides a third option to detect multicollinearity in your dataset facilitated via Principal Component Analysis (PCA). PCA measures the variance of the data along different directions in the feature space. The ordered list of variances, also known as the singular values, can inform about multicollinearity in your data. This list contains non-negative numbers. When the numbers are roughly uniform, the data has very little multicollinearity. However, when the opposite is true, the magnitude of the top values dominates the rest. To avoid issues related to different scales, the individual features are standardized to have mean 0 and standard deviation 1 before applying PCA.
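
The same idea can be explored outside Data Wrangler by standardizing the numeric features and inspecting the singular values reported by scikit-learn's PCA; a minimal sketch (the file path is a placeholder, and the column list assumes the Titanic schema used in this post):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("titanic.csv")  # path is a placeholder
df["family_size"] = df["sibsp"] + df["parch"]

numeric = df[["age", "fare", "sibsp", "parch", "family_size"]].fillna(0)
scaled = StandardScaler().fit_transform(numeric)

pca = PCA().fit(scaled)
# A sharp drop-off in the singular values hints at multicollinearity
print(pca.singular_values_)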

Before dropping the original columns (sibsp and parch), your PCA analysis is shown as follows.

After dropping sibsp and parch, you have the following.

Feature correlation

Correlation is a measure of the degree of dependence between variables. Correlated features in general don’t improve models but can have an impact on models. There are two types of correlation detection features available in Data Wrangler: linear and non-linear.

Linear feature correlation is based on Pearson’s correlation. Numeric-to-numeric correlation is in the range [-1, 1] where 0 implies no correlation, 1 implies perfect correlation, and -1 implies perfect inverse correlation. Numeric-to-categorical and categorical-to-categorical correlations are in the range [0, 1] where 0 implies no correlation and 1 implies perfect correlation. Features that are not either numeric or categorical are ignored.
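
Outside Data Wrangler, the numeric-to-numeric part of this analysis corresponds to a plain Pearson correlation matrix, which pandas computes directly; a minimal sketch with a placeholder file path:

import pandas as pd

df = pd.read_csv("titanic.csv")  # path is a placeholder
# Pearson correlation between the numeric columns
print(df.select_dtypes(include="number").corr(method="pearson"))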

The following correlation matrix and score table validate and reinforce your previous findings.

The columns survived and boat are highly correlated with each other. For this example, survived is the target column, or the label you're trying to predict. You saw this previously in your target leakage analysis. On the other hand, the columns sibsp and parch are highly correlated with the derived feature family_size. This was confirmed in your previous multicollinearity analysis. We don't see any strong inverse linear correlation in the dataset.

When two variables change in a constant proportion, it's called a linear correlation, whereas when the two variables don't change in any constant proportion, the relationship is non-linear. Correlation is perfectly positive when the proportional change in the two variables is in the same direction. In contrast, correlation is perfectly negative when the proportional change in the two variables is in the opposite direction.

The difference between feature correlation and multicollinearity (discussed previously) is as follows: feature correlation refers to the linear or non-linear relationship between two variables. With this context, you can define collinearity as a problem where two or more independent variables (predictors) have a strong linear or non-linear relationship. Multicollinearity is a special case of collinearity where a strong linear relationship exists between three or more independent variables even if no pair of variables has a high correlation.

Non-linear feature correlation is based on Spearman’s rank correlation. Numeric-to-categorical correlation is calculated by encoding the categorical features as the floating-point numbers that best predict the numeric feature before calculating Spearman’s rank correlation. Categorical-to-categorical correlation is based on the normalized Cramer’s V test.

Numeric-to-numeric correlation is in the range [-1, 1] where 0 implies no correlation, 1 implies perfect correlation, and -1 implies perfect inverse correlation. Numeric-to-categorical and categorical-to-categorical correlations are in the range [0, 1] where 0 implies no correlation and 1 implies perfect correlation. Features that aren’t numeric or categorical are ignored.

The following table lists, for each feature, the feature most correlated to it. It displays a correlation matrix for a dataset with up to 20 columns.

The results are very similar to what you saw in the previous linear correlation analysis, except you can also see a strong negative non-linear correlation between the pclass and fare numeric columns.

Finally, now that you have identified potential target leakage and eliminated features based on your analyses, let’s rerun the Quick Model analysis to look at the feature importance breakdown again.

The results look quite different from what you started with initially. As you can see, Data Wrangler makes it easy to run advanced ML-specific analyses with a few clicks and derive insights about the relationships among your independent variables (features) as well as with the target variable. It also provides the Quick Model analysis type, which lets you validate the current state of the features by training a quick model and testing how predictive the model is.

Ideally, as a data scientist, you should start with some of the analyses showcased in this post and derive insights into what features are good to retain vs. what to drop.

Summary

In this post, you learned how to use Data Wrangler for exploratory data analysis, focusing on target leakage, feature correlation, and multicollinearity analyses to identify potential issues with training data and mitigate them with the help of built-in transformations. As next steps, we recommend you replicate the example in this post in your Data Wrangler data flow to experience what was discussed here in action.

If you’re new to Data Wrangler or Studio, refer to Get Started with Data Wrangler. If you have any questions related to this post, please add it in the comments section.


About the authors

Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.

Arunprasath Shankar is a Sr. AI/ML Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.


New Amazon HealthLake capabilities enable next-generation imaging solutions and precision health analytics


At AWS, we have been investing in healthcare since Day 1 with customers including Moderna, Rush University Medical Center, and the NHS who have built breakthrough innovations in the cloud. From developing public health analytics hubs, to improving health equity and patient outcomes, to developing a COVID-19 vaccine in just 65 days, our customers are utilizing machine learning (ML) and the cloud to address some of healthcare’s biggest challenges and drive change toward more predictive and personalized care.

Last year, we launched Amazon HealthLake, a purpose-built service to store, transform, and query health data in the cloud, allowing you to benefit from a complete view of individual or patient population health data at scale.

Today, we’re excited to announce the launch of two new capabilities in HealthLake that deliver innovations for medical imaging and analytics.

Amazon HealthLake Imaging

Healthcare professionals face a myriad of challenges as the scale and complexity of medical imaging data continues to increase including the following:

  • The volume of medical imaging data has continued to accelerate over the past decade with over 5.5 billion imaging procedures done across the globe each year by a shrinking number of radiologists
  • The average imaging study size has doubled over the past decade to 150 MB as more advanced imaging procedures are being performed due to improvements in resolution and the increasing use of volumetric imaging
  • Health systems store multiple copies of the same imaging data in clinical and research systems, which leads to increased costs and complexity
  • It can be difficult to structure this data, which often takes data scientists and researchers weeks or months to derive important insights with advanced analytics and ML

These compounding factors are slowing down decision-making, which can affect care delivery. To address these challenges, we are excited to announce the preview of Amazon HealthLake Imaging, a new HIPAA-eligible capability that makes it easy to store, access, and analyze medical images at petabyte scale. This new capability is designed for fast, sub-second medical image retrieval in your clinical workflows that you can access securely from anywhere (e.g., web, desktop, phone) and with high availability. Additionally, you can drive your existing medical viewers and analysis applications from a single encrypted copy of the same data in the cloud with normalized metadata and advanced compression. As a result, it is estimated that HealthLake Imaging helps you reduce the total cost of medical imaging storage by up to 40%.

We are proud to be working with partners on the launch of HealthLake Imaging to accelerate adoption of cloud-native solutions to help transition enterprise imaging workflows to the cloud and accelerate your pace of innovation.

Intelerad and Arterys are among the launch partners utilizing HealthLake Imaging to achieve higher scalability and viewing performance for their next-generation PACS systems and AI platform, respectively. Radical Imaging is providing customers with zero-footprint, cloud-capable medical imaging applications using open-source projects, such as OHIF or Cornerstone.js, built on HealthLake Imaging APIs. And NVIDIA has collaborated with AWS to develop a MONAI connector for HealthLake Imaging. MONAI is an open-source medical AI framework to develop and deploy models into AI applications, at scale.

“Intelerad has always focused on solving complex problems in healthcare, while enabling our customers to grow and provide exceptional patient care to more patients around the globe. In our continuous path of innovation, our collaboration with AWS, including leveraging Amazon HealthLake Imaging, allows us to innovate more quickly and reduce complexity while offering unparalleled scale and performance for our users.”

— AJ Watson, Chief Product Officer at Intelerad Medical Systems

“With Amazon HealthLake Imaging, Arterys was able to achieve noticeable improvements in performance and responsiveness of our applications, and with a rich feature set of future-looking enhancements, offers benefits and value that will enhance solutions looking to drive future-looking value out of imaging data.”

— Richard Moss, Director of Product Management at Arterys

Radboudumc and the University of Maryland Medical Intelligent Imaging Center (UM2ii) are among the customers using HealthLake Imaging to improve the availability of medical images and take advantage of image streaming.

“At Radboud University Medical Center, our mission is to be a pioneer in shaping a more person-centered, innovative future of healthcare. We are building a collaborative AI solution with Amazon HealthLake Imaging for clinicians and researchers to speed up innovation by putting ML algorithms into the hands of clinicians faster.”

— Bram van Ginneken, Chair, Diagnostic Image Analysis Group at Radboudumc

“UM2ii was formed to unite innovators, thought leaders, and scientists across academics and industry. Our work with AWS will accelerate our mission to push the boundaries of medical imaging AI. We are excited to build the next generation of cloud-based intelligent imaging with Amazon HealthLake Imaging and AWS’s experience with scalability, performance, and reliability.”

— Paul Yi, Director at UM2ii

Amazon HealthLake Analytics

The second capability we’re excited to announce is Amazon HealthLake Analytics. Harnessing multi-modal data, which is highly contextual and complex, is key to making meaningful progress in providing patients with highly personalized and precisely targeted diagnostics and treatments.

HealthLake Analytics makes it easy to query and derive insights from multi-modal health data at scale, at the individual or population levels, with the ability to share data securely across the enterprise and enable advanced analytics and ML in just a few clicks. This removes the need for you to execute complex data exports and data transformations.

HealthLake Analytics automatically normalizes raw health data from multiple disparate sources (e.g., medical records, health insurance claims, EHRs, and medical devices) into an analytics- and interoperability-ready format in a matter of minutes. Integration with other AWS services makes it easy to query the data with SQL using Amazon Athena, as well as share and analyze data to enable advanced analytics and ML. You can create powerful dashboards with Amazon QuickSight for care gap analyses and disease management of an entire patient population, or build and train ML models quickly and efficiently in Amazon SageMaker for AI-driven predictions, such as the risk of hospital readmission or the overall effectiveness of a line of treatment. HealthLake Analytics reduces what would otherwise take months of engineering effort and allows you to do what you do best: deliver care for patients.
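As an example of what this looks like once the data is queryable, the following is a minimal sketch of running a population-level SQL query with Amazon Athena through the AWS SDK for Python (Boto3). The database name, table name, column names, and S3 output location are hypothetical placeholders, not names that HealthLake Analytics necessarily creates; the Athena calls themselves are standard Boto3 operations.

```python
import boto3

# Minimal sketch: run a population-level SQL query with Amazon Athena against
# health data that HealthLake Analytics has made queryable. The database name,
# table name, column names, and S3 output location are hypothetical
# placeholders; substitute the names created for your own datastore.
athena = boto3.client("athena")

QUERY = """
SELECT gender, COUNT(*) AS patient_count
FROM patient  -- hypothetical table of FHIR Patient resources
GROUP BY gender
ORDER BY patient_count DESC
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "healthlake_analytics_db"},  # placeholder
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)
print("Started Athena query:", response["QueryExecutionId"])
```

The same normalized tables could feed an Amazon QuickSight dashboard or a SageMaker training job without any additional export or transformation work.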

Conclusion

At AWS, our goal is to support you in delivering convenient, personalized, and high-value care, helping you reinvent how you collaborate, make data-driven clinical and operational decisions, enable precision medicine, accelerate therapy development, and decrease the cost of care.

With these new capabilities in Amazon HealthLake, we, along with our partners, can help you enable next-generation imaging workflows in the cloud and derive insights from multi-modal health data, while complying with HIPAA, GDPR, and other regulations.

To learn more and get started, refer to Amazon HealthLake Analytics and Amazon HealthLake Imaging.


About the authors

Tehsin Syed is General Manager of Health AI at Amazon Web Services and leads our Health AI engineering and product development efforts, including Amazon Comprehend Medical and Amazon Health. Tehsin works with teams across Amazon Web Services responsible for engineering, science, product, and technology to develop groundbreaking healthcare and life science AI solutions and products. Prior to his work at AWS, Tehsin was Vice President of Engineering at Cerner Corporation, where he spent 23 years at the intersection of healthcare and technology.

Dr. Taha Kass-Hout is Vice President, Machine Learning, and Chief Medical Officer at Amazon Web Services, and leads our Health AI strategy and efforts, including Amazon Comprehend Medical and Amazon HealthLake. He works with teams at Amazon responsible for developing the science, technology, and scale for COVID-19 lab testing, including Amazon’s first FDA authorization for testing our associates, now offered to the public for at-home testing. A physician and bioinformatician, Taha served two terms under President Obama, including as the first Chief Health Informatics Officer at the FDA. During this time as a public servant, he pioneered the use of emerging technologies and the cloud (the CDC’s electronic disease surveillance) and established widely accessible global data sharing platforms: openFDA, which enabled researchers and the public to search and analyze adverse event data, and precisionFDA (part of the Presidential Precision Medicine Initiative).
