Access private repos using the @remote decorator for Amazon SageMaker training workloads

As more and more customers are looking to put machine learning (ML) workloads in production, there is a large push in organizations to shorten the development lifecycle of ML code. Many organizations prefer writing their ML code in a production-ready style in the form of Python methods and classes as opposed to an exploratory style (writing code without using methods or classes) because this helps them ship production-ready code faster.

With Amazon SageMaker, you can run a SageMaker training job simply by annotating your Python code with the @remote decorator. The SageMaker Python SDK automatically translates your existing workspace environment and any associated data processing code and datasets into a SageMaker training job that runs on the SageMaker training platform.

Running a Python function locally often requires several dependencies, which may not come with the local Python runtime environment. You can install them via package and dependency management tools like pip or conda.

However, organizations in regulated industries like banking, insurance, and healthcare operate in environments that have strict data privacy and networking controls in place. These controls often mandate that none of their environments have internet access. The reason for such a restriction is to have full control over egress and ingress traffic, which reduces the chances of unscrupulous actors sending or receiving non-verified information through the network. Such network isolation is often also mandated by audit and industry compliance rules. When it comes to ML, this prevents data scientists from downloading packages from public repositories like PyPI, Anaconda, or Conda-Forge.

To provide data scientists access to the tools of their choice while also respecting the restrictions of the environment, organizations often set up their own private package repository hosted in their own environment. You can set up private package repositories on AWS in multiple ways. In this post, we focus on one of them: using CodeArtifact.

Solution overview

The following architecture diagram shows the solution architecture.

Solution-Architecture-vpc-no-internet

The high-level steps to implement the solution are as follows:

  • Set up a virtual private cloud (VPC) with no internet access using an AWS CloudFormation template.
  • Use a second CloudFormation template to set up CodeArtifact as a private PyPI repository and provide connectivity to the VPC, and set up an Amazon SageMaker Studio environment to use the private PyPI repository.
  • Train a classification model based on the MNIST dataset using an @remote decorator from the open-source SageMaker Python SDK. All the dependencies will be downloaded from the private PyPI repository.

Note that using SageMaker Studio in this post is optional. You can choose to work in any integrated development environment (IDE) of your choice. You just need to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, refer to Configure the AWS CLI.

Prerequisites

You need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created as part of the solution. For details, refer to Creating an AWS account.

Set up a VPC with no internet connection

Create a new CloudFormation stack using the vpc.yaml template. This template creates the following resources:

  • A VPC with two private subnets across two Availability Zones with no internet connectivity
  • A Gateway VPC endpoint for accessing Amazon S3
  • Interface VPC endpoints for SageMaker, CodeArtifact, and a few other services to allow the resources in the VPC to connect to AWS services via AWS PrivateLink

Provide a stack name, such as No-Internet, and complete the stack creation process.

vpc-no-internet-stack

Wait for the stack creation process to complete.

Set up a private repository and SageMaker Studio using the VPC

The next step is to deploy another CloudFormation stack using the sagemaker_studio_codeartifact.yaml template. This template creates the following resources:

Provide a stack name and keep the default values or adjust the parameters for the CodeArtifact domain name, private repository name, user profile name for SageMaker Studio, and name for the upstream public PyPI repository. You also need to provide the name of the VPC stack created in the previous step.

Studio-CodeArtifact-stack

When the stack creation is complete, the SageMaker domain should be visible on the SageMaker console.

studio-domain

To verify there is no internet connection available in SageMaker Studio, launch SageMaker Studio. Choose File, New, and Terminal to launch a terminal and try to curl any internet resource. It should fail to connect, as shown in the following screenshot.

terminal-showing-no-internet

Train an image classifier using an @remote decorator with the private PyPI repository

In this section, we use the @remote decorator to run a PyTorch training job that produces a MNIST image classification model. To achieve this, we set up a configuration file, develop the training script, and run the training code.

Set up a configuration file

We set up a config.yaml file and provide the configurations needed to do the following:

  • Run a SageMaker training job in the no-internet VPC created earlier
  • Download the required packages by connecting to the private PyPI repository created earlier

The file looks like the following code:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: '../config/requirements.txt'
        InstanceType: 'ml.m5.xlarge'
        PreExecutionCommands:
            - 'aws codeartifact login --tool pip --domain <domain-name> --domain-owner <AWS account number> --repository <private repository name> --endpoint-url <VPC-endpoint-url-prefixed with https://>'
        RoleArn: '<execution role ARN for running training job>'
        S3RootUri: '<s3 bucket to store the job output>'
        VpcConfig:
            SecurityGroupIds: 
            - '<security group id used by SageMaker Studio>'
            Subnets: 
            - '<VPC subnet id 1>'
            - '<VPC subnet id 2>'

The Dependencies field contains the path to requirements.txt, which contains all the dependencies needed. Note that all the dependencies will be downloaded from the private repository. The requirements.txt file contains the following code:

torch
torchvision
sagemaker>=2.156.0,<3

The PreExecutionCommands section contains the command to connect to the private PyPI repository. To get the CodeArtifact VPC endpoint URL, use the following code:

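import boto3

# Session and EC2 client for the Region where the VPC endpoints were created
boto3_session = boto3.Session()
ec2 = boto3_session.client('ec2')

# Find the interface endpoint for the CodeArtifact API
response = ec2.describe_vpc_endpoints(
    Filters=[
        {
            'Name': 'service-name',
            'Values': [
                f'com.amazonaws.{boto3_session.region_name}.codeartifact.api'
            ]
        },
    ]
)

# Use the first DNS name of the first matching endpoint
code_artifact_api_vpc_endpoint = response['VpcEndpoints'][0]['DnsEntries'][0]['DnsName']

endpoint_url = f'https://{code_artifact_api_vpc_endpoint}'
endpoint_url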

Generally, we get two VPC endpoints for CodeArtifact, and we can use any of them in the connection commands. For more details, refer to Use CodeArtifact from a VPC.
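To make the PreExecutionCommands entry less error-prone, the login command can be assembled from its parts. The following is a minimal sketch that uses only the flags shown in the config file above. The function name and the sample values (domain, account number, repository, and endpoint DNS name) are hypothetical placeholders, not values from this solution.

```python
import shlex


def codeartifact_login_command(domain, domain_owner, repository, endpoint_url):
    """Build the `aws codeartifact login` line for PreExecutionCommands.

    Hypothetical helper: it only joins the flags shown in config.yaml.
    """
    # The config expects the endpoint URL to be prefixed with https://
    if not endpoint_url.startswith("https://"):
        endpoint_url = f"https://{endpoint_url}"
    args = [
        "aws", "codeartifact", "login",
        "--tool", "pip",
        "--domain", domain,
        "--domain-owner", domain_owner,
        "--repository", repository,
        "--endpoint-url", endpoint_url,
    ]
    # Quote each argument so unusual characters survive the shell
    return " ".join(shlex.quote(arg) for arg in args)


# Placeholder values for illustration only
command = codeartifact_login_command(
    "my-domain",
    "111122223333",
    "my-private-repo",
    "vpce-0example.api.codeartifact.us-east-1.vpce.amazonaws.com",
)
print(command)
```

The resulting string can be pasted as the single entry under PreExecutionCommands.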

Additionally, configurations like execution role, output location, and VPC configurations are provided in the config file. These configurations are needed to run the SageMaker training job. To know more about all the configurations supported, refer to Configuration file.

It’s not mandatory to use the config.yaml file in order to work with the @remote decorator. This is just a cleaner way to supply all configurations to the @remote decorator. All the configs could also be supplied directly in the decorator arguments, but that reduces readability and maintainability of changes in the long run. Also, the config file can be created by an admin and shared with all the users in an environment.
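As a rough illustration of supplying the same settings directly, the sketch below maps the RemoteFunction section of the config file onto keyword arguments of the @remote decorator. The mapping table and the helper name are our own assumptions for this example; check the argument names against the SageMaker Python SDK version you use. All values are placeholders.

```python
# Assumed mapping from config.yaml keys to @remote keyword arguments
CONFIG_TO_KWARG = {
    "Dependencies": "dependencies",
    "InstanceType": "instance_type",
    "PreExecutionCommands": "pre_execution_commands",
    "RoleArn": "role",
    "S3RootUri": "s3_root_uri",
}


def remote_kwargs(remote_function_config):
    """Translate a parsed RemoteFunction config section into decorator kwargs."""
    kwargs = {
        CONFIG_TO_KWARG[key]: value
        for key, value in remote_function_config.items()
        if key in CONFIG_TO_KWARG
    }
    # VpcConfig is nested, so flatten it into two separate arguments
    vpc_config = remote_function_config.get("VpcConfig", {})
    if "SecurityGroupIds" in vpc_config:
        kwargs["security_group_ids"] = list(vpc_config["SecurityGroupIds"])
    if "Subnets" in vpc_config:
        kwargs["subnets"] = list(vpc_config["Subnets"])
    return kwargs


# Placeholder values for illustration only
example_config = {
    "Dependencies": "../config/requirements.txt",
    "InstanceType": "ml.m5.xlarge",
    "RoleArn": "arn:aws:iam::111122223333:role/example-role",
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0example"],
        "Subnets": ["subnet-0example1", "subnet-0example2"],
    },
}
kwargs = remote_kwargs(example_config)
print(kwargs)

# With the SDK installed, the decorator could then be applied as:
#   @remote(include_local_workdir=True, **kwargs)
#   def perform_train(...): ...
```

Even in this small example, the decorator call carries seven settings, which is why keeping them in a shared config file tends to be easier to maintain.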

Develop the training script

Next, we prepare the training code in simple Python files. We have divided the code into three files:

  • load_data.py – Contains the code to download the MNIST dataset
  • model.py – Contains the code for the neural network architecture for the model
  • train.py – Contains the code for training the model by using load_data.py and model.py

In train.py, we need to decorate the main training function as follows:

from sagemaker.remote_function import remote

@remote(include_local_workdir=True)
def perform_train(train_data,
                  test_data,
                  *,
                  batch_size: int = 64,
                  test_batch_size: int = 1000,
                  epochs: int = 3,
                  lr: float = 1.0,
                  gamma: float = 0.7,
                  no_cuda: bool = True,
                  no_mps: bool = True,
                  dry_run: bool = False,
                  seed: int = 1,
                  log_interval: int = 10,
                  ):
    # pytorch native training code........
    ...

Now we’re ready to run the training code.

Run the training code with an @remote decorator

We can run the code from a terminal or from any executable prompt. In this post, we use a SageMaker Studio notebook cell to demonstrate this:

!python ./train.py

Running the preceding command triggers the training job. In the logs, we can see that it’s downloading the packages from the private PyPI repository.

training-job-logs

This concludes the implementation of an @remote decorator working with a private repository in an environment with no internet access.

Clean up

To clean up the resources, follow the instructions in CLEANUP.md.

Conclusion

In this post, we learned how to effectively use the @remote decorator’s capabilities while still working in restrictive environments without any internet access. We also learned how we can integrate CodeArtifact private repository capabilities with the help of configuration file support in SageMaker. This solution makes iterative development much simpler and faster. Another advantage is that you can continue to write the training code in a more natural, object-oriented way and still use SageMaker capabilities to run training jobs on a remote cluster with minimal changes to your code. All the code shown as part of this post is available in the GitHub repository.

As a next step, we encourage you to check out the @remote decorator functionality and Python SDK API and use it in your choice of environment and IDE. Additional examples are available in the amazon-sagemaker-examples repository to get you started quickly. You can also check out the post Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes for more details.


About the author

Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers from financial industries design and build solutions on generative AI and ML. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

Integrate SaaS platforms with Amazon SageMaker to enable ML-powered applications

Amazon SageMaker is an end-to-end machine learning (ML) platform with wide-ranging features to ingest, transform, and measure bias in data, and train, deploy, and manage models in production with best-in-class compute and services such as Amazon SageMaker Data Wrangler, Amazon SageMaker Studio, Amazon SageMaker Canvas, Amazon SageMaker Model Registry, Amazon SageMaker Feature Store, Amazon SageMaker Pipelines, Amazon SageMaker Model Monitor, and Amazon SageMaker Clarify. Many organizations choose SageMaker as their ML platform because it provides a common set of tools for developers and data scientists. A number of AWS independent software vendor (ISV) partners have already built integrations for users of their software as a service (SaaS) platforms to utilize SageMaker and its various features, including training, deployment, and the model registry.

In this post, we cover the benefits for SaaS platforms to integrate with SageMaker, the range of possible integrations, and the process for developing these integrations. We also deep dive into the most common architectures and AWS resources to facilitate these integrations. This is intended to accelerate time-to-market for ISV partners and other SaaS providers building similar integrations and inspire customers who are users of SaaS platforms to partner with SaaS providers on these integrations.

Benefits of integrating with SageMaker

There are a number of benefits for SaaS providers to integrate their SaaS platforms with SageMaker:

  • Users of the SaaS platform can take advantage of a comprehensive ML platform in SageMaker
  • Users can build ML models with data that is in or outside of the SaaS platform and exploit these ML models
  • It provides users with a seamless experience between the SaaS platform and SageMaker
  • Users can utilize foundation models available in Amazon SageMaker JumpStart to build generative AI applications
  • Organizations can standardize on SageMaker
  • SaaS providers can focus on their core functionality and offer SageMaker for ML model development
  • It equips SaaS providers with a basis to build joint solutions and go to market with AWS

SageMaker overview and integration options

SageMaker has tools for every step of the ML lifecycle. SaaS platforms can integrate with SageMaker across the ML lifecycle from data labeling and preparation to model training, hosting, monitoring, and managing models with various components, as shown in the following figure. Depending on the needs, any and all parts of the ML lifecycle can be run in either the customer AWS account or SaaS AWS account, and data and models can be shared across accounts using AWS Identity and Access Management (IAM) policies or third-party user-based access tools. This flexibility in the integration makes SageMaker an ideal platform for customers and SaaS providers to standardize on.

SageMaker overview

Integration process and architectures

In this section, we break the integration process into four main stages and cover the common architectures. Note that there can be other integration points in addition to these, but those are less common.

  • Data access – How data that is in the SaaS platform is accessed from SageMaker
  • Model training – How the model is trained
  • Model deployment and artifacts – Where the model is deployed and what artifacts are produced
  • Model inference – How the inference happens in the SaaS platform

The diagrams in the following sections assume SageMaker is running in the customer AWS account. Most of the options explained are also applicable if SageMaker is running in the SaaS AWS account. In some cases, an ISV may deploy their software in the customer AWS account. This is usually in a dedicated customer AWS account, meaning there still needs to be cross-account access to the customer AWS account where SageMaker is running.

There are a few different ways in which authentication across AWS accounts can be achieved when data in the SaaS platform is accessed from SageMaker and when the ML model is invoked from the SaaS platform. The recommended method is to use IAM roles. An alternative is to use AWS access keys consisting of an access key ID and secret access key.
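As a sketch of the recommended role-based approach, the following builds the trust policy document for an IAM role that another AWS account is allowed to assume. The account ID is a placeholder, and the external ID condition is an assumption we added for illustration (it is a common safeguard for third-party access, but this post does not prescribe it).

```python
import json


def cross_account_trust_policy(trusted_account_id, external_id):
    """Return a trust policy letting another account assume the role.

    Hypothetical helper: attach the result as the role's trust policy.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to call sts:AssumeRole
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # Assumption: require an agreed external ID on every call
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only
policy = cross_account_trust_policy("111122223333", "example-external-id")
print(json.dumps(policy, indent=2))
```

The permissions the role grants (for example, invoking a SageMaker endpoint or reading an S3 bucket) are defined separately in the role's permissions policy.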

Data access

There are multiple options on how data that is in the SaaS platform can be accessed from SageMaker. Data can either be accessed from a SageMaker notebook, SageMaker Data Wrangler, where users can prepare data for ML, or SageMaker Canvas. The most common data access options are:

  • SageMaker Data Wrangler built-in connector – The SageMaker Data Wrangler connector enables data to be imported from a SaaS platform to be prepared for ML model training. The connector is developed jointly by AWS and the SaaS provider. Current SaaS platform connectors include Databricks and Snowflake.
  • Amazon Athena Federated Query for the SaaS platform – Federated queries enable users to query the platform from a SageMaker notebook via Amazon Athena using a custom connector that is developed by the SaaS provider.
  • Amazon AppFlow – With Amazon AppFlow, you can use a custom connector to extract data into Amazon Simple Storage Service (Amazon S3) which subsequently can be accessed from SageMaker. The connector for a SaaS platform can be developed by AWS or the SaaS provider. The open-source Custom Connector SDK enables the development of a private, shared, or public connector using Python or Java.
  • SaaS platform SDK – If the SaaS platform has an SDK (Software Development Kit), such as a Python SDK, this can be used to access data directly from a SageMaker notebook.
  • Other options – In addition to these, there can be other options depending on whether the SaaS provider exposes their data via APIs, files or an agent. The agent can be installed on Amazon Elastic Compute Cloud (Amazon EC2) or AWS Lambda. Alternatively, a service such as AWS Glue or a third-party extract, transform, and load (ETL) tool can be used for data transfer.

The following diagram illustrates the architecture for data access options.

Data access

Model training

The model can be trained in SageMaker Studio by a data scientist, using Amazon SageMaker Autopilot by a non-data scientist, or in SageMaker Canvas by a business analyst. SageMaker Autopilot takes away the heavy lifting of building ML models, including feature engineering, algorithm selection, and hyperparameter settings, and it is also relatively straightforward to integrate directly into a SaaS platform. SageMaker Canvas provides a no-code visual interface for training ML models.

In addition, data scientists can use pre-trained models available in SageMaker JumpStart, including foundation models from sources such as Alexa, AI21 Labs, Hugging Face, and Stability AI, and tune them for their own generative AI use cases.

Alternatively, the model can be trained in a third-party or partner-provided tool, service, and infrastructure, including on-premises resources, provided the model artifacts are accessible and readable.

The following diagram illustrates these options.

Model training

Model deployment and artifacts

After you have trained and tested the model, you can either deploy it to a SageMaker model endpoint in the customer account, or export it from SageMaker and import it into the SaaS platform storage. The model can be stored and imported in standard formats supported by the common ML frameworks, such as pickle, joblib, and ONNX (Open Neural Network Exchange).
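To illustrate the export and import path, the sketch below round-trips a toy model through pickle, one of the formats mentioned above. The model here is just a dictionary of made-up parameters with a hand-written predict function. A real model would come from an ML framework, and the serialized bytes would be written to the SaaS platform storage instead of being kept in memory.

```python
import pickle


def predict(params, features):
    """Toy linear classifier: returns 1 when the weighted sum is non-negative."""
    score = sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]
    return 1 if score >= 0 else 0


# Made-up parameters standing in for a trained model
params = {"weights": [0.5, -0.25], "bias": 0.1}

# Export: serialize the model to bytes (the model artifact)
artifact = pickle.dumps(params)

# Import: restore the model on the other side
restored = pickle.loads(artifact)

print(predict(restored, [1.0, 1.0]))
print(predict(restored, [0.0, 2.0]))
```

Because unpickling can execute arbitrary code, only load pickle artifacts that come from a trusted source.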

If the ML model is deployed to a SageMaker model endpoint, additional model metadata can be stored in the SageMaker Model Registry, SageMaker Model Cards, or in a file in an S3 bucket. This can be the model version, model inputs and outputs, model metrics, model creation date, inference specification, data lineage information, and more. Where there isn’t a property available in the model package, the data can be stored as custom metadata or in an S3 file.

Creating such metadata can help SaaS providers manage the end-to-end lifecycle of the ML model more effectively. This information can be synced to the model log in the SaaS platform and used to track changes and updates to the ML model. Subsequently, this log can be used to determine whether to refresh downstream data and applications that use that ML model in the SaaS platform.

The following diagram illustrates this architecture.

Model deployment and artifacts

Model inference

SageMaker offers four options for ML model inference: real-time inference, serverless inference, asynchronous inference, and batch transform. For the first three, the model is deployed to a SageMaker model endpoint and the SaaS platform invokes the model using the AWS SDKs. The recommended option is to use the Python SDK. The inference pattern for each of these is similar in that the predictor’s predict() or predict_async() methods are used. Cross-account access can be achieved using role-based access.

It’s also possible to seal the backend with Amazon API Gateway, which calls the endpoint via a Lambda function that runs in a protected private network.

For batch transform, data from the SaaS platform first needs to be exported in batch into an S3 bucket in the customer AWS account, then the inference is done on this data in batch. The inference is done by first creating a transformer job or object, and then calling the transform() method with the S3 location of the data. Results are imported back into the SaaS platform in batch as a dataset, and joined to other datasets in the platform as part of a batch pipeline job.

Another option for inference is to do it directly in the SaaS account compute cluster. This would be the case when the model has been imported into the SaaS platform. In this case, SaaS providers can choose from a range of EC2 instances that are optimized for ML inference.

The following diagram illustrates these options.

Model inference

Example integrations

Several ISVs have built integrations between their SaaS platforms and SageMaker. To learn more about some example integrations, refer to the following:

Conclusion

In this post, we explained why and how SaaS providers should integrate SageMaker with their SaaS platforms by breaking the process into four parts and covering the common integration architectures. SaaS providers looking to build an integration with SageMaker can utilize these architectures. If there are any custom requirements beyond what has been covered in this post, including with other SageMaker components, get in touch with your AWS account teams. Once the integration has been built and validated, ISV partners can join the AWS Service Ready Program for SageMaker and unlock a variety of benefits.

We also ask customers who are users of SaaS platforms to register their interest in an integration with Amazon SageMaker with their AWS account teams, as this can help inspire and progress the development for SaaS providers.


About the Authors

Mehmet Bakkaloglu is a Principal Solutions Architect at AWS, focusing on Data Analytics, AI/ML and ISV partners.

Raj Kadiyala is a Principal AI/ML Evangelist at AWS.

Highlight text as it’s being spoken using Amazon Polly

Amazon Polly is a service that turns text into lifelike speech. It enables the development of a whole class of applications that can convert text into speech in multiple languages.

This service can be used by chatbots, audio books, and other text-to-speech applications in conjunction with other AWS AI or machine learning (ML) services. For example, Amazon Lex and Amazon Polly can be combined to create a chatbot that engages in a two-way conversation with a user and performs certain tasks based on the user’s commands. Amazon Transcribe, Amazon Translate, and Amazon Polly can be combined to transcribe speech to text in the source language, translate it to a different language, and speak it.

In this post, we present an interesting approach for highlighting text as it’s being spoken using Amazon Polly. This solution can be used in many text-to-speech applications to do the following:

  • Add visual capabilities to audio in books, websites, and blogs
  • Increase comprehension when customers are trying to understand the text rapidly as it’s being spoken

Our solution gives the client (the browser, in this example) the ability to know what text (word or sentence) is being spoken by Amazon Polly at any instant. This enables the client to dynamically highlight the text as it’s being spoken. Such a capability is useful for providing visual aid to speech for the use cases mentioned previously.

Our solution can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, or perform other animations on the front end as the text is being spoken. This capability is useful for creating dynamic audio books, educational content, and richer text-to-speech applications.

Solution overview

At its core, the solution uses Amazon Polly to convert a string of text into speech. The text can be input from the browser or through an API call to the endpoint exposed by our solution. The speech generated by Amazon Polly is stored as an audio file (MP3 format) in an Amazon Simple Storage Service (Amazon S3) bucket.

However, using the audio file alone, the browser can’t find what parts of the text are being spoken at any instant because we don’t have granular information on when each word is spoken.

Amazon Polly provides a way to obtain this information using speech marks. Speech marks are stored in a text file that shows the time (measured in milliseconds from the start of the audio) when each word or sentence is spoken.

Amazon Polly returns speech mark objects in a line-delimited JSON stream. A speech mark object contains the following fields:

  • Time – The timestamp in milliseconds from the beginning of the corresponding audio stream
  • Type – The type of speech mark (sentence, word, viseme, or SSML)
  • Start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
  • End – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
  • Value – This varies depending on the type of speech mark:
    • SSML – <mark> SSML tag
    • Viseme – The viseme name
    • Word or sentence – A substring of the input text as delimited by the start and end fields

For example, the sentence “Mary had a little lamb” can give you the following speech marks file if you use SpeechMarkTypes = [“word”, “sentence”] in the API call to obtain the speech marks:

{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":882,"type":"word","start":18, "end":22,"value":"lamb"}

The word “had” (at the end of line 3) begins 373 milliseconds after the audio stream begins, starts at byte 5, and ends at byte 8 of the input text.
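The following minimal sketch parses the speech marks shown above and recovers each value from the input text. Because the start and end offsets are in bytes, the text is sliced after encoding it as UTF-8 (we assume UTF-8 input here). The function names are our own.

```python
import json

INPUT_TEXT = "Mary had a little lamb."

# The line-delimited JSON stream from the example above
SPEECH_MARKS = """
{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":882,"type":"word","start":18, "end":22,"value":"lamb"}
"""


def parse_speech_marks(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream.splitlines() if line.strip()]


def spoken_text(input_text, mark):
    """Slice the input by byte offsets, not character offsets."""
    encoded = input_text.encode("utf-8")
    return encoded[mark["start"]:mark["end"]].decode("utf-8")


marks = parse_speech_marks(SPEECH_MARKS)
for mark in marks:
    print(mark["time"], mark["type"], repr(spoken_text(INPUT_TEXT, mark)))
```

For plain ASCII text, byte and character offsets coincide. They diverge as soon as the text contains multi-byte characters, which is why the slice is taken on the encoded bytes.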

Architecture overview

The architecture of our solution is presented in the following diagram.

Architecture Diagram

Highlight Text as it’s spoken, using Amazon Polly

Our website for the solution is stored on Amazon S3 as static files (JavaScript, HTML), which are hosted in Amazon CloudFront (1) and served to the end-user’s browser (2).

When the user enters text in the browser through a simple HTML form, it’s processed by JavaScript in the browser. This calls an API (3) through Amazon API Gateway to invoke an AWS Lambda function (4). The Lambda function calls Amazon Polly (5) to generate speech (audio) and speech marks (JSON) files. Two calls are made to Amazon Polly, one for the audio and one for the speech marks, using JavaScript async functions. The output of these calls is the audio and speech marks files, which are stored in Amazon S3 (6a). To minimize the chances of multiple users overwriting each other’s files in the S3 bucket, the files are stored in a folder named with a timestamp. For a production release, we can employ more robust approaches to segregate users’ files based on user ID or timestamp and other unique characteristics.
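One possible way to make the storage location more robust, sketched below, is to combine a user ID, a UTC timestamp, and a random UUID in the S3 key prefix. The prefix layout and the function name are assumptions for illustration, not what the solution’s Lambda function does.

```python
import uuid
from datetime import datetime, timezone


def output_prefix(user_id):
    """Build a per-request S3 key prefix that is very unlikely to collide."""
    # UTC timestamp keeps the keys sortable by creation time
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # The random UUID separates requests made within the same second
    return f"{user_id}/{stamp}-{uuid.uuid4().hex}/"


# Placeholder user ID for illustration only
first = output_prefix("user-123")
second = output_prefix("user-123")
print(first)
print(second)
```

The audio and speech marks files for one request can then share the same prefix, for example by appending fixed file names to it.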

The Lambda function creates pre-signed URLs for the speech and speech marks files and returns them to the browser in the form of an array (7, 8, 9).

When the browser sends the text file to the API endpoint (3), it gets back two pre-signed URLs for the audio file and the speech marks file in one synchronous invocation (9). This is indicated by the key symbol next to the arrow.

A JavaScript function in the browser fetches the speech marks file and the audio from their URL handles (10). It sets up the audio player to play the audio. (The HTML audio tag is used for this purpose).

When the user clicks the play button, it parses the speech marks retrieved in the earlier step to create a series of timed events using timeouts. The events invoke a callback function, which is another JavaScript function used to highlight the spoken text in the browser. Simultaneously, the JavaScript function streams the audio file from its URL handle.

The result is that the events are run at the appropriate times to highlight the text as it’s spoken while the audio is being played. The use of JavaScript timeouts provides us the synchronization of the audio with the highlighted text.
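The synchronization logic can also be expressed as a lookup: given the elapsed playback time, find the latest word whose timestamp has already passed. The sketch below shows this lookup with a binary search. Note that this is a different technique from the one-timeout-per-word approach the solution uses, shown here only to make the timing relationship explicit. The function name is our own.

```python
from bisect import bisect_right

# Word-level speech marks from the earlier "Mary had a little lamb" example
WORD_MARKS = [
    {"time": 6, "start": 0, "end": 4, "value": "Mary"},
    {"time": 373, "start": 5, "end": 8, "value": "had"},
    {"time": 604, "start": 9, "end": 10, "value": "a"},
    {"time": 643, "start": 11, "end": 17, "value": "little"},
    {"time": 882, "start": 18, "end": 22, "value": "lamb"},
]


def word_at(word_marks, elapsed_ms):
    """Return the mark being spoken at elapsed_ms, or None before the first word.

    Assumes word_marks is sorted by time, as Amazon Polly returns it.
    """
    times = [mark["time"] for mark in word_marks]
    # Index of the last mark whose time is <= elapsed_ms
    index = bisect_right(times, elapsed_ms) - 1
    return word_marks[index] if index >= 0 else None


for elapsed in (0, 400, 650, 900):
    mark = word_at(WORD_MARKS, elapsed)
    print(elapsed, mark["value"] if mark else None)
```

A lookup like this can be driven by the audio player’s reported position, which keeps the highlight aligned if the user pauses or seeks, whereas timers set at the start assume uninterrupted playback.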

Prerequisites

To run this solution, you need an AWS account with an AWS Identity and Access Management (IAM) user who has permission to use Amazon CloudFront, Amazon API Gateway, Amazon Polly, Amazon S3, AWS Lambda, and AWS Step Functions.

Use Lambda to generate speech and speech marks

The following code invokes the Amazon Polly synthesize_speech function two times to fetch the audio and speech marks file. They’re run as asynchronous functions and coordinated to return the result at the same time using promises.

const p1 = new Promise(doSynthesizeSpeechmarks);
const p2 = new Promise(doSynthesizeSpeech);
var result;

await Promise.all([p1, p2])
.then((values) => {
//return array of presigned urls 
     console.log('Values:', values);
     result = { "output" : values };
})
.catch((err) => {
     console.log("Error:" + err);
     result = err;
});

On the JavaScript side, the text highlighting is done by highlighter(start, finish, word) and the timed events are set by setTimers():

function highlighter(start, finish, word) {
     let textarea = document.getElementById("postText");
     //console.log(start + "," + finish + "," + word);
     textarea.focus();
     textarea.setSelectionRange(start, finish);
}

function setTimers() {
     let speechmarksStr = sessionStorage.getItem("speechmarks");
     //read through the speech marks file and set timers for every word
     console.log(speechmarksStr);
     let speechmarks = speechmarksStr.split("\n");
     for (let i = 0; i < speechmarks.length; i++) {
          //console.log(i + ":" + speechmarks[i]);
          //skip blank lines, such as the one after the trailing newline
          if (speechmarks[i].length == 0) {
               continue;
          }

          const smjson = JSON.parse(speechmarks[i]);
          const t = smjson["time"];
          const s = smjson["start"];
          const f = smjson["end"];
          const word = smjson["value"];
          //schedule the highlight t milliseconds from now
          setTimeout(highlighter, t, s, f, word);
     }
}

Alternative approaches

Instead of the previous approach, you can consider a few alternatives:

  • Create both the speech marks and audio files inside a Step Functions state machine. The state machine can use parallel branches to invoke two different Lambda functions: one to generate speech and another to generate speech marks. The code for this can be found in the using-step-functions subfolder in the GitHub repo.
  • Invoke Amazon Polly asynchronously to generate the audio and speech marks. This approach can be used if the text content is large or the user doesn’t need a real-time response. For more details about creating long audio files, refer to Creating Long Audio Files.
  • Have Amazon Polly create the presigned URL directly using the generate_presigned_url call on the Amazon Polly client in Boto3. If you go with this approach, Amazon Polly generates the audio and speech marks newly every time. In our current approach, we store these files in Amazon S3. Although these stored files aren’t accessible from the browser in our version of the code, you can modify the code to play previously generated audio files by fetching them from Amazon S3 (instead of regenerating the audio for the text again using Amazon Polly). We have more code examples for accessing Amazon Polly with Python in the AWS Code Library.

Create the solution

The entire solution is available from our GitHub repo. To create this solution in your account, follow the instructions in the README.md file. The solution includes an AWS CloudFormation template to provision your resources.

Cleanup

To clean up the resources created in this demo, perform the following steps:

  1. Delete the S3 buckets created to store the CloudFormation template (Bucket A), the source code (Bucket B) and the website (pth-cf-text-highlighter-website-[Suffix]).
  2. Delete the CloudFormation stack pth-cf.
  3. Delete the S3 bucket containing the speech files (pth-speech-[Suffix]). This bucket was created by the CloudFormation template to store the audio and speech marks files generated by Amazon Polly.

Summary

In this post, we showed an example of a solution that can highlight text as it’s being spoken using Amazon Polly. It was developed using the Amazon Polly speech marks feature, which provides us markers for the place each word or sentence begins in an audio file.

The solution is available as a CloudFormation template. It can be deployed as is to any web application that performs text-to-speech conversion. This would be useful for adding visual capabilities to audio in books, avatars with lip-sync capabilities (using viseme speech marks), websites, and blogs, and for aiding people with hearing impairments.

It can be extended to perform additional tasks besides highlighting text. For example, the browser can show images, play music, and perform other animations on the front end while the text is being spoken. This capability can be useful for creating dynamic audio books, educational content, and richer text-to-speech applications.

We welcome you to try out this solution and learn more about the relevant AWS services from the following links. You can extend the functionality for your specific needs.


About the Author

Varad G Varadarajan is a Trusted Advisor and Field CTO for Digital Native Businesses (DNB) customers at AWS. He helps them architect and build innovative solutions at scale using AWS products and services. Varad’s areas of interest are IT strategy consulting, architecture, and product management. Outside of work, Varad enjoys creative writing, watching movies with family and friends, and traveling.


Predict vehicle fleet failure probability using Amazon SageMaker Jumpstart

Predictive maintenance is critical in automotive industries because it can avoid unexpected mechanical failures and reactive maintenance activities that disrupt operations. By predicting vehicle failures and scheduling maintenance and repairs ahead of time, you can reduce downtime, improve safety, and boost productivity.

What if we could apply deep learning techniques to common areas that drive vehicle failures, unplanned downtime, and repair costs?

In this post, we show you how to train and deploy a model to predict vehicle fleet failure probability using Amazon SageMaker JumpStart. SageMaker JumpStart is the machine learning (ML) hub of Amazon SageMaker, providing pre-trained, publicly available models for a wide range of problem types to help you get started with ML. The solution outlined in the post is available on GitHub.

SageMaker JumpStart solution templates

SageMaker JumpStart provides one-click, end-to-end solutions for many common ML use cases. Explore the following use cases for more information on available solution templates:

The SageMaker JumpStart solution templates cover a variety of use cases, under each of which several different solution templates are offered (the solution in this post, Predictive Maintenance for Vehicle Fleets, is in the Solutions section). Choose the solution template that best fits your use case from the SageMaker JumpStart landing page. For more information on specific solutions under each use case and how to launch a SageMaker JumpStart solution, see Solution Templates.

Solution overview

The AWS predictive maintenance solution for automotive fleets applies deep learning techniques to common areas that drive vehicle failures, unplanned downtime, and repair costs. It serves as an initial building block for you to get to a proof of concept in a short period of time. This solution contains data preparation and visualization functionality within SageMaker and allows you to train and optimize the hyperparameters of deep learning models for your dataset. You can use your own data or try the solution with a synthetic dataset as part of this solution. This version processes vehicle sensor data over time. A subsequent version will process maintenance record data.

The following diagram demonstrates how you can use this solution with SageMaker components. As part of the solution, the following services are used:

  • Amazon S3 – We use Amazon Simple Storage Service (Amazon S3) to store datasets
  • SageMaker notebook – We use a notebook to preprocess and visualize the data, and to train the deep learning model
  • SageMaker endpoint – We use the endpoint to deploy the trained model


The workflow includes the following steps:

  1. An extract of historical data is created from the Fleet Management System containing vehicle data and sensor logs.
  2. After the ML model is trained, the SageMaker model artifact is deployed.
  3. The connected vehicle sends sensor logs to AWS IoT Core (alternatively, via an HTTP interface).
  4. Sensor logs are persisted via Amazon Kinesis Data Firehose.
  5. Sensor logs are sent to AWS Lambda for querying against the model to make predictions.
  6. Lambda sends sensor logs to SageMaker model inference for predictions.
  7. Predictions are persisted in Amazon Aurora.
  8. Aggregate results are displayed on an Amazon QuickSight dashboard.
  9. Real-time notifications on the predicted probability of failure are sent to Amazon Simple Notification Service (Amazon SNS).
  10. Amazon SNS sends notifications back to the connected vehicle.

The solution consists of six notebooks:

  • 0_demo.ipynb – A quick preview of our solution
  • 1_introduction.ipynb – Introduction and solution overview
  • 2_data_preparation.ipynb – Prepare a sample dataset
  • 3_data_visualization.ipynb – Visualize our sample dataset
  • 4_model_training.ipynb – Train a model on our sample dataset to detect failures
  • 5_results_analysis.ipynb – Analyze the results from the model we trained

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run SageMaker JumpStart, we need to set up SageMaker Studio. You can skip this step if you already have your own version of SageMaker Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Then we create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch SageMaker Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

To run this SageMaker JumpStart solution and have the infrastructure deployed to your AWS account, you need to create an active SageMaker Studio instance (see Onboard to Amazon SageMaker Studio). When your instance is ready, use the instructions in SageMaker JumpStart to launch the solution. The solution artifacts are included in this GitHub repository for reference.

Launch the SageMaker JumpStart solution

To get started with the solution, complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart.
    choose jumpstart
  2. On the Solutions tab, choose Predictive Maintenance for Vehicle Fleets.
    choose predictive maintenance
  3. Choose Launch.
    launch solution
    It takes a few minutes to deploy the solution.
  4. After the solution is deployed, choose Open Notebook.
    open notebook

If you’re prompted to select a kernel, choose PyTorch 1.8 Python 3.6 for all notebooks in this solution.

Solution preview

We first work on the 0_demo.ipynb notebook. In this notebook, you can get a quick preview of what the outcome will look like when you complete the full notebook for this solution.

Choose Run and Run All Cells to run all cells in SageMaker Studio (or Cell and Run All in a SageMaker notebook instance). You can run all the cells in each notebook one after the other. Ensure all the cells finish processing before moving to the next notebook.

run all cells

This solution relies on a config file to run the provisioned AWS resources. We generate the file as follows:

import boto3
import os
import json

# Look up the Service Catalog provisioned product that deployed this solution
client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

# Keep only the outputs that have both a key and a value
keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

# Save the stack outputs so that later cells can read them
with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)
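
When collecting the stack outputs, both keys need to be checked explicitly: in Python, an expression such as 'OutputKey' and 'OutputValue' in x only tests for 'OutputValue', because the non-empty string 'OutputKey' is always truthy. The following sketch isolates the filtering step in a hypothetical helper (stack_outputs_from_record is not part of the solution) and runs it on a made-up record so that it works without AWS access.

```python
def stack_outputs_from_record(record):
    """Build an {OutputKey: OutputValue} dict from a describe_record-style response."""
    return {
        output["OutputKey"]: output["OutputValue"]
        for output in record.get("RecordOutputs", [])
        if "OutputKey" in output and "OutputValue" in output
    }


# Made-up record for illustration; real values come from AWS Service Catalog
sample_record = {
    "RecordOutputs": [
        {"OutputKey": "S3Bucket", "OutputValue": "example-bucket"},
        {"OutputKey": "Incomplete"},  # no OutputValue, so this entry is skipped
    ]
}
print(stack_outputs_from_record(sample_record))
```

A single dictionary comprehension also avoids filtering the same list twice to build separate key and value lists.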

We have some sample time series input data consisting of a vehicle’s battery voltage and battery current over time. Next, we load and visualize the sample data. As shown in the following screenshots, the voltage and current values are on the Y axis and the readings (19 readings recorded) are on the X axis.

volt

current

volt and current

We have previously trained a model on this voltage and current data that predicts the probability of vehicle failure and have deployed the model as an endpoint in SageMaker. We will call this endpoint with some sample data to determine the probability of failure in the next time period.

Given the sample input data, the predicted probability of failure is 45.73%.

To move to the next stage, choose Click here to continue.

next stage

Introduction and solution overview

The 1_introduction.ipynb notebook provides an overview of the solution and stages, and a look into the configuration file that has content definition, data sampling period, train and test sample count, parameters, location, and column names for generated content.

After you review this notebook, you can move to the next stage.

Prepare a sample dataset

We prepare a sample dataset in the 2_data_preparation.ipynb notebook.

We first generate the configuration file for this solution:

import boto3
import os
import json

# Look up the Service Catalog provisioned product that deployed this solution
client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i = cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

# Keep only the outputs that have both a key and a value
keys = [x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
values = [x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' in x and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

# Save the stack outputs so that later cells can read them
with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
    json.dump(stack_output, f)

from source.config import Config
from source.preprocessing import pivot_data, sample_dataset
from source.dataset import DatasetGenerator

# Load the solution configuration and display it
config = Config(filename="config/config.yaml", fetch_sensor_headers=False)
config

The config properties are as follows:

fleet_info_fn=data/example_fleet_info.csv
fleet_sensor_logs_fn=data/example_fleet_sensor_logs.csv
vehicle_id_column=vehicle_id
timestamp_column=timestamp
target_column=target
period_ms=30000
dataset_size=25000
window_length=20
chunksize=10000
processing_chunksize=2500
fleet_dataset_fn=data/processed/fleet_dataset.csv
train_dataset_fn=data/processed/train_dataset.csv
test_dataset_fn=data/processed/test_dataset.csv
period_column=period_ms

You can define your own dataset or use our scripts to generate a sample dataset:

if should_generate_data:
    fleet_statistics_fn = "data/generation/fleet_statistics.csv"
    generator = DatasetGenerator(fleet_statistics_fn=fleet_statistics_fn,
                                 fleet_info_fn=config.fleet_info_fn, 
                                 fleet_sensor_logs_fn=config.fleet_sensor_logs_fn, 
                                 period_ms=config.period_ms, 
                                 )
    generator.generate_dataset()

assert os.path.exists(config.fleet_info_fn), "Please copy your data to {}".format(config.fleet_info_fn)
assert os.path.exists(config.fleet_sensor_logs_fn), "Please copy your data to {}".format(config.fleet_sensor_logs_fn)

You can merge the sensor data and fleet vehicle data together:

pivot_data(config)
sample_dataset(config)

We can now move to data visualization.

Visualize our sample dataset

We visualize our sample dataset in 3_data_visualization.ipynb. This solution relies on a config file to run the provisioned AWS resources. Let’s generate the file as we did in the previous notebook.

The following screenshot shows our dataset.

dataset

Next, let’s build the dataset:

train_ds = PMDataset_torch(
    config.train_dataset_fn,
    sensor_headers=config.sensor_headers,
    target_column=config.target_column,
    standardize=True)

properties = train_ds.vehicle_properties_headers.copy()
properties.remove('vehicle_id')
properties.remove('timestamp')
properties.remove('period_ms')

Now that the dataset is ready, let’s visualize the data statistics. The following screenshot shows the data distribution based on vehicle make, engine type, vehicle class, and model.

visualize

Comparing the log data, let’s look at an example of the mean voltage across different years for Make E and C (random).

The mean of voltage and current is on the Y axis and the number of readings is on the X axis.

  • Possible values for log_target: [‘make’, ‘model’, ‘year’, ‘vehicle_class’, ‘engine_type’]
    • Randomly assigned value for log_target: make
  • Possible values for log_target_value1: [‘Make A’, ‘Make B’, ‘Make E’, ‘Make C’, ‘Make D’]
    • Randomly assigned value for log_target_value1: Make B
  • Possible values for log_target_value2: [‘Make A’, ‘Make B’, ‘Make E’, ‘Make C’, ‘Make D’]
    • Randomly assigned value for log_target_value2: Make D

Based on the above, we assume log_target: make, log_target_value1: Make B and log_target_value2: Make D

make b and d

The following graphs break down the mean of the log data.

engine g h e

The following graphs visualize an example of different sensor log values against voltage and current.

volt current 2

Train a model on our sample dataset to detect failures

In the 4_model_training.ipynb notebook, we train a model on our sample dataset to detect failures.

Let’s generate the configuration file similar to the previous notebook, and then proceed with training configuration:

sage_session = sagemaker.session.Session()
s3_bucket = sagemaker_configs["S3Bucket"]  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

# run in local_mode on this machine, or as a SageMaker TrainingJob
local_mode = False

if local_mode:
    instance_type = 'local'
else:
    instance_type = sagemaker_configs["SageMakerTrainingInstanceType"]
    
role = sagemaker.get_execution_role()
print("Using IAM role arn: {}".format(role))
# only run from SageMaker notebook instance
if local_mode:
    !/bin/bash ./setup.sh
cpu_or_gpu = 'gpu' if instance_type.startswith('ml.p') else 'cpu'


We can now define the data and initiate hyperparameter optimization:

%%time

estimator = PyTorch(entry_point="train.py",
                    source_dir='source',                    
                    role=role,
                    dependencies=["source/dl_utils"],
                    instance_type=instance_type,
                    instance_count=1,
                    output_path=s3_output_path,
                    framework_version="1.5.0",
                    py_version='py3',
                    base_job_name=job_name_prefix,
                    metric_definitions=metric_definitions,
                    hyperparameters= {
                        'epoch': 100,  # tune it according to your need
                        'target_column': config.target_column,
                        'sensor_headers': json.dumps(config.sensor_headers),
                        'train_input_filename': os.path.basename(config.train_dataset_fn),
                        'test_input_filename': os.path.basename(config.test_dataset_fn),
                        }
                     )

if local_mode:
    estimator.fit({'train': training_data, 'test': testing_data})

%%time

tuner = HyperparameterTuner(estimator,
                            objective_metric_name='test_auc',
                            objective_type='Maximize',
                            hyperparameter_ranges=hyperparameter_ranges,
                            metric_definitions=metric_definitions,
                            max_jobs=max_jobs,
                            max_parallel_jobs=max_parallel_jobs,
                            base_tuning_job_name=job_name_prefix)
tuner.fit({'train': training_data, 'test': testing_data})
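
The metric_definitions variable used above is defined elsewhere in the notebook and isn't shown in this post. As a rough illustration only: SageMaker metric definitions are a list of dictionaries, each with a Name and a Regex that is applied to the training logs. The sketch below assumes a hypothetical log format (for example, test_auc: 0.9123) and checks the regular expressions locally with Python's re module; the solution's actual log format and expressions may differ.

```python
import re

# Hypothetical metric definitions; "test_auc" matches objective_metric_name above
metric_definitions = [
    {"Name": "test_auc", "Regex": r"test_auc: ([0-9\.]+)"},
    {"Name": "test_loss", "Regex": r"test_loss: ([0-9\.]+)"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: float value} for every definition that matches log_line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Made-up log line for illustration
line = "epoch 12 test_loss: 0.3412 test_auc: 0.9123"
print(extract_metrics(line, metric_definitions))
```

Testing the expressions against a sample log line before launching a tuning job is a cheap way to catch a metric that would otherwise never be reported.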

Analyze the results from the model we trained

In the 5_results_analysis.ipynb notebook, we get data from our hyperparameter tuning job, visualize metrics of all the jobs to identify the best job, and build an endpoint for the best training job.

Let’s generate the configuration file similar to the previous notebook and visualize the metrics of all the jobs. The following plot visualizes test accuracy vs. epoch.

test accuracy

The following screenshot shows the hyperparameter tuning jobs we ran.

hyperparameter tuning jobs

You can now visualize data from the best training job (out of the four training jobs) based on the test accuracy (red).

As we can see in the following screenshots, the test loss declines and AUC and accuracy increase with epochs.

auc and accuracy

auc and accuracy 2

Based on the visualizations, we can now build an endpoint for the best training job:

%%time

role = sagemaker.get_execution_role()

model = PyTorchModel(model_data=model_artifact,
                     role=role,
                     entry_point="inference.py",
                     source_dir="source/dl_utils",
                     framework_version='1.5.0',
                     py_version = 'py3',
                     name=sagemaker_configs["SageMakerModelName"],
                     code_location="s3://{}/endpoint".format(s3_bucket)
                    )

endpoint_instance_type = sagemaker_configs["SageMakerInferenceInstanceType"]

predictor = model.deploy(initial_instance_count=1, instance_type=endpoint_instance_type, endpoint_name=sagemaker_configs["SageMakerEndpointName"])

def custom_np_serializer(data):
    return json.dumps(data.tolist())
    
def custom_np_deserializer(np_bytes, content_type='application/x-npy'):
    out = np.array(json.loads(np_bytes.read()))
    return out

predictor.serializer = custom_np_serializer
predictor.deserializer = custom_np_deserializer

After we build the endpoint, we can test the predictor by passing it sample sensor logs:

import json

import boto3
import botocore.config
import numpy as np

# Allow extra time for the endpoint to respond
config = botocore.config.Config(read_timeout=200)
runtime = boto3.client('runtime.sagemaker', config=config)

# Sample input: one window of 20 readings with 2 values each
data = np.ones(shape=(1, 20, 2)).tolist()
payload = json.dumps(data)

response = runtime.invoke_endpoint(EndpointName=sagemaker_configs["SageMakerEndpointName"],
                                   ContentType='application/json',
                                   Body=payload)
out = json.loads(response['Body'].read().decode())[0]

print("Given the sample input data, the predicted probability of failure is {:0.2f}%".format(100*(1.0-out[0])))

Given the sample input data, the predicted probability of failure is 34.60%.
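
To make the request and response handling explicit, here is a small sketch that runs without an endpoint. It assumes, based on the snippet above, that the input has shape (1, 20, 2) and that the first element of the first output row is subtracted from 1.0 to obtain the failure probability. Reading the three dimensions as batch size, window_length, and two sensor channels (voltage and current) is our interpretation, and build_payload and failure_percent are hypothetical helper names.

```python
import json


def build_payload(batch=1, window_length=20, channels=2, fill=1.0):
    """Return a JSON string for a (batch, window_length, channels) nested list."""
    data = [[[fill] * channels for _ in range(window_length)] for _ in range(batch)]
    return json.dumps(data)


def failure_percent(response_body):
    """Convert the endpoint's JSON response body into a failure percentage."""
    out = json.loads(response_body)[0]
    return 100 * (1.0 - out[0])


payload = build_payload()
window = json.loads(payload)[0]
print("Readings per window:", len(window), "values per reading:", len(window[0]))

# Made-up response body for illustration; a real one comes from invoke_endpoint
fake_body = json.dumps([[0.654]])
print("Predicted probability of failure: {:0.2f}%".format(failure_percent(fake_body)))
```

Keeping the payload construction and response parsing in small functions makes them easy to reuse, for example in the Lambda function that queries the model in the workflow described earlier.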

Clean up

When you’ve finished with this solution, make sure that you delete all unwanted AWS resources. On the Predictive Maintenance for Vehicle Fleets page, under Delete solution, choose Delete all resources to delete all the resources associated with the solution.

clean up

You need to manually delete any extra resources that you may have created in this notebook. Examples include extra S3 buckets (in addition to the solution’s default bucket) and extra SageMaker endpoints (created with a custom name).

Customize the solution

Our solution is simple to customize. To modify the input data visualizations, refer to sagemaker/3_data_visualization.ipynb. To customize the machine learning, refer to sagemaker/source/train.py and sagemaker/source/dl_utils/network.py. To customize the dataset processing, refer to sagemaker/1_introduction.ipynb on how to define the config file.

Additionally, you can change the configuration in the config file. The default configuration is as follows:

fleet_info_fn=data/example_fleet_info.csv
fleet_sensor_logs_fn=data/example_fleet_sensor_logs.csv
vehicle_id_column=vehicle_id
timestamp_column=timestamp
target_column=target
period_ms=30000
dataset_size=10000
window_length=20
chunksize=10000
processing_chunksize=1000
fleet_dataset_fn=data/processed/fleet_dataset.csv
train_dataset_fn=data/processed/train_dataset.csv
test_dataset_fn=data/processed/test_dataset.csv
period_column=period_ms

The config file has the following parameters:

  • fleet_info_fn, fleet_sensor_logs_fn, fleet_dataset_fn, train_dataset_fn, and test_dataset_fn define the location of dataset files
  • vehicle_id_column, timestamp_column, target_column, and period_column define the headers for columns
  • dataset_size, chunksize, processing_chunksize, period_ms, and window_length define the properties of the dataset
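
As a quick sanity check on how these settings interact, the following sketch parses a few lines of the key=value listing above and computes the time span covered by one window. It assumes that a window consists of window_length consecutive readings taken period_ms apart, which is our reading of the parameter names rather than something stated in the solution; parse_config is a hypothetical helper.

```python
def parse_config(text):
    """Parse key=value lines into a dict, converting all-digit values to int."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # ignore blank or malformed lines
        key, value = line.split("=", 1)
        config[key] = int(value) if value.isdigit() else value
    return config


config_text = """
period_ms=30000
dataset_size=10000
window_length=20
train_dataset_fn=data/processed/train_dataset.csv
"""
config = parse_config(config_text)

# Assumed interpretation: one window covers window_length readings, period_ms apart
window_seconds = config["window_length"] * config["period_ms"] / 1000
print(f"One window spans about {window_seconds / 60:.0f} minutes")
```

Under this assumption, changing period_ms or window_length changes how much vehicle history the model sees for each prediction.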

Conclusion

In this post, we showed you how to train and deploy a model to predict vehicle fleet failure probability using SageMaker JumpStart. The solution is based on ML and deep learning models and allows a wide variety of input data including any time-varying sensor data. Because every vehicle has different telemetry on it, you can fine-tune the provided model to the frequency and type of data that you have.

To learn more about what you can do with SageMaker JumpStart, refer to the following:

Resources


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.
