Get started with generative AI on AWS using Amazon SageMaker JumpStart

Generative AI is attracting significant public attention at present, driven by products such as GPT-4, ChatGPT, DALL-E 2, Bard, and many other AI technologies. Many customers have been asking for more information on AWS’s generative AI solutions. This post aims to address that need.

This post provides an overview of generative AI with a real customer use case, describes the technology and its benefits, references an easy-to-follow demo of AWS DeepComposer for creating new musical compositions, and shows how to get started with Amazon SageMaker JumpStart to deploy GPT-2, Stable Diffusion 2.0, and other generative AI models.

Generative AI overview

Generative AI is a specific field of artificial intelligence that focuses on generating new material. It’s one of the most exciting fields in the AI world, with the potential to transform existing businesses and allow completely new business ideas to come to market. You can use generative techniques for:

  • Creating new works of art using a model such as Stable Diffusion 2.0
  • Writing a best-selling book using a model such as GPT2, Bloom, or Flan-T5-XL
  • Composing your next symphony using the Transformers technique in AWS DeepComposer

AWS DeepComposer is an educational tool that helps you understand the key concepts associated with machine learning (ML) through the language of musical composition. To learn more, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Stable Diffusion, GPT2, Bloom, and Flan-T5-XL are all ML models. They are simply mathematical algorithms that need to be trained to identify patterns within data. After the patterns are learned, they’re deployed onto endpoints, ready for a process known as inference. New data that the model hasn’t seen is fed into the inference model, and new creative material is produced.

For example, with image generation models such as Stable Diffusion, we can create stunning illustrations using a few words. With text generation models such as GPT2, Bloom, and Flan-T5-XL, we can generate new literary articles, and potentially books, from a simple human sentence.

Autodesk is an AWS customer using Amazon SageMaker to help their product designers sort through thousands of iterations of visual designs for various use cases and use ML to help choose the optimal design. Specifically, they have worked with Edera Safety to help develop a spinal cord protector that protects riders from accidents while participating in sporting events, such as mountain biking. For more information, check out the video AWS Machine Learning Enables Design Optimization.

To learn more about what AWS customers are doing with generative AI and fashion, refer to Virtual fashion styling with generative AI using Amazon SageMaker.

Now that we understand what generative AI is all about, let’s jump into a JumpStart demonstration to learn how to generate new text or images with AI.

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run JumpStart, we need to set up Studio. You can skip this step if you already have your own version of Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Next is to create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

Choose a JumpStart solution

Now we come to the exciting part. You should now be logged in to Studio, and see a page similar to the following screenshot.

In the navigation pane, under SageMaker JumpStart, choose Models, notebooks, solutions.

You’re presented with a range of solutions, foundation models, and other artifacts that can help you get started with a specific model or a specific business problem or use case.

If you want to experiment in a particular area, you can use the search function. Or you can simply browse the artifacts to find the relevant model or business solution for your needs.

For example, if you’re interested in fraud detection solutions, enter fraud detection into the search bar.

Fraud Detection Screenshot

If you’re interested in text generation solutions, enter text generation into the search bar. A good place to start if you want to explore a range of text generation models is to select the Intro to JS – Text Generation notebook.

JS - Text Generation

Let’s dive into a specific demonstration of the GPT-2 model.

JumpStart GPT-2 model demo

GPT-2 is a language model that generates human-like text based on a given prompt. We can use this type of transformer model to create new sentences and help automate writing. This can be used for content creation such as blogs, social media posts, and books.

The GPT-2 model is part of the Generative Pre-trained Transformer family and is the predecessor of GPT-3. At the time of writing, GPT-3 is used as the foundation for the OpenAI ChatGPT application.

To start exploring the GPT-2 model demo in JumpStart, complete the following steps:

  1. On JumpStart, search for and choose GPT 2.
  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.c5.2xlarge).

Different machine types have different price points attached. At the time of writing, the ml.c5.2xlarge that we selected incurs under $0.50 per hour. For the most up-to-date pricing, refer to Amazon SageMaker Pricing.

  4. For Endpoint name, enter demo-hf-textgeneration-gpt2.
  5. Choose Deploy.

Endpoint Name & Deploy

Wait for the ML endpoint to deploy (up to 15 minutes).

  6. When the endpoint is deployed, choose Open Notebook.

Endpoint Status

You’ll see a page similar to the following screenshot.
Python Code

The document we’re using to showcase our demonstration is a Jupyter notebook, which contains all the necessary Python code. Note that the code in this screenshot may be slightly different from the code you have, because AWS is constantly updating these notebooks to make sure they are secure, free of defects, and provide the best customer experience.

  7. Click into the first cell and press Ctrl+Enter to run the code block.

Code Block 1

An asterisk (*) appears to the left of the code block while the code is running; when it turns into a number, the cell has finished.

  8. In the next code block, enter some sample text, then press Ctrl+Enter.

Code Block 2

  9. Press Ctrl+Enter in the third code block to run it.

After about 30-60 seconds, you will see your inference results.

For the input text “Once upon a time there were 18 sandwiches,” we get the following generated text:

Once upon a time there were 18 sandwiches, four plates with some salad, and three sandwiches with some beef. One restaurant was so nice that the food was made by hand. There were people living at the beginning of the time who were waiting so that

For the input text “And for the final time Peter said to Mary,” we get the following generated text:

And for the final time Peter said to Mary that he was a saint.

11 But Peter said that it was not a blessing, but rather that it would be the death of Peter. And when Mary heard of that Peter said to him,

You can experiment with running this third code block multiple times, and you will notice that the model makes different predictions each time.

To tailor the output using some of the advanced features, scroll down to experiment in the fourth code block.
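
If you prefer to call the deployed endpoint from your own code rather than from the notebook, the following is a minimal boto3 sketch. The content type and response schema are assumptions based on typical JumpStart text generation notebooks, so verify them against the notebook paired with your endpoint.

import json

import boto3

# Minimal sketch: invoke the JumpStart GPT-2 endpoint deployed earlier (demo-hf-textgeneration-gpt2).
# The content type and response schema are assumptions; check the notebook that ships with the endpoint.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="demo-hf-textgeneration-gpt2",
    ContentType="application/x-text",
    Body="Once upon a time there were 18 sandwiches,".encode("utf-8"),
)

result = json.loads(response["Body"].read())
print(result)  # typically includes the generated text continuation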

To learn more about text generation models, refer to Run text generation with Bloom and GPT models on Amazon SageMaker JumpStart.

Clean up resources

Before we move on, don’t forget to delete your endpoint when you’re finished. On the previous tab, under Delete Endpoint, choose Delete.

Delete Endpoint

If you have accidentally closed this notebook, you can also delete your endpoint via the SageMaker console. Under Inference in the navigation pane, choose Endpoints.

Select the endpoint you used and on the Actions menu, choose Delete.

Delete Endpoint
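
If you prefer to clean up programmatically instead of using the notebook or the SageMaker console, a short boto3 sketch like the following also works. The endpoint config name matching the endpoint name is an assumption about how JumpStart created it, so confirm both names on the Endpoints page first.

import boto3

# Delete the GPT-2 demo endpoint and its endpoint configuration.
sm = boto3.client("sagemaker", region_name="us-east-1")
sm.delete_endpoint(EndpointName="demo-hf-textgeneration-gpt2")
sm.delete_endpoint_config(EndpointConfigName="demo-hf-textgeneration-gpt2")  # assumed to share the endpoint name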

Now that we understand how to use our first JumpStart solution, let’s look at using a Stable Diffusion model.

JumpStart Stable Diffusion model demo

We can use the Stable Diffusion 2 model to generate images from a simple line of text. This can be used to generate content for things like social media posts, promotional material, album covers, or anything that requires creative artwork.

  1. Return to JumpStart, then search for and choose Stable Diffusion 2.

Stable Diffusion 2

  2. In the Deploy Model section, expand Deployment Configuration.
  3. For SageMaker hosting instance, choose your instance (for this post, we use ml.g5.2xlarge).
  4. For Endpoint name, enter demo-stabilityai-stable-diffusion-v2.
  5. Choose Deploy.

Because this is a larger model, it can take up to 25 minutes to deploy. When it’s ready, the endpoint status shows as In Service.

In Service

  6. Choose Open Notebook to open a Jupyter notebook with Python code.

Python Code

  7. Run the first and second code blocks.
  8. In the third code block, change the text prompt, then run the cell.

Code Block 1

Wait about 30–60 seconds for your image to appear. The following image is based on our example text.

Output Picture

Again, you can play with the advanced features in the next code block. The picture it creates is different every time.
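
As with the GPT-2 demo, you can also call this endpoint outside the notebook. The following sketch assumes the response returns the generated image as a nested pixel array under a generated_image key, which matches typical JumpStart text-to-image notebooks but may differ in your version.

import json

import boto3
import numpy as np
from PIL import Image

# Minimal sketch: generate an image from a text prompt with the demo-stabilityai-stable-diffusion-v2 endpoint.
# The response key "generated_image" is an assumption; verify it in the notebook paired with your endpoint.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

response = runtime.invoke_endpoint(
    EndpointName="demo-stabilityai-stable-diffusion-v2",
    ContentType="application/x-text",
    Body="a watercolor painting of a lighthouse at sunset".encode("utf-8"),
)

payload = json.loads(response["Body"].read())
image = np.array(payload["generated_image"], dtype=np.uint8)  # assumed response schema
Image.fromarray(image).save("generated.png")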

Clean up resources

Again, don’t forget to delete your endpoint. This time, we’re using ml.g5.2xlarge, so it incurs slightly higher charges than before. At the time of writing, it was just over $1 per hour.

Finally, let’s move to AWS DeepComposer.

AWS DeepComposer

AWS DeepComposer is a great way to learn about generative AI. It allows you to use built-in melodies in your models to generate new forms of music. The model that you use determines how the input melody is transformed.

If you’re used to participating in AWS DeepRacer days to help your employees learn about reinforcement learning, consider augmenting and enhancing the day with AWS DeepComposer to learn about generative AI.

For a detailed explanation and easy-to-follow demonstration of three of the models in this post, refer to Generate a jazz rock track using Generative Artificial Intelligence.

Check out the following cool examples uploaded to SoundCloud using AWS DeepComposer.

We would love to see your experiments, so feel free to reach out via social media (@digitalcolmer) and share your learnings and experiments.

Conclusion

In this post, we talked about the definition of generative AI, illustrated by an AWS customer story. We then stepped you through how to get started with Studio and JumpStart, and showed you how to get started with the GPT-2 and Stable Diffusion models. We wrapped up with a brief overview of AWS DeepComposer.

To explore JumpStart more, try using your own data to fine-tune an existing model. For more information, refer to Incremental training with Amazon SageMaker JumpStart. For information about fine-tuning Stable Diffusion models, refer to Fine-tune text-to-image Stable Diffusion models with Amazon SageMaker JumpStart.

To learn more about Stable Diffusion models, refer to Generate images from text with the stable diffusion model on Amazon SageMaker JumpStart.

We didn’t cover any information on the Flan-T5-XL model, so to learn more, refer to the following GitHub repo. The Amazon SageMaker Examples repo also includes a range of available notebooks on GitHub for the various SageMaker products, including JumpStart, covering a range of different use cases.

To learn more about AWS ML via a range of free digital assets, check out our AWS Machine Learning Ramp-Up Guide. You can also try our free ML Learning Plan to build on your current knowledge or have a clear starting point. To take an instructor-led course, we highly recommend the following courses:

It is truly an exciting time in the AI/ML space. AWS is here to support your ML journey, so please connect with us on social media. We look forward to seeing all your learning, experiments, and fun with the various ML services over the coming months and relish the opportunity to be your instructor on your ML journey.


About the Author

Paul Colmer is a Senior Technical Trainer at Amazon Web Services specializing in machine learning and generative AI. His passion is helping customers, partners, and employees develop and grow through compelling storytelling, shared experiences, and knowledge transfer. With over 25 years in the IT industry, he specializes in agile cultural practices and machine learning solutions. Paul is a Fellow of the London College of Music and Fellow of the British Computer Society.


Quickly build high-accuracy Generative AI applications on enterprise data using Amazon Kendra, LangChain, and large language models

Generative AI (GenAI) and large language models (LLMs), such as those available soon via Amazon Bedrock and Amazon Titan, are transforming the way developers and enterprises are able to solve traditionally complex challenges related to natural language processing and understanding. Some of the benefits offered by LLMs include the ability to create more capable and compelling conversational AI experiences for customer service applications and to improve employee productivity through more intuitive and accurate responses.

For these use cases, however, it’s critical for the GenAI applications implementing the conversational experiences to meet two key criteria: limit the responses to company data, thereby mitigating model hallucinations (incorrect statements), and filter responses according to the end-user content access permissions.

To restrict the GenAI application responses to company data only, we need to use a technique called Retrieval Augmented Generation (RAG). An application using the RAG approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. LLMs have limitations on the maximum word count of the input prompt; therefore, choosing the right passages among thousands or millions of documents in the enterprise has a direct impact on the LLM’s accuracy.
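
To make the mechanics concrete, the following is an illustrative sketch (not the exact code used later in this post) of how a RAG application bundles retrieved passages with the user's request while respecting the model's input limit:

def build_rag_prompt(question: str, passages: list, max_chars: int = 6000) -> str:
    """Bundle retrieved passages as context for the LLM, truncating to respect its input limit."""
    context = ""
    for passage in passages:
        if len(context) + len(passage) > max_chars:
            break  # stop before exceeding the prompt budget
        context += passage.strip() + "\n\n"
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )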

In designing effective RAG, content retrieval is a critical step to ensure the LLM receives the most relevant and concise context from enterprise content to generate accurate responses. This is where the highly accurate, machine learning (ML)-powered intelligent search in Amazon Kendra plays an important role. Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for state-of-the-art ranking of documents and passages. You can use the high-accuracy search in Amazon Kendra to source the most relevant content and documents to maximize the quality of your RAG payload, yielding better LLM responses than using conventional or keyword-based search solutions. Amazon Kendra offers easy-to-use deep learning search models that are pre-trained on 14 domains and don’t require any ML expertise, so there’s no need to deal with word embeddings, document chunking, and other lower-level complexities typically required for RAG implementations. Amazon Kendra also comes with pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence, and websites, and supports common document formats such as HTML, Word, PowerPoint, PDF, Excel, and pure text files. To filter responses based on only those documents that the end-user permissions allow, Amazon Kendra offers connectors with access control list (ACL) support. Amazon Kendra also offers AWS Identity and Access Management (IAM) and AWS IAM Identity Center (successor to AWS Single Sign-On) integration for user-group information syncing with customer identity providers such as Okta and Azure AD.

In this post, we demonstrate how to implement a RAG workflow by combining the capabilities of Amazon Kendra with LLMs to create state-of-the-art GenAI applications providing conversational experiences over your enterprise content. After Amazon Bedrock launches, we will publish a follow-up post showing how to implement similar GenAI applications using Amazon Bedrock, so stay tuned.

Solution overview

The following diagram shows the architecture of a GenAI application with a RAG approach.

We use an Amazon Kendra index to ingest enterprise unstructured data from data sources such as wiki pages, MS SharePoint sites, Atlassian Confluence, and document repositories such as Amazon S3. When a user interacts with the GenAI app, the flow is as follows:

  1. The user makes a request to the GenAI app.
  2. The app issues a search query to the Amazon Kendra index based on the user request.
  3. The index returns search results with excerpts of relevant documents from the ingested enterprise data (see the sketch after this list).
  4. The app sends the user request, along with the data retrieved from the index, as context in the LLM prompt.
  5. The LLM returns a succinct response to the user request based on the retrieved data.
  6. The response from the LLM is sent back to the user.
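
The following sketch illustrates steps 2 and 3 using the Amazon Kendra Query API via boto3; the index ID and query text are placeholders, and in the sample applications this call is wrapped by the KendraIndexRetriever described below.

import boto3

# Steps 2-3 in practice: query the Amazon Kendra index and collect the top document excerpts.
# The index ID and query text are placeholders.
kendra = boto3.client("kendra", region_name="us-east-1")

response = kendra.query(IndexId="<YOUR-KENDRA-INDEX-ID>", QueryText="What is Amazon Lex?")
excerpts = [
    item["DocumentExcerpt"]["Text"]
    for item in response["ResultItems"][:3]
    if "DocumentExcerpt" in item
]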

With this architecture, you can choose the most suitable LLM for your use case. LLM options include models from our partners Hugging Face, AI21 Labs, Cohere, and others hosted on an Amazon SageMaker endpoint, as well as models by companies like Anthropic and OpenAI. With Amazon Bedrock, you will be able to choose Amazon Titan, Amazon’s own LLM, or partner LLMs such as those from AI21 Labs and Anthropic, with APIs securely and without the need for your data to leave the AWS ecosystem. The additional benefits that Amazon Bedrock will offer include a serverless architecture, a single API to call the supported LLMs, and a managed service to streamline the developer workflow.

For the best results, a GenAI app needs to engineer the prompt based on the user request and the specific LLM being used. Conversational AI apps also need to manage the chat history and the context. GenAI app developers can use open-source frameworks such as LangChain that provide modules to integrate with the LLM of choice, and orchestration tools for activities such as chat history management and prompt engineering. We have provided the KendraIndexRetriever class, which implements a LangChain retriever interface, which applications can use in conjunction with other LangChain interfaces such as chains to retrieve data from an Amazon Kendra index. We have also provided a few sample applications in the GitHub repo. You can deploy this solution in your AWS account using the step-by-step guide in this post.
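
The following is a minimal sketch of wiring the KendraIndexRetriever into a LangChain chain. The import path and constructor arguments are assumptions based on the sample repo described above, and the OpenAI wrapper is just one example of an LLM you could plug in; check the repo's README for the exact interface.

import os

from aws_langchain.kendra_index_retriever import KendraIndexRetriever  # assumed import path from the sample repo
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI  # any LangChain-supported LLM wrapper could be used instead

retriever = KendraIndexRetriever(
    kendraindex=os.environ["KENDRA_INDEX_ID"],  # assumed argument names
    awsregion=os.environ["AWS_REGION"],
    return_source_documents=True,
)

qa = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",  # stuff the retrieved excerpts into the prompt as context
    retriever=retriever,
    return_source_documents=True,
)

result = qa("What's SageMaker?")
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"))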

Prerequisites

For this tutorial, you’ll need a bash terminal with Python 3.9 or higher installed on Linux, Mac, or Windows Subsystem for Linux, and an AWS account. We also recommend using an AWS Cloud9 instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance.

Implement a RAG workflow

To configure your RAG workflow, complete the following steps:

  1. Use the provided AWS CloudFormation template to create a new Amazon Kendra index.

This template includes sample data containing AWS online documentation for Amazon Kendra, Amazon Lex, and Amazon SageMaker. Alternatively, if you have an Amazon Kendra index and have indexed your own dataset, you can use that. Launching the stack requires about 30 minutes, followed by about 15 minutes to synchronize it and ingest the data in the index. Therefore, wait for about 45 minutes after launching the stack. Note the index ID and AWS Region on the stack’s Outputs tab.

  2. For an improved GenAI experience, we recommend requesting an Amazon Kendra service quota increase for maximum DocumentExcerpt size, so that Amazon Kendra provides larger document excerpts to improve semantic context for the LLM.
  3. Install the AWS SDK for Python on the command line interface of your choice.
  4. If you want to use the sample web apps built using Streamlit, you first need to install Streamlit. This step is optional if you want to only run the command line versions of the sample applications.
  5. Install LangChain.
  6. The sample applications used in this tutorial require you to have access to one or more LLMs from Flan-T5-XL, Flan-T5-XXL, Anthropic Claude-V1, and OpenAI-text-davinci-003.
    1. If you want to use Flan-T5-XL or Flan-T5-XXL, deploy them to an endpoint for inference using Amazon SageMaker Studio JumpStart.
    2. If you want to work with Anthropic Claude-V1 or OpenAI-text-davinci-003, acquire the API keys for the LLMs of your interest from https://www.anthropic.com/ and https://openai.com/, respectively.
  7. Follow the instructions in the GitHub repo to install the KendraIndexRetriever interface and sample applications.
  8. Before you run the sample applications, you need to set environment variables with the Amazon Kendra index details and API keys of your preferred LLM or the SageMaker endpoints of your deployments for Flan-T5-XL or Flan-T5-XXL. The following is a sample script to set the environment variables:
    export AWS_REGION="<YOUR-AWS-REGION>"
    export KENDRA_INDEX_ID="<YOUR-KENDRA-INDEX-ID>"
    export FLAN_XL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XL>"
    export FLAN_XXL_ENDPOINT="<YOUR-SAGEMAKER-ENDPOINT-FOR-FLAN-T-XXL>"
    export OPENAI_API_KEY="<YOUR-OPEN-AI-API-KEY>"
    export ANTHROPIC_API_KEY="<YOUR-ANTHROPIC-API-KEY>"

  9. In a command line window, change to the samples subdirectory of the cloned GitHub repository. You can run the command line apps as python <sample-file-name.py>, or run the Streamlit web app with streamlit run app.py <anthropic|flanxl|flanxxl|openai>.
  10. Open the sample file kendra_retriever_flan_xxl.py in an editor of your choice.

Observe the statement result = run_chain(chain, "What's SageMaker?"). This is the user query (“What’s SageMaker?”) that’s being run through the chain, which uses Flan-T5-XXL as the LLM and Amazon Kendra as the retriever. When this file is run, you can observe the output as follows. The chain sent the user query to the Amazon Kendra index, retrieved the top three result excerpts, and sent them as the context in a prompt along with the query, to which the LLM responded with a succinct answer. It also provided the sources (the URLs to the documents used in generating the answer).

~. python3 kendra_retriever_flan_xxl.py
Amazon SageMaker is a machine learning service that lets you train and deploy models in the cloud.
Sources:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-projects-whatis.html
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
  11. Now let’s run the web app app.py as streamlit run app.py flanxxl. For this specific run, we are using the Flan-T5-XXL model as the LLM.

It opens a browser window with the web interface. You can enter a query, which in this case is “What is Amazon Lex?” As seen in the following screenshot, the application responds with an answer, and the Sources section provides the URLs to the documents from which the excerpts were retrieved from the Amazon Kendra index and sent to the LLM in the prompt as the context along with the query.

  12. Now let’s run app.py again and get a feel for the conversational experience using streamlit run app.py anthropic. Here the underlying LLM used is Anthropic Claude-V1.

As you can see in the following video, the LLM provides a detailed answer to the user’s query based on the documents it retrieved from the Amazon Kendra index and then supports the answer with the URLs to the source documents that were used to generate the answer. Note that the subsequent queries don’t explicitly mention Amazon Kendra; however, the ConversationalRetrievalChain (a type of chain that’s part of the LangChain framework and provides an easy mechanism to develop conversational applications based on information retrieved from retriever instances; it’s used in this LangChain application) manages the chat history and the context to get an appropriate response.

Also note that in the following screenshot, Amazon Kendra finds the extractive answer to the query and shortlists the top documents with excerpts. Then the LLM is able to generate a more succinct answer based on these retrieved excerpts.

In the following sections, we explore two use cases for using Generative AI with Amazon Kendra.

Use case 1: Generative AI for financial service companies

Financial organizations create and store data across various data repositories, including financial reports, legal documents, and whitepapers. They must adhere to strict government regulations and oversight, which means employees need to find relevant, accurate, and trustworthy information quickly. Additionally, searching and aggregating insights across various data sources is cumbersome and error prone. With Generative AI on AWS, users can quickly generate answers from various data sources and types, synthesizing accurate answers at enterprise scale.

We chose a solution using Amazon Kendra and AI21 Labs’ Jurassic-2 Jumbo Instruct LLM. With Amazon Kendra, you can easily ingest data from multiple data sources such as Amazon S3, websites, and ServiceNow. Then the solution uses AI21 Labs’ Jurassic-2 Jumbo Instruct LLM to carry out inference activities on enterprise data such as data summarization, report generation, and more. Amazon Kendra augments LLMs to provide accurate and verifiable information to the end-users, which reduces hallucination issues with LLMs. With the proposed solution, financial analysts can make faster decisions using accurate data to quickly build detailed and comprehensive portfolios. We plan to make this solution available as an open-source project in the near future.

Example

Using the Kendra Chatbot solution, financial analysts and auditors can interact with their enterprise data (financial reports and agreements) to find reliable answers to audit-related questions. Kendra ChatBot provides answers along with source links and has the capability to summarize longer answers. The following screenshot shows an example conversation with Kendra ChatBot.

Architecture overview

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Financial documents and agreements are stored on Amazon S3, and ingested to an Amazon Kendra index using the S3 data source connector.
  2. The LLM is hosted on a SageMaker endpoint.
  3. An Amazon Lex chatbot is used to interact with the user via the Amazon Lex web UI.
  4. The solution uses an AWS Lambda function with LangChain to orchestrate between Amazon Kendra, Amazon Lex, and the LLM.
  5. When users ask the Amazon Lex chatbot for answers from a financial document, Amazon Lex calls the LangChain orchestrator to fulfill the request.
  6. Based on the query, the LangChain orchestrator pulls the relevant financial records and paragraphs from Amazon Kendra.
  7. The LangChain orchestrator provides these relevant records to the LLM along with the query and relevant prompt to carry out the required activity.
  8. The LLM processes the request from the LangChain orchestrator and returns the result.
  9. The LangChain orchestrator gets the result from the LLM and sends it to the end-user through the Amazon Lex chatbot.

Use case 2: Generative AI for healthcare researchers and clinicians

Clinicians and researchers often analyze thousands of articles from medical journals or government health websites as part of their research. More importantly, they want trustworthy data sources they can use to validate and substantiate their findings. The process requires hours of intensive research, analysis, and data synthesis, lengthening the time to value and innovation. With Generative AI on AWS, you can connect to trusted data sources and run natural language queries to generate insights across these trusted data sources in seconds. You can also review the sources used to generate the response and validate its accuracy.

We chose a solution using Amazon Kendra and Flan-T5-XXL from Hugging Face. First, we use Amazon Kendra to identify text snippets from semantically relevant documents in the entire corpus. Then we use the power of an LLM such as Flan-T5-XXL to use the text snippets from Amazon Kendra as context and obtain a succinct natural language answer. In this approach, the Amazon Kendra index functions as the passage retriever component in the RAG mechanism. Lastly, we use Amazon Lex to power the front end, providing a seamless and responsive experience to end-users. We plan to make this solution available as an open-source project in the near future.

Example

The following screenshot is from a web UI built for the solution using the template available on GitHub. The text in pink shows responses from the Amazon Kendra LLM system, and the text in blue shows the user questions.

Architecture overview

The architecture and workflow for this solution are similar to those of use case 1.

Clean up

To save costs, delete all the resources you deployed as part of the tutorial. If you launched the CloudFormation stack, you can delete it via the AWS CloudFormation console. Similarly, you can delete any SageMaker endpoints you may have created via the SageMaker console.

Conclusion

Generative AI powered by large language models is changing how people acquire and apply insights from information. However, for enterprise use cases, the insights must be generated based on enterprise content to keep the answers in-domain and mitigate hallucinations, using the Retrieval Augmented Generation approach. In the RAG approach, the quality of the insights generated by the LLM depends on the semantic relevance of the retrieved information on which it is based, making it increasingly necessary to use solutions such as Amazon Kendra that provide high-accuracy semantic search results out of the box. With its comprehensive ecosystem of data source connectors, support for common file formats, and security, you can quickly start using Generative AI solutions for enterprise use cases with Amazon Kendra as the retrieval mechanism.

For more information on working with Generative AI on AWS, refer to Announcing New Tools for Building with Generative AI on AWS. You can start experimenting and building RAG proofs of concept (POCs) for your enterprise GenAI apps using the method outlined in this blog. As mentioned earlier, once Amazon Bedrock is available, we will publish a follow-up blog showing how you can build RAG using Amazon Bedrock.


About the authors

Abhinav Jawadekar is a Principal Solutions Architect focused on Amazon Kendra in the AI/ML language services team at AWS. Abhinav works with AWS customers and partners to help them build intelligent search solutions on AWS.

Jean-Pierre Dodel is the Principal Product Manager for Amazon Kendra and leads key strategic product capabilities and roadmap prioritization. He brings extensive enterprise search and ML/AI experience to the team, having held leading roles at Autonomy, HP, and search startups before joining Amazon 7 years ago.

Mithil Shah is an ML/AI Specialist at AWS. Currently, he helps public sector customers improve the lives of citizens by building machine learning solutions on AWS.

Firaz Akmal is a Sr. Product Manager for Amazon Kendra at AWS. He is a customer advocate, helping customers understand their search and generative AI use-cases with Kendra on AWS. Outside of work Firaz enjoys spending time in the mountains of the PNW or experiencing the world through his daughter’s perspective.

Abhishek Maligehalli Shivalingaiah is a Senior AI Services Solution Architect at AWS with a focus on Amazon Kendra. He is passionate about building applications using Amazon Kendra, generative AI, and NLP. He has around 10 years of experience in building data and AI solutions to create value for customers and enterprises. He has built a (personal) chatbot for fun to answer questions about his career and professional journey. Outside of work, he enjoys making portraits of family and friends, and loves creating artwork.


Implement backup and recovery using an event-driven serverless architecture with Amazon SageMaker Studio

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). It provides a single, web-based visual interface where you can perform all the ML development steps required to build, train, tune, debug, deploy, and monitor models. It gives data scientists all the tools they need to take ML models from experimentation to production without leaving the IDE. Moreover, as of November 2022, Studio supports shared spaces to accelerate real-time collaboration and multiple Amazon SageMaker domains in a single AWS Region for each account.

There are two prevailing use cases for Studio domain backup and recovery. The first use case involves a customer business unit and project wanting a functionality to replicate data scientists’ artifacts and data files to any target domains and profiles at will. The second use case involves the replication only when the domain and profile are deleted due to conditions such as the change from a customer-managed key to an AWS-managed key or a change of onboarding from AWS Identity and Access Management (IAM) authentication (see Onboard to Amazon SageMaker Domain Using IAM) to AWS IAM Identity Center (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

This post mainly covers the second use case by presenting how to back up and recover users’ work when the user and space profiles are deleted and recreated, but we also provide the Python script to support the first use case.

When the user and space profiles are recreated in the existing Studio domain, a new ID for the profile directory is created within the Studio Amazon Elastic File System (Amazon EFS) volume. As a result, Studio users could lose access to the model artifacts and data files stored in their previous profile directory if those profiles are deleted. Additionally, Studio domains don’t currently support mounting custom or additional EFS volumes. We recommend keeping the previous Studio EFS volume as a backup using RetentionPolicy in Studio.
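
For reference, this retention behavior is controlled at domain deletion time; a minimal boto3 sketch (the domain ID is a placeholder) looks like the following:

import boto3

# Keep the home EFS volume when deleting a Studio domain so user files can be recovered later.
sm = boto3.client("sagemaker")
sm.delete_domain(
    DomainId="d-xxxxxxxxxxxx",  # placeholder domain ID
    RetentionPolicy={"HomeEfsFileSystem": "Retain"},  # use "Delete" to remove the volume instead
)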

Therefore, a proper recovery solution needs to be implemented to access the data from the previous directory in case of profile deletion or to recover files from a detached volume in case of domain deletion. Data scientists can minimize the potential impacts of deleting the domain and profiles if they frequently commit their code to the repository and utilize external storage for data access. However, having the capability to back up and recover the data scientist’s workspace is another layer to ensure their continuity of work, which may increase their productivity. Moreover, if you have tens and hundreds of Studio users, consider how to automate the recovery process to avoid mistakes and save costs and time. To solve this problem, we provide a solution to supplement Studio domain recovery.

This post explains the backup and recovery module and one approach to automate the process using an event-driven architecture. First, we demonstrate how to perform backup and recovery if you create a new Studio domain, user, and space profiles using AWS CloudFormation templates. Next, we explain the required steps to test our recovery solution using the existing domain and profiles without using our CloudFormation templates (you can use your own templates). Although this post focuses on a single domain setting, our solution works for multiple Studio domains as well. Finally, we have automated the provisioning of all resources using the AWS Serverless Application Model (AWS SAM), an open-source framework for building serverless applications.

Solution overview

The following diagram illustrates the high-level workflow of Studio domain backup and recovery with an event-driven architecture.

technical architecture

The event-driven app includes the following steps:

  1. An Amazon CloudWatch Events rule uses AWS CloudTrail to track CreateUserProfile and CreateSpace API calls; these calls trigger the rule and invoke the AWS Lambda function.
  2. The function updates the user table and appends items in the history table in Amazon DynamoDB. In addition, the database layer keeps track of the domain and profile name and file system mapping.

The following image shows the DynamoDB tables structure. The partition key and sort key in the studioUser table consist of the profile and domain name. The replication column holds the replication flag with true as the default value. In addition, bytes_written, bytes_file_transferred, total_duration_ms, and replication_status fields are populated when the replication completes successfully.

table schema
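
For illustration, the item that the event-handler Lambda writes might look like the following; the attribute names here are assumptions based on the schema description above, not necessarily the exact names used in the repo.

import boto3

# Hypothetical sketch of the item written to the studioUser table when a profile is created.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("studioUser")

table.put_item(
    Item={
        "profileName": "user1",  # key attributes built from the profile and domain name (assumed names)
        "domainName": "demo-myapp-dev-studio-domain",
        "homeEfsFileSystemUid": "200001",  # profile directory ID within the Studio EFS volume
        "replication": True,  # default replication flag
    }
)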

The database layer can be replaced by other services, such as Amazon Relational Database Service (Amazon RDS) or Amazon Simple Storage Service (Amazon S3). However, we chose DynamoDB because of the Amazon DynamoDB Streams feature.

  3. DynamoDB Streams is enabled on the user table, and the Lambda function is set as a trigger and synchronously invoked when new stream records are available.
  4. Another Lambda function triggers the process to restore the files using the user and space files restore tools.

The backup and recovery workflow includes the following steps:

  1. The backup and recovery workflow consists of AWS Step Functions, integrated with other AWS services, including AWS DataSync, to orchestrate the recovery of the user and space files from the previous directory to a new directory within the same Studio domain EFS volume (profile recreation) or to a new domain EFS volume (domain recreation); see the DataSync sketch after this list. With Step Functions Workflow Studio, the workflow can be implemented with no code (such as in this case) or low code for a more customized solution. The Step Functions state machine is invoked when the event-driven app detects the profile creation event. For each profile, the state machine runs a DataSync task to copy all files from the previous directory to the new directory.

The following image is the actual graph of the Step Functions state machine. Note that the ListApp* step ensures the profile directories are populated in the Studio EFS volume before proceeding. Also, we implemented retries with exponential backoff to handle API throttling for the DataSync CreateLocationEfs and CreateTask API calls.

step functions diagram

  2. When the users open Studio, all the files from their previous directory are available so they can continue their work. In our experiment, the DataSync job replicating one gigabyte of data took approximately 1 minute.
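
The following is a hedged sketch of the DataSync calls the state machine makes for a single profile; all ARNs and subdirectory IDs are placeholders, and the real workflow derives these values from the DynamoDB tables.

import boto3

# Sketch of the per-profile DataSync replication run by the Step Functions workflow.
datasync = boto3.client("datasync")

source = datasync.create_location_efs(
    EfsFilesystemArn="arn:aws:elasticfilesystem:us-east-1:111122223333:file-system/fs-old",
    Subdirectory="/200001",  # previous profile directory
    Ec2Config={
        "SubnetArn": "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0abc123",
        "SecurityGroupArns": ["arn:aws:ec2:us-east-1:111122223333:security-group/sg-0abc123"],
    },
)
destination = datasync.create_location_efs(
    EfsFilesystemArn="arn:aws:elasticfilesystem:us-east-1:111122223333:file-system/fs-new",
    Subdirectory="/200002",  # newly created profile directory
    Ec2Config={
        "SubnetArn": "arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0abc123",
        "SecurityGroupArns": ["arn:aws:ec2:us-east-1:111122223333:security-group/sg-0abc123"],
    },
)

task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="studio-profile-restore",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])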

The following are services that will be used as part of the solution:

Prerequisites

To implement this solution, you must have the following prerequisites:

  • An AWS account if you don’t already have one. The IAM user that you use must have sufficient permissions to make the necessary AWS service calls and manage AWS resources.
  • The AWS SAM CLI installed and configured.
  • Your AWS credentials set up.
  • Git installed.
  • Python 3.9.
  • A Studio profile and domain name combination that is unique across all Studio domains within a Region and account.
  • You need an existing Amazon VPC and S3 bucket to follow the deployment steps.
  • Also, be aware of the service quota for the maximum number of DataSync tasks per account per Region (default is 100). You can request a quota increase to meet the number of replication tasks for your use case.

Refer to the AWS Regional Services List for service availability based on Region. Additionally, review Amazon SageMaker endpoints and quotas.

Set up a Studio profile recovery infrastructure

The following diagram shows the logical steps for a SageMaker administrator to set up the Studio user and space recovery infrastructure, which a single command can complete with our automated solution.

logical flow 1

To set up the environment, clone the GitHub repo in the terminal:

git clone https://github.com/aws-samples/sagemaker-studio-efs-recovery-serverless.git && cd sagemaker-studio-efs-recovery-serverless

The following code shows the deployment script usage:

bash deploy.sh -h

Usage: deploy.sh [-n <stack_name>] [-v <vpc_id>] [-s <subnet_id>] [-b <s3_bucket>] [-r <aws_region>] [-d]
Options:
  -n: specify stack name
  -v: specify your vpc id
  -s: specify subnet
  -b: specify s3 bucket name to store artifacts
  -r: specify aws region
  -d: whether to skip a creation of a new SageMaker Studio Domain (default: no)

To create a new Amazon SageMaker domain, run the following command. You need to specify which Amazon VPC and subnet you want to use. We use VPC-only mode for the Studio deployment. If you don’t have any preference, you can use the default VPC and subnet. Also, specify any stack name, AWS Region, and S3 bucket name for AWS SAM to deploy the Lambda function:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

If you want to use an existing Studio domain, run the following command. Option -d yes will skip creating a new Studio domain:

bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

For the existing domains, the SageMaker administrator must also update the source and target Studio EFS security groups to allow connection to the user and space file restore tool. For example, to run the following command, you need to specify HomeEfsFileSystemId, the EFS file system ID, and SecurityGroupId used by the user and space file restore tool (we discuss this in more detail later in the post):

python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

User and space recovery logical flow

The following diagram shows the logical user and space recovery flow, so that a SageMaker administrator can understand how the solution works; no additional setup is required. If the profile (user or space) and domain are accidentally deleted, the EFS volume is detached but not deleted. A possible scenario is that we may want to revert the deletion by recreating a new domain and profiles. If the same profiles are being onboarded again, they may wish to access the files from their respective workspace in the detached volume. The recovery process is almost entirely automated; the only action required by the SageMaker administrator is to recreate the Studio domain and profiles using the same CloudFormation template. The rest of the steps are automated.

logical flow 2

Optionally, if the SageMaker admin wants control over replication, run the following command to turn off replication for specific domains and profiles. This script updates the replication field given the domain and profile name in the table. Note that you need to run the script for the same user each time they get recreated.

python3 src/update-replication-flag.py --profile-name <profile_name> --domain-name <domain_name> --region <aws_region> --no-replication

The following optional step provides the solution for the first use case to allow replication to take place between the specified source file system to any target domain and profile name. If the SageMaker admin wants to replicate particular profile data to a different domain and a profile that doesn’t exist yet, run the following command. The script inserts the new domain and profile name with the specified source file system information. The subsequent profile creation will trigger the replication task. Note that you need to run add-security-group.py from the previous step to allow connection to the file restore tool.

python3 src/add-replication-target.py --src-profile-name <profile_name> --src-domain-name <domain_name> --target-profile-name <profile_name> --target-domain-name <domain_name> --region <aws_region>

In the following sections, we test two scenarios to confirm that the solution works as expected.

Create a new Studio domain

Our first test scenario assumes you are starting from scratch and want to create a new Studio domain and profiles in your environment using our templates. Then we deploy the Studio domain, user and space, backup and recovery workflow, and event app. The purpose of the first scenario is to confirm that the profile file is recovered in the new home directory automatically when the profile is deleted and recreated within the same Studio domain.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*
    4. <stack_name>-StudioDomain-*
    5. <stack_name>-StudioUser1-*
    6. <stack_name>-StudioSpace-*

cloud formation console

If the deployment failed for any stack, check the error and resolve the issues before proceeding to the next step.

  3. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  4. Select studioUser and choose Explore table items to confirm that items for user1 and space1 are populated in the table.
  5. On the SageMaker console, choose Domains in the navigation pane.
  6. Choose demo-myapp-dev-studio-domain.
  7. On the User profiles tab, select user1, choose Launch, and choose Studio to open Studio for the user.

Note that Studio may take 10-15 minutes to load for the first time.

  8. On the File menu, choose Terminal to launch a new terminal within Studio.
  9. Run the following command in the terminal to create a file for testing:
    echo "i don't want to lose access to this file" > user1.txt

  10. Repeat these steps for space1 (choose Spaces in Step 7). Feel free to create a file of your choice.
  11. Delete the Studio user user1 and the space space1 by removing the nested stacks <stack_name>-StudioUser1-* and <stack_name>-StudioSpace-* from the parent. Delete the stacks by commenting out the following code blocks from the AWS SAM template file, template.yaml. Make sure to save the file after the edit:
    StudioUser1:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-user.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioUserProfileName: !Ref StudioUserProfileName1
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName
    StudioSpace:
      Type: AWS::Serverless::Application
      Condition: CreateDomainCond
      DependsOn: StudioDomain
      Properties:
        Location: Infrastructure/Templates/sagemaker-studio-space.yaml
        Parameters:
          LambdaLayerArn: !GetAtt DemoBootstrap.Outputs.LambdaLayerArn
          StudioSpaceName: !Ref StudioSpaceName
          UID: !Ref UID
          Env: !Ref Env
          AppName: !Ref AppName

  12. Run the following command to deploy the stack with this change:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

  13. Recreate the Studio profiles by adding the stack back to the parent. Uncomment the code block from the previous step, save the file, and run the same command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region>

After a successful deployment, you can check the results.

  14. On the AWS CloudFormation console, choose the stack <stack_name>-StepFunction-*.
  15. In the stack, choose the value for Physical ID of StepFunction in the Resources section.
  16. Choose the most recent run and confirm its status in Graph view.

It should look like the following screenshot for the user profile replication. You can also check the other run to ensure the same for the space profile.

step functions complete

  17. If you completed Steps 5–10, open the Studio domain for user1 and confirm that the user1.txt file is copied to the newly created directory.

The file should not be visible in the space1 directory, because the replication keeps the same file ownership.

  18. Repeat this step for space1.
  19. On the DataSync console, choose the most recent task ID.
  20. Choose History and the most recent run ID.

This is another way to inspect the configurations and the run status of the DataSync task. As an example, the following screenshot shows the task result for user1 directory replication.

datasync complete

We only covered profile recreation in this scenario. However, our solution works in the same way for Studio domain recreation, and it can be tested by deleting and recreating the domain.

Use an existing Studio domain

Our second test scenario assumes you want to use the existing SageMaker domain and profiles in the environment. Therefore, we only deploy the backup and recovery workflow and the event app. Again, you can use your own Studio CloudFormation template or create them through the AWS CloudFormation console to follow along. Because we’re using the existing Studio domain, the solution will list the current user and space for all domains within the Region, which we call seeding.

Complete the following steps:

  1. To deploy the application, run the following command:
    bash deploy.sh -v <vpc_id> -s <subnet_id> -b <s3_bucket_name> -n <stack_name> -r <aws_region> -d yes

  2. On the AWS CloudFormation console, ensure the following stacks are in CREATE_COMPLETE status:
    1. <stack_name>-DemoBootstrap-*
    2. <stack_name>-StepFunction-*
    3. <stack_name>-EventApp-*

If the deployment failed for any stack, check the error and resolve the issues before proceeding to the next step.

  3. Verify the initial data seed has completed.
  4. On the DynamoDB console, choose Tables in the navigation pane and confirm that the studioUser and studioUserHistory tables are created.
  5. Choose studioUser and choose Explore table items to confirm that items for the existing Studio domain are populated in the table.

Proceed to the next step only if the seed has completed successfully. If the tables aren’t populated, check the CloudWatch logs of the corresponding Lambda function. On the AWS CloudFormation console, choose the stack <stack_name>-EventApp-*, and choose the physical ID of DDBSeedLambda in the Resources section. Under Monitor, choose View CloudWatch Logs and check the logs for the most recent run to troubleshoot.

  6. To update the EFS security group, first get the SecurityGroupId. We use the security group created by the CloudFormation template, which allows all traffic in the outbound connection. Run the following command:
    echo "SecurityGroupId:" $(aws ssm get-parameter --name /network/vpc/sagemaker/securitygroups --region <aws_region> --query 'Parameter.Value')

  7. Get the HomeEfsFileSystemId, which is the ID of the Studio home EFS volume. Run the following command:
    echo "HomeEfsFileSystemId:" $(aws sagemaker describe-domain --domain-id <domain_id> --region <aws_region> --query 'HomeEfsFileSystemId')

  8. Finally, update the EFS security group by allowing inbound traffic from the security group shared with the DataSync task on port 2049. Run the following command:
    python3 src/add-security-group.py --efs-id <HomeEfsFileSystemId> --security-groups <SecurityGroupId> --region <aws_region>

  9. Delete and recreate the Studio profiles of your choice using the same profile name.
  10. Confirm the run status of the Step Functions state machine and recovery of the Studio profile directory by following the steps from the first scenario.

You can also test the Step Functions workflow manually with your choice of source and target inputs for replication (more details found in README.md in the GitHub repository).

Clean up

Run the following commands to clean up your resources:

sam delete --region <aws_region> --no-prompts --stack-name <stack_name>

Manually delete the SageMakerSecurityGroup after 20 minutes or so. Deletion of the Elastic Network Interface (ENI) can make the stack show as DELETE_IN_PROGRESS for some time, so we intentionally set the security group to be retained. Also, you need to disassociate that security group from the security group managed by SageMaker before you can delete it.

Conclusion

Studio is a powerful IDE that allows data scientists to quickly develop, train, test, and deploy models. This post discusses how to back up and recover the files stored in a data scientist’s home and shared space directory. We also demonstrated how an event-driven architecture can help automate the recovery process.

Our solution can help improve the resiliency of data scientists’ artifacts within Studio, leading to operational efficiency on the AWS Cloud. Also, the solution is modular, so you can use the necessary components and update them for your usage. For instance, the enhancement to this solution might be a cross-account replication. We hope that what we demonstrated in the post will be a helpful resource to support those ideas.

To get started with Studio, check out Amazon SageMaker for Data Scientists. Please send us feedback on the AWS forum for SageMaker or through your AWS support contacts. You can find other Studio examples in our GitHub repository.


About the Authors

Kenny Sato is a Machine Learning Engineer at AWS, guiding customers in architecting and implementing machine learning solutions. He received his master’s in Computer Engineering from Virginia Tech and is pursuing a PhD in Computer Science. In his spare time, you can find him in his backyard or out somewhere playing with his lovely daughters.

Gautam Nambiar is a DevOps Consultant with AWS. He is particularly interested in architecting and building automated solutions, MLOps pipelines, and creating reusable and secure DevOps best practice patterns. In his spare time, he likes playing and watching soccer.


Optimized PyTorch 2.0 inference with AWS Graviton processors

New generations of CPUs offer a significant performance improvement in machine learning (ML) inference due to specialized built-in instructions. Combined with their flexibility, high speed of development, and low operating cost, these general-purpose processors offer an alternative to other existing hardware solutions.

AWS, Arm, Meta, and others helped optimize the performance of PyTorch 2.0 inference for Arm-based processors. As a result, we are delighted to announce that AWS Graviton-based instance inference performance for PyTorch 2.0 is up to 3.5 times the speed for ResNet50 compared to the previous PyTorch release (see the following graph), and up to 1.4 times the speed for BERT, making Graviton-based instances the fastest compute-optimized instances on AWS for these models.

AWS measured up to 50% cost savings for PyTorch inference with AWS Graviton3-based Amazon Elastic Compute Cloud (Amazon EC2) C7g instances across Torch Hub ResNet50 and multiple Hugging Face models, relative to comparable EC2 instances, as shown in the following figure.

Additionally, the latency of inference is also reduced, as shown in the following figure.

We have seen a similar trend in the price-performance advantage for other workloads on Graviton, for example video encoding with FFmpeg.

Optimization details

The optimizations focused on three key areas:

  • GEMM kernels – PyTorch supports Arm Compute Library (ACL) GEMM kernels via the OneDNN backend (previously called MKL-DNN) for Arm-based processors. The ACL library provides Neon and SVE optimized GEMM kernels for both fp32 and bfloat16 formats. These kernels improve the SIMD hardware utilization and reduce the end-to-end inference latencies.
  • bfloat16 support – The bfloat16 support in Graviton3 allows for efficient deployment of models trained using bfloat16, fp32, and AMP (Automatic Mixed Precision). The standard fp32 models use bfloat16 kernels via OneDNN fast math mode, without model quantization, providing up to two times faster performance compared to the existing fp32 model inference without bfloat16 fast math support.
  • Primitive caching – We also implemented primitive caching for conv, matmul, and inner product operators to avoid redundant GEMM kernel initialization and tensor allocation overhead.

How to take advantage of the optimizations

The simplest way to get started is by using the AWS Deep Learning Containers (DLCs) on Amazon Elastic Compute Cloud (Amazon EC2) C7g instances or Amazon SageMaker. DLCs are available on Amazon Elastic Container Registry (Amazon ECR) for AWS Graviton or x86. For more details on SageMaker, refer to Run machine learning inference workloads on AWS Graviton-based instances with Amazon SageMaker and Amazon SageMaker adds eight new Graviton-based instances for model deployment.

Use AWS DLCs

To use AWS DLCs, use the following code:

sudo apt-get update
sudo apt-get -y install awscli docker

# Login to ECR to avoid image download throttling
aws ecr get-login-password --region us-east-1 \
| docker login --username AWS \
  --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com

# Pull the AWS DLC for pytorch
# Graviton
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.0-cpu-py310-ubuntu20.04-ec2

# x86
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.0-cpu-py310-ubuntu20.04-ec2

If you prefer to install PyTorch via pip, install the PyTorch 2.0 wheel from the official repo. In this case, you must set two environment variables, as shown in the following code, before launching PyTorch to activate the Graviton optimizations.

Use the Python wheel

To use the Python wheel, refer to the following code:

# Install Python
sudo apt-get update
sudo apt-get install -y python3 python3-pip

# Upgrade pip3 to the latest version
python3 -m pip install --upgrade pip

# Install PyTorch and extensions
python3 -m pip install torch
python3 -m pip install torchvision torchaudio torchtext

# Turn on Graviton3 optimization
export DNNL_DEFAULT_FPMATH_MODE=BF16
export LRU_CACHE_CAPACITY=1024
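
As a quick sanity check that is not part of the official instructions, you can time a torchvision ResNet50 forward pass once before and once after exporting the preceding variables; this minimal sketch uses random weights and an illustrative input shape, so treat the numbers only as a relative comparison on your own instance:

import time
import torch
import torchvision.models as models

# Build an untrained ResNet50 purely to measure forward-pass latency
model = models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.inference_mode():
    for _ in range(10):      # warmup iterations
        model(x)
    start = time.time()
    for _ in range(100):
        model(x)

print(f"Average latency: {(time.time() - start) / 100 * 1000:.2f} ms")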

Run inference

You can use PyTorch TorchBench to measure the CPU inference performance improvements, or to compare different instance types:

# Pre-requisite: 
# pull and run the AWS DLC
# or 
# pip install PyTorch2.0 wheels and set the previously mentioned environment variables

# Clone PyTorch benchmark repo
git clone https://github.com/pytorch/benchmark.git

# Setup Resnet50 benchmark
cd benchmark
python3 install.py resnet50

# Install the dependent wheels
python3 -m pip install numba

# Run Resnet50 inference in jit mode. On successful completion of the inference runs,
# the script prints the inference latency and accuracy results
python3 run.py resnet50 -d cpu -m jit -t eval --use_cosine_similarity

Benchmarking

You can use the Amazon SageMaker Inference Recommender utility to automate performance benchmarking across different instances. With Inference Recommender, you can find the real-time inference endpoint that delivers the best performance at the lowest cost for a given ML model. We collected the preceding data using the Inference Recommender notebooks by deploying the models on production endpoints. For more details on Inference Recommender, refer to the GitHub repo. We benchmarked the following models for this post: ResNet50 image classification, DistilBERT sentiment analysis, RoBERTa fill mask, and RoBERTa sentiment analysis.
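
For reference, a default Inference Recommender job can be started with a few lines of boto3; the following is a minimal sketch in which the job name, IAM role ARN, and model package ARN are placeholders you would replace with your own values.

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_inference_recommendations_job(
    JobName="pytorch-graviton-recommender",  # placeholder job name
    JobType="Default",                       # benchmarks a default set of instance types
    RoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role
    InputConfig={
        # Placeholder ARN of a model package registered in SageMaker Model Registry
        "ModelPackageVersionArn": "arn:aws:sagemaker:us-east-1:111122223333:model-package/resnet50/1"
    },
)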

Conclusion

AWS measured up to 50% cost savings for PyTorch inference with AWS Graviton3-based Amazon EC2 C7g instances across Torch Hub Resnet50, and multiple Hugging Face models relative to comparable EC2 instances. These instances are available on SageMaker and Amazon EC2. The AWS Graviton Technical Guide provides the list of optimized libraries and best practices that will help you achieve cost benefits with Graviton instances across different workloads.

If you find use cases where similar performance gains aren’t observed on AWS Graviton, please open an issue on the AWS Graviton Technical Guide to let us know about it. We will continue to add more performance improvements to make Graviton the most cost-effective and efficient general-purpose processor for inference using PyTorch.


About the author

Sunita Nadampalli is a Software Development Manager at AWS. She leads Graviton software performance optimizations for machine learning, HPC, and multimedia workloads. She is passionate about open-source development and delivering cost-effective software solutions with Arm SoCs.

Read More

How Vericast optimized feature engineering using Amazon SageMaker Processing

How Vericast optimized feature engineering using Amazon SageMaker Processing

This post is co-written by Jyoti Sharma and Sharmo Sarkar from Vericast.

For any machine learning (ML) problem, the data scientist begins by working with data. This includes gathering, exploring, and understanding the business and technical aspects of the data, along with evaluation of any manipulations that may be needed for the model building process. One aspect of this data preparation is feature engineering.

Feature engineering refers to the process where relevant variables are identified, selected, and manipulated to transform the raw data into more useful and usable forms for use with the ML algorithm used to train a model and perform inference against it. The goal of this process is to increase the performance of the algorithm and resulting predictive model. The feature engineering process entails several stages, including feature creation, data transformation, feature extraction, and feature selection.

Building a platform for generalized feature engineering is a common task for customers needing to produce many ML models with differing datasets. This kind of platform includes the creation of a programmatically driven process to produce finalized, feature engineered data ready for model training with little human intervention. However, generalizing feature engineering is challenging. Each business problem is different, each dataset is different, data volumes vary wildly from client to client, and both data quality and, in the case of structured data, the cardinality of certain columns can play a significant role in the complexity of the feature engineering process. Furthermore, the dynamic nature of a customer’s data can also result in a large variance in the processing time and resources required to optimally complete the feature engineering.

AWS customer Vericast is a marketing solutions company that makes data-driven decisions to boost marketing ROIs for its clients. Vericast’s internal cloud-based Machine Learning Platform, built around the CRISP-ML(Q) process, uses various AWS services, including Amazon SageMaker, Amazon SageMaker Processing, AWS Lambda, and AWS Step Functions, to produce the best possible models that are tailored to the specific client’s data. This platform aims at capturing the repeatability of the steps that go into building various ML workflows and bundling them into standard generalizable workflow modules within the platform.

In this post, we share how Vericast optimized feature engineering using SageMaker Processing.

Solution overview

Vericast’s Machine Learning Platform aids in the quicker deployment of new business models based on existing workflows or quicker activation of existing models for new clients. For example, a model predicting direct mail propensity is quite different from a model predicting discount coupon sensitivity of the customers of a Vericast client. They solve different business problems and therefore have different usage scenarios in a marketing campaign design. But from an ML standpoint, both can be construed as binary classification models, and therefore could share many common steps from an ML workflow perspective, including model tuning and training, evaluation, interpretability, deployment, and inference.

Because these models are binary classification problems (in ML terms), we are separating the customers of a company into two classes (binary): those that would respond positively to the campaign and those that would not. Furthermore, these examples are considered an imbalanced classification because the data used to train the model wouldn’t contain an equal number of customers who would and would not respond favorably.

The actual creation of a model such as this follows the generalized pattern shown in the following diagram.

Typical Imbalanced Class Binary Classification Model Training Process

Most of this process is the same for any binary classification except for the feature engineering step. This is perhaps the most complicated yet at times overlooked step in the process. ML models are largely dependent on the features used to create them.

Vericast’s cloud-native Machine Learning Platform aims to generalize and automate the feature engineering steps for various ML workflows and optimize their performance on a cost vs. time metric by using the following features:

  • The platform’s feature engineering library – This consists of an ever-evolving set of transformations that have been tested to yield high-quality generalizable features based on specific client concepts (for example, customer demographics, product details, transaction details, and so on).
  • Intelligent resource optimizers – The platform uses AWS’s on-demand infrastructure capability to spin up the most optimal type of processing resources for the particular feature engineering job based on the expected complexity of the step and the amount of data it needs to churn through.
  • Dynamic scaling of feature engineering jobs – A combination of various AWS services is used for this, but most notably SageMaker Processing. This ensures that the platform produces high-quality features in a cost-efficient and timely manner.

This post is focused around the third point in this list and shows how to achieve dynamic scaling of SageMaker Processing jobs to achieve a more managed, performant, and cost-effective data processing framework for large data volumes.

SageMaker Processing enables workloads that run steps for data preprocessing or postprocessing, feature engineering, data validation, and model evaluation on SageMaker. It also provides a managed environment and removes the complexity of undifferentiated heavy lifting required to set up and maintain the infrastructure needed to run the workloads. Furthermore, SageMaker Processing provides an API interface for running, monitoring, and evaluating the workload.

Running SageMaker Processing jobs takes place fully within a managed SageMaker cluster, with individual jobs placed into instance containers at run time. The managed cluster, instances, and containers report metrics to Amazon CloudWatch, including usage of GPU, CPU, memory, GPU memory, disk metrics, and event logging.

These features provide benefits to Vericast data engineers and scientists by assisting in the development of generalized preprocessing workflows and abstracting the difficulty of maintaining generated environments in which to run them. Technical problems can arise, however, given the dynamic nature of the data and its varied features that can be fed into such a general solution. The system must make an educated initial guess as to the size of the cluster and instances that compose it. This guess needs to evaluate criteria of the data and infer the CPU, memory, and disk requirements. This guess may be wholly appropriate and perform adequately for the job, but in other cases it may not. For a given dataset and preprocessing job, the CPU may be undersized, resulting in maxed out processing performance and lengthy times to complete. Worse yet, memory could become an issue, resulting in either poor performance or out of memory events causing the entire job to fail.

With these technical hurdles in mind, Vericast set out to create a solution. The solution needed to remain general in nature, fit into the larger picture of the preprocessing workflow, and stay flexible in the steps involved. It was also important to handle both the potential need to scale up the environment when performance was compromised and the ability to gracefully recover from such an event or from a job ending prematurely for any reason.

The solution built by Vericast to solve this issue uses several AWS services working together to reach their business objectives. It was designed to restart and scale up the SageMaker Processing cluster based on performance metrics observed using Lambda functions monitoring the jobs. To not lose work when a scaling event takes place or to recover from a job unexpectedly stopping, a checkpoint-based service was put in place that uses Amazon DynamoDB and stores the partially processed data in Amazon Simple Storage Service (Amazon S3) buckets as steps complete. The final outcome is an auto scaling, robust, and dynamically monitored solution.

The following diagram shows a high-level overview of how the system works.

Solution Architecture Diagram

In the following sections, we discuss the solution components in more detail.

Initializing the solution

The system assumes that a separate process initiates the solution. The solution is not designed to work alone, because it doesn’t yield any artifacts or output on its own; rather, it acts as a sidecar implementation to one of the systems that use SageMaker Processing jobs. In Vericast’s case, the solution is initiated by way of a call from a Step Functions step started in another module of the larger system.

Once the solution is initiated and a first run is triggered, a base standard configuration is read from a DynamoDB table. This configuration is used to set parameters for the SageMaker Processing job and holds the initial assumptions about infrastructure needs. The SageMaker Processing job is now started.
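
The following is a simplified boto3 sketch of this step; the DynamoDB table, key, and configuration attribute names are hypothetical stand-ins for Vericast’s internal schema.

import boto3

dynamodb = boto3.resource("dynamodb")
sm_client = boto3.client("sagemaker")

# Read the base configuration for this workflow (table and attribute names are hypothetical)
config = dynamodb.Table("processing-base-config").get_item(
    Key={"workflow_id": "feature-engineering"}
)["Item"]

# Start the SageMaker Processing job with the initial infrastructure assumptions
sm_client.create_processing_job(
    ProcessingJobName="feature-engineering-run-001",
    RoleArn=config["role_arn"],
    AppSpecification={"ImageUri": config["image_uri"]},
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": int(config["instance_count"]),
            "InstanceType": config["instance_type"],
            "VolumeSizeInGB": int(config["volume_size_gb"]),
        }
    },
)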

Monitoring metadata and output

When the job starts, a Lambda function writes the job processing metadata (the current job configuration and other log information) into the DynamoDB log table. This metadata and log information maintains a history of the job, its initial and ongoing configuration, and other important data.

At certain points, as steps complete in the job, checkpoint data is added to the DynamoDB log table. Processed output data is moved to Amazon S3 for quick recovery if needed.
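
A checkpoint entry might be recorded along the following lines; the table, bucket, and step names below are hypothetical and only illustrate the pattern.

import time
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

# Record a checkpoint for a completed step in the log table
dynamodb.Table("processing-job-log").put_item(
    Item={
        "job_name": "feature-engineering-run-001",
        "step": "categorical-encoding",
        "status": "COMPLETE",
        "timestamp": int(time.time()),
        "output_s3_uri": "s3://example-feature-bucket/checkpoints/categorical-encoding/",
    }
)

# Stage the partially processed output so a restarted job can resume from this step
s3.upload_file(
    "/opt/ml/processing/output/encoded.parquet",
    "example-feature-bucket",
    "checkpoints/categorical-encoding/encoded.parquet",
)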

This Lambda function also sets up an Amazon EventBridge rule that monitors the running job for its state. Specifically, this rule is watching the job to observe if the job status changes to stopping or is in a stopped state. This EventBridge rule plays an important part in restarting a job if there is a failure or a planned auto scaling event occurs.
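
A sketch of such a rule, assuming a hypothetical rule name and Lambda ARN, might look like the following; SageMaker emits these state changes to EventBridge under the SageMaker Processing Job State Change detail type.

import json
import boto3

events = boto3.client("events")

# Watch the running job for Stopping/Stopped states
events.put_rule(
    Name="processing-job-state-watch",  # hypothetical rule name
    EventPattern=json.dumps(
        {
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker Processing Job State Change"],
            "detail": {
                "ProcessingJobName": ["feature-engineering-run-001"],
                "ProcessingJobStatus": ["Stopping", "Stopped"],
            },
        }
    ),
)

# Route matching events to the Lambda function that restarts the job
events.put_targets(
    Rule="processing-job-state-watch",
    Targets=[
        {
            "Id": "restart-processing-job",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:restart-processing-job",  # placeholder
        }
    ],
)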

Monitoring CloudWatch metrics

The Lambda function also sets a CloudWatch alarm based on a metric math expression on the processing job, which monitors the metrics of all the instances for CPU utilization, memory utilization, and disk utilization. This type of alarm (metric) uses CloudWatch alarm thresholds. The alarm generates events based on the value of the metric or expression relative to the thresholds over a number of time periods.

In Vericast’s use case, the threshold expression is designed to consider the driver and the executor instances as separate, with metrics monitored individually for each. By having them separate, Vericast knows which is causing the alarm. This is important to decide how to scale accordingly:

  • If the executor metrics are passing the threshold, it’s good to scale horizontally
  • If the driver metrics cross the threshold, scaling horizontally will probably not help, so we must scale vertically

Alarm metrics expression

Vericast can access the following metrics in its evaluation for scaling and failure:

  • CPUUtilization – The sum of each individual CPU core’s utilization
  • MemoryUtilization – The percentage of memory that is used by the containers on an instance
  • DiskUtilization – The percentage of disk space used by the containers on an instance
  • GPUUtilization – The percentage of GPU units that are used by the containers on an instance
  • GPUMemoryUtilization – The percentage of GPU memory used by the containers on an instance

As of this writing, Vericast only considers CPUUtilization, MemoryUtilization, and DiskUtilization. In the future, they intend to consider GPUUtilization and GPUMemoryUtilization as well.

The following code is an example of a CloudWatch alarm based on a metric math expression for Vericast auto scaling:

(IF((cpuDriver) > 80, 1, 0) 
 OR IF((memoryDriver) > 80, 1, 0) 
 OR IF((diskDriver) > 80, 1, 0)) 
 OR (IF(AVG(METRICS("cpuExec")) > 80, 1, 0) 
 OR IF(AVG(METRICS("memoryExec")) > 80, 1, 0) 
 OR IF(AVG(METRICS("diskExec")) > 80, 1, 0))

This expression illustrates that the CloudWatch alarm is considering DriverMemoryUtilization (memoryDriver), CPUUtilization (cpuDriver), DiskUtilization (diskDriver), ExecutorMemoryUtilization (memoryExec), CPUUtilization (cpuExec), and DiskUtilization (diskExec) as monitoring metrics. The number 80 in the preceding expression stands for the threshold value.

Here, IF((cpuDriver) > 80, 1, 0) means that if the driver CPU utilization goes beyond 80%, the expression evaluates to 1; otherwise, it evaluates to 0. IF(AVG(METRICS("memoryExec")) > 80, 1, 0) means that all the metrics containing the string memoryExec are considered and averaged. If that average memory utilization percentage goes beyond 80, the expression evaluates to 1; otherwise, it evaluates to 0.

The logical operator OR is used in the expression to unify all the utilizations: if any of the utilizations reaches its threshold, the alarm is triggered.

For more information on using CloudWatch metric alarms based on metric math expressions, refer to Creating a CloudWatch alarm based on a metric math expression.
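
To illustrate the pattern, the following boto3 sketch creates one such alarm from a metric math expression; the namespace, dimension, and SNS topic shown are assumptions, and the real expression would include the memory, disk, and executor metrics described above.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="processing-job-driver-utilization",  # hypothetical alarm name
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    Threshold=1.0,
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:scaling-notifications"],  # placeholder topic
    Metrics=[
        {
            "Id": "cpu_driver",
            "MetricStat": {
                "Metric": {
                    "Namespace": "/aws/sagemaker/ProcessingJobs",  # assumed namespace for job metrics
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "Host", "Value": "feature-engineering-run-001/algo-1"}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "breach",
            "Expression": "IF(cpu_driver > 80, 1, 0)",  # extend with memory, disk, and executor terms
            "ReturnData": True,
        },
    ],
)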

CloudWatch alarm limitations

CloudWatch limits the number of metrics per alarm to 10. This can cause limitations if you need to consider more metrics than this.

To overcome this limitation, Vericast has set alarms based on the overall cluster size. One alarm is created per three instances (for three instances, there will be one alarm because that adds up to nine metrics). Assuming the driver instance is to be considered separately, another alarm is created for the driver instance. Therefore, the total number of alarms created is roughly equivalent to one third the number of executor nodes, plus one for the driver instance. In each case, the number of metrics per alarm is under the 10-metric limit.

What happens when in an alarm state

If a predetermined threshold is met, the alarm goes to an alarm state, which uses Amazon Simple Notification Service (Amazon SNS) to send out notifications. In this case, it sends out an email notification to all subscribers with the details about the alarm in the message.

Amazon SNS is also used as a trigger to a Lambda function that stops the currently running SageMaker Processing job because we know that the job will probably fail. This function also records logs to the log table related to the event.

The EventBridge rule set up at job start will notice that the job has gone into a stopping state a few seconds later. This rule then reruns the first Lambda function to restart the job.

The dynamic scaling process

After running two or more times, the first Lambda function knows that a previous job had already started and has now stopped. The function goes through a similar process of getting the base configuration of the original job from the log DynamoDB table and also retrieves an updated configuration from the internal table. This updated configuration is a resources delta configuration that is set based on the scaling type. The scaling type is determined from the alarm metadata as described earlier.

The original configuration plus the resources delta form the new configuration, and a new SageMaker Processing job is started with the increased resources.

This process continues until the job completes successfully and can result in multiple restarts as needed, adding more resources each time.

Vericast’s outcome

This custom auto scaling solution has been instrumental in making Vericast’s Machine Learning Platform more robust and fault tolerant. The platform can now gracefully handle workloads of different data volumes with minimal human intervention.

Before implementing this solution, estimating the resource requirements for all the Spark-based modules in the pipeline was one of the biggest bottlenecks of the new client onboarding process. Workflows would fail if the client data volume increased, or the cost would be unjustifiable if the data volume decreased in production.

With this new module in place, workflow failures due to resource constraints have been reduced by almost 80%. The few remaining failures are mostly due to AWS account constraints and beyond the auto scale process. Vericast’s biggest win with this solution is the ease with which they can onboard new clients and workflows. Vericast expects to speed up the process by at least 60–70%, with data still to be gathered for a final number.

Though this is viewed as a success by Vericast, there is a cost that comes with it. Based on the nature of this module and the concept of dynamic scaling as a whole, the workflows tend to take around 30% longer (average case) than a workflow with a custom-tuned cluster for each module in the workflow. Vericast continues to optimize in this area, looking to improve the solution by incorporating heuristics-based resource initialization for each client module.

Sharmo Sarkar, Senior Manager, Machine Learning Platform at Vericast, says, “As we continue to expand our use of AWS and SageMaker, I wanted to take a moment to highlight the incredible work of our AWS Client Services Team, dedicated AWS Solutions Architects, and AWS Professional Services that we work with. Their deep understanding of AWS and SageMaker allowed us to design a solution that met all of our needs and provided us with the flexibility and scalability we required. We are so grateful to have such a talented and knowledgeable support team on our side.”

Conclusion

In this post, we shared how SageMaker and SageMaker Processing have enabled Vericast to build a managed, performant, and cost-effective data processing framework for large data volumes. By combining the power and flexibility of SageMaker Processing with other AWS services, they can easily monitor the generalized feature engineering process. They can automatically detect potential issues generated from lack of compute, memory, and other factors, and automatically implement vertical and horizontal scaling as needed.

SageMaker and its tools can help your team meet its ML goals as well. To learn more about SageMaker Processing and how it can assist in your data processing workloads, refer to Process Data. If you’re just getting started with ML and are looking for examples and guidance, Amazon SageMaker JumpStart can get you started. JumpStart is an ML hub from which you can access built-in algorithms with pre-trained foundation models to help you perform tasks such as article summarization and image generation and pre-built solutions to solve common use cases.

Finally, if this post helps you or inspires you to solve a problem, we would love to hear about it! Please share your comments and feedback.


About the Authors

Anthony McClure is a Senior Partner Solutions Architect with the AWS SaaS Factory team. Anthony also has a strong interest in machine learning and artificial intelligence working with the AWS ML/AI Technical Field Community to assist customers in bringing their machine learning solutions to reality.

Jyoti Sharma is a Data Science Engineer with the machine learning platform team at Vericast. She is passionate about all aspects of data science and focused on designing and implementing a highly scalable and distributed Machine Learning Platform.

Sharmo Sarkar is a Senior Manager at Vericast. He leads the Cloud Machine Learning Platform and the Marketing Platform ML R&D Teams at Vericast. He has extensive experience in Big Data Analytics, Distributed Computing, and Natural Language Processing. Outside work, he enjoys motorcycling, hiking, and biking on mountain trails.

Read More

Hosting ML Models on Amazon SageMaker using Triton: XGBoost, LightGBM, and Treelite Models

Hosting ML Models on Amazon SageMaker using Triton: XGBoost, LightGBM, and Treelite Models

One of the most popular models available today is XGBoost. With the ability to solve various problems such as classification and regression, XGBoost has become a popular option that also falls into the category of tree-based models. In this post, we dive deep to see how Amazon SageMaker can serve these models using NVIDIA Triton Inference Server. Real-time inference workloads can have varying levels of requirements and service level agreements (SLAs) in terms of latency and throughput, and can be met using SageMaker real-time endpoints.

SageMaker provides single model endpoints, which allow you to deploy a single machine learning (ML) model against a logical endpoint. For other use cases, you can choose to manage cost and performance using multi-model endpoints, which allow you to specify multiple models to host behind a logical endpoint. Regardless of the option you choose, SageMaker endpoints provide a scalable mechanism for even the most demanding enterprise customers while providing value in a plethora of features, including shadow variants, auto scaling, and native integration with Amazon CloudWatch (for more information, refer to CloudWatch Metrics for Multi-Model Endpoint Deployments).

Triton supports various backends as engines to support the running and serving of various ML models for inference. For any Triton deployment, it’s crucial to know how the backend behavior impacts your workloads and what to expect so that you can be successful. In this post, we help you understand the Forest Inference Library (FIL) backend, which is supported by Triton on SageMaker, so that you can make an informed decision for your workloads and get the best performance and cost optimization possible.

Deep dive into the FIL backend

Triton supports the FIL backend to serve tree models, such as XGBoost, LightGBM, scikit-learn Random Forest, RAPIDS cuML Random Forest, and any other model supported by Treelite. These models have long been used for solving problems such as classification or regression. Although these types of models have traditionally run on CPUs, the popularity of these models and inference demands have led to various techniques to increase inference performance. The FIL backend utilizes many of these techniques by using cuML constructs and is built on C++ and the CUDA core library to optimize inference performance on GPU accelerators.

The FIL backend uses cuML’s libraries to use CPU or GPU cores to accelerate inference. In order to use these processors, data is referenced from host memory (for example, NumPy arrays) or GPU arrays (cuDF, Numba, CuPy, or any library that supports the __cuda_array_interface__ API). After the data is staged in memory, the FIL backend can run processing across all the available CPU or GPU cores.

The FIL backend threads can communicate with each other without utilizing shared memory of the host, but in ensemble workloads, host memory should be considered. The following diagram shows an ensemble scheduler runtime architecture where you have the ability to fine-tune the memory areas, including CPU addressable shared memory that is used for inter-process communication between Triton (C++) and the Python process (Python backend) for exchanging tensors (input/output) with the FIL backend.

Triton Inference Server provides configurable options for developers to tune their workloads and optimize model performance. The configuration dynamic_batching allows Triton to hold client-side requests and batch them on the server side in order to efficiently use FIL’s parallel computation to inference the entire batch together. The option max_queue_delay_microseconds offers a fail-safe control of how long Triton waits to form a batch.

There are a number of other FIL-specific options available that impact performance and behavior. We suggest starting with storage_type. When running the backend on GPU, FIL creates a new memory/data structure that is a representation of the trees, which can impact performance and memory footprint. This is configurable via the storage_type configuration parameter, which has the options dense, sparse, and auto. Choosing the dense option will consume more GPU memory and doesn’t always result in better performance, so it’s best to check. In contrast, the sparse option will consume less GPU memory and can possibly perform as well or better than dense. Choosing auto will cause the model to default to dense unless doing so will consume significantly more GPU memory than sparse.

When it comes to model performance, you might consider emphasizing the threads_per_tree option. One thing that you may observe in real-world scenarios is that threads_per_tree can have a bigger impact on throughput than any other parameter. Setting it to any power of 2 from 1–32 is legitimate. The optimal value is hard to predict for this parameter, but when the server is expected to deal with higher load or process larger batch sizes, it tends to benefit from a larger value than when it’s processing a few rows at a time.

Another parameter to be aware of is algo, which is also available if you’re running on GPU. This parameter determines the algorithm that’s used to process the inference requests. The options supported for this are ALGO_AUTO, NAIVE, TREE_REORG, and BATCH_TREE_REORG. These options determine how nodes within a tree are organized and can also result in performance gains. The ALGO_AUTO option defaults to NAIVE for sparse storage and BATCH_TREE_REORG for dense storage.

Lastly, FIL comes with a Shapley explainer, which can be activated by using the treeshap_output parameter. However, you should keep in mind that Shapley outputs hurt performance due to their size.
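
To make these options concrete, the following sketch shows how such FIL-specific entries might appear in the parameters list of a model’s config.pbtxt, expressed as a Python string in the same style used later in this post; the values are starting points to benchmark, not recommendations.

# Hypothetical FIL tuning entries for the parameters list of config.pbtxt;
# benchmark these values against your own workload before adopting them.
fil_tuning_parameters = """
  {
    key: "storage_type"
    value: { string_value: "SPARSE" }
  },
  {
    key: "threads_per_tree"
    value: { string_value: "4" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "treeshap_output"
    value: { string_value: "false" }
  }
"""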

Model format

There is currently no standard file format to store forest-based models; every framework tends to define its own format. In order to support multiple input file formats, FIL imports data using the open-source Treelite library. This enables FIL to support models trained in popular frameworks, such as XGBoost and LightGBM. Note that the format of the model that you’re providing must be set in the model_type configuration value specified in the config.pbtxt file.

Config.pbtxt

Each model in a model repository must include a model configuration that provides the required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. To learn more about the config settings, refer to Model Configuration. The following are some of the model configuration parameters:

  • max_batch_size – This determines the maximum batch size that can be passed to this model. In general, the only limit on the size of batches passed to a FIL backend is the memory available with which to process them. For GPU runs, the available memory is determined by the size of Triton’s CUDA memory pool, which can be set via a command line argument when starting the server.
  • input – Options in this section tell Triton the number of features to expect for each input sample.
  • output – Options in this section tell Triton how many output values there will be for each sample. If the predict_proba option is set to true, then a probability value will be returned for each class. Otherwise, a single value will be returned, indicating the class predicted for the given sample.
  • instance_group – This determines how many instances of this model will be created and whether they will use GPU or CPU.
  • model_type – This string indicates what format the model is in (xgboost_json in this example, but xgboost, lightgbm, and tl_checkpoint are valid formats as well).
  • predict_proba – If set to true, probability values will be returned for each class rather than just a class prediction.
  • output_class – This is set to true for classification models and false for regression models.
  • threshold – This is a score threshold for determining classification. When output_class is set to true, this must be provided, although it won’t be used if predict_proba is also set to true.
  • storage_type – In general, using AUTO for this setting should meet most use cases. If AUTO storage is selected, FIL will load the model using either a sparse or dense representation based on the approximate size of the model. In some cases, you may want to explicitly set this to SPARSE in order to reduce the memory footprint of large models.

Triton Inference Server on SageMaker

SageMaker allows you to deploy both single model and multi-model endpoints with NVIDIA Triton Inference Server. The following figure shows the Triton Inference Server high-level architecture. The model repository is a file system-based repository of the models that Triton will make available for inferencing. Inference requests arrive at the server and are routed to the appropriate per-model scheduler. Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis. Each model’s scheduler optionally performs batching of inference requests and then passes the requests to the backend corresponding to the model type. The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs. The outputs are then returned.

When configuring your auto scaling groups for SageMaker endpoints, you may want to consider SageMakerVariantInvocationsPerInstance as the primary criteria to determine the scaling characteristics of your auto scaling group. In addition, depending on whether your models are running on GPU or CPU, you may also consider using CPUUtilization or GPUUtilization as additional criteria. Note that for single model endpoints, because the models deployed are all the same, it’s fairly straightforward to set proper policies to meet your SLAs. For multi-model endpoints, we recommend deploying similar models behind a given endpoint to have more steady predictable performance. In use cases where models of varying sizes and requirements are used, you may want to separate those workloads across multiple multi-model endpoints or spend some time fine-tuning your auto scaling group policy to obtain the best cost and performance balance.
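
As an illustration of the target tracking approach, the following boto3 sketch registers an endpoint variant for auto scaling on SageMakerVariantInvocationsPerInstance; the endpoint name, variant name, capacity limits, and target value are assumptions you would tune for your own SLAs.

import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "endpoint/my-triton-endpoint/variant/AllTraffic"  # placeholder endpoint and variant

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="triton-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # assumed invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)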

For a list of NVIDIA Triton Deep Learning Containers (DLCs) supported by SageMaker inference, refer to Available Deep Learning Containers Images.

SageMaker notebook walkthrough

ML applications are complex and can often require data preprocessing. In this notebook, we dive into how to deploy a tree-based ML model like XGBoost using the FIL backend in Triton on a SageMaker multi-model endpoint. We also cover how to implement a Python-based data preprocessing inference pipeline for your model using the ensemble feature in Triton. This will allow us to send in the raw data from the client side and have both data preprocessing and model inference happen in a Triton SageMaker endpoint for optimal inference performance.

Triton model ensemble feature

Triton Inference Server greatly simplifies the deployment of AI models at scale in production. Triton Inference Server comes with a convenient solution that simplifies building preprocessing and postprocessing pipelines. The Triton Inference Server platform provides the ensemble scheduler, which is responsible for pipelining models participating in the inference process while ensuring efficiency and optimizing throughput. Using ensemble models can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton.

In this notebook, we show how to use the ensemble feature for building a pipeline of data preprocessing with XGBoost model inference, and you can extrapolate from it to add custom postprocessing to the pipeline.

Set up the environment

We begin by setting up the required environment. We install the dependencies required to package our model pipeline and run inferences using Triton Inference Server. We also define the AWS Identity and Access Management (IAM) role that will give SageMaker access to the model artifacts and the NVIDIA Triton Amazon Elastic Container Registry (Amazon ECR) image. See the following code:

import boto3
import sagemaker
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import subprocess
sess = boto3.Session()
sm = sess.client("sagemaker")
##NOTE :Replace with your S3 bucket name
default_bucket="" 
sagemaker_session = sagemaker.Session(default_bucket=default_bucket)

##NOTE : Make sure to have SageMakerFullAccess permission to the below IAM Role
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
s3_bucket = sagemaker_session.default_bucket()

##NOTE : Latest SageMaker DLCs can be found here, please change region and account ids accordingly - https://github.com/aws/deep-learning-containers/blob/master/available_images.md

# account_id_map, region, and base are defined earlier in the notebook (not shown here)
triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.02-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)

Create a Conda environment for preprocessing dependencies

The Python backend in Triton requires us to use a Conda environment for any additional dependencies. In this case, we use the Python backend to preprocess the raw data before feeding it into the XGBoost model that is running in the FIL backend. Even though we originally used RAPIDS cuDF and cuML to do the data preprocessing, here we use Pandas and scikit-learn as preprocessing dependencies during inference. We do this for three reasons:

  • We show how to create a Conda environment for your dependencies and how to package it in the format expected by Triton’s Python backend.
  • By showing the preprocessing model running in the Python backend on the CPU while the XGBoost runs on the GPU in the FIL backend, we illustrate how each model in Triton’s ensemble pipeline can run on a different framework backend as well as different hardware configurations.
  • It highlights how the RAPIDS libraries (cuDF, cuML) are compatible with their CPU counterparts (Pandas, scikit-learn). For example, we can show how LabelEncoders created in cuML can be used in scikit-learn and vice versa.

We follow the instructions from the Triton documentation for packaging preprocessing dependencies (scikit-learn and Pandas) to be used in the Python backend as a Conda environment TAR file. The bash script create_prep_env.sh creates the Conda environment TAR file, then we move it into the preprocessing model directory. See the following code:

#!/bin/bash

conda create -y -n preprocessing_env python=3.8
source /opt/conda/etc/profile.d/conda.sh
conda activate preprocessing_env
export PYTHONNOUSERSITE=True
conda install -y -c conda-forge pandas scikit-learn
pip install conda-pack
conda-pack

After we run the preceding script, it generates preprocessing_env.tar.gz, which we copy to the preprocessing directory:

!cp preprocessing_env.tar.gz model_cpu_repository/preprocessing/
!cp preprocessing_env.tar.gz model_gpu_repository/preprocessinggpu/

Set up preprocessing with the Triton Python backend

For preprocessing, we use Triton’s Python backend to perform tabular data preprocessing (categorical encoding) during inference for raw data requests coming into the server. For more information about the preprocessing that was done during training, refer to the training notebook.

The Python backend enables preprocessing, postprocessing, and any other custom logic to be implemented in Python and served with Triton. Using Triton on SageMaker requires us to first set up a model repository folder containing the models we want to serve. We have already set up a model for Python data preprocessing called preprocessing in cpu_model_repository and gpu_model_repository.

Triton has specific requirements for model repository layout. Within the top-level model repository directory, each model has its own subdirectory containing the information for the corresponding model. Each model directory in Triton must have at least one numeric subdirectory representing a version of the model. The value 1 represents version 1 of our Python preprocessing model. Each model is run by a specific backend, so within each version subdirectory there must be the model artifact required by that backend. For this example, we use the Python backend, which requires the Python file you’re serving to be called model.py, and the file needs to implement certain functions. If we were using a PyTorch backend, a model.pt file would be required, and so on. For more details on naming conventions for model files, refer to Model Files.

The model.py Python file we use here implements all the tabular data preprocessing logic to convert raw data into features that can be fed into our XGBoost model.
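
For orientation, the following is a condensed skeleton of what such a model.py looks like for Triton’s Python backend; the actual file in the notebook implements the full categorical encoding, whereas the feature computation here is only a placeholder.

import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Parse the model config to learn the expected output data type
        self.model_config = json.loads(args["model_config"])
        output_config = pb_utils.get_output_config_by_name(self.model_config, "OUTPUT")
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read one raw input column; the real model.py reads and encodes all of them
            user = pb_utils.get_input_tensor_by_name(request, "User").as_numpy()

            # Placeholder for the actual preprocessing that produces 15 features per row
            features = np.zeros((user.shape[0], 15), dtype=self.output_dtype)

            out_tensor = pb_utils.Tensor("OUTPUT", features)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses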

Every Triton model must also provide a config.pbtxt file describing the model configuration. To learn more about the config settings, refer to Model Configuration. Our config.pbtxt file specifies the backend as python and all the input columns for raw data along with preprocessed output, which consists of 15 features. We also specify we want to run this Python preprocessing model on the CPU. See the following code:

name: "preprocessing"
backend: "python"
max_batch_size: 882352
input [
    {
        name: "User"
        data_type: TYPE_FP32
        dims: [ 1 ]
    },
    {
        name: "Card"
        data_type: TYPE_FP32
        dims: [ 1 ]
    },
    {
        name: "Year"
        data_type: TYPE_FP32
        dims: [ 1 ]
    },
    {
        name: "Month"
        data_type: TYPE_FP32
        dims: [ 1 ]
    },
    {
        name: "Day"
        data_type: TYPE_FP32
        dims: [ 1 ]
    },
    {
        name: "Time"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Amount"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Use Chip"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Merchant Name"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Merchant City"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Merchant State"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Zip"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "MCC"
        data_type: TYPE_STRING
        dims: [ 1 ]
    },
    {
        name: "Errors?"
        data_type: TYPE_STRING
        dims: [ 1 ]
    }
    
]
output [
    {
        name: "OUTPUT"
        data_type: TYPE_FP32
        dims: [ 15 ]
    }
]

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/preprocessing_env.tar.gz"}
}

Set up a tree-based ML model for the FIL backend

Next, we set up the model directory for a tree-based ML model like XGBoost, which will be using the FIL backend.

The expected layout for model_cpu_repository and model_gpu_repository is similar to the one we showed earlier.

Here, fil is the name of the model. We can give it a different name, like xgboost, if we want to. 1 is the version subdirectory, which contains the model artifact. In this case, it’s the xgboost.json model that we saved. Let’s create this expected layout:

# move saved xgboost model into fil model directory
!mkdir -p model_cpu_repository/fil/1
!cp xgboost.json model_cpu_repository/fil/1/
!cp xgboost.json model_gpu_repository/filgpu/1/

We need to have the configuration file config.pbtxt describing the model configuration for the tree-based ML model, so that the FIL backend in Triton can understand how to serve it. For more information, refer to the latest generic Triton configuration options and the configuration options specific to the FIL backend. We focus on just a few of the most common and relevant options in this example.

Create config.pbtxt for model_cpu_repository:

import os

USE_GPU = False
FIL_MODEL_DIR = "./model_cpu_repository/fil"

# Maximum size in bytes for input and output arrays. If you are
# using Triton 21.11 or higher, all memory allocations will make
# use of Triton's memory pool, which has a default size of
# 67_108_864 bytes
MAX_MEMORY_BYTES = 60_000_000
NUM_FEATURES = 15
NUM_CLASSES = 2
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

IS_CLASSIFIER = True
model_format = "xgboost_json"

# Select deployment hardware (GPU or CPU)
if USE_GPU:
    instance_kind = "KIND_GPU"
else:
    instance_kind = "KIND_CPU"

# whether the model is doing classification or regression
if IS_CLASSIFIER:
    classifier_string = "true"
else:
    classifier_string = "false"

# whether to predict probabilities or not
predict_proba = False

if predict_proba:
    predict_proba_string = "true"
else:
    predict_proba_string = "false"

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [                                 
 {{  
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {NUM_FEATURES} ]                    
  }} 
]
output [
 {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_format}" }}
  }},
  {{
    key: "predict_proba"
    value: {{ string_value: "{predict_proba_string}" }}
  }},
  {{
    key: "output_class"
    value: {{ string_value: "{classifier_string}" }}
  }},
  {{
    key: "threshold"
    value: {{ string_value: "0.5" }}
  }},
  {{
    key: "storage_type"
    value: {{ string_value: "AUTO" }}
  }}
]

dynamic_batching {{}}"""

config_path = os.path.join(FIL_MODEL_DIR, "config.pbtxt")
with open(config_path, "w") as file_:
    file_.write(config_text)

Similarly, set up config.pbtxt for model_gpu_repository (note the difference is USE_GPU = True):

USE_GPU = True
FIL_MODEL_DIR = "./model_gpu_repository/filgpu"

# Maximum size in bytes for input and output arrays. If you are
# using Triton 21.11 or higher, all memory allocations will make
# use of Triton's memory pool, which has a default size of
# 67_108_864 bytes
MAX_MEMORY_BYTES = 60_000_000
NUM_FEATURES = 15
NUM_CLASSES = 2
bytes_per_sample = (NUM_FEATURES + NUM_CLASSES) * 4
max_batch_size = MAX_MEMORY_BYTES // bytes_per_sample

IS_CLASSIFIER = True
model_format = "xgboost_json"

# Select deployment hardware (GPU or CPU)
if USE_GPU:
    instance_kind = "KIND_GPU"
else:
    instance_kind = "KIND_CPU"

# whether the model is doing classification or regression
if IS_CLASSIFIER:
    classifier_string = "true"
else:
    classifier_string = "false"

# whether to predict probabilities or not
predict_proba = False

if predict_proba:
    predict_proba_string = "true"
else:
    predict_proba_string = "false"

config_text = f"""backend: "fil"
max_batch_size: {max_batch_size}
input [                                 
 {{  
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ {NUM_FEATURES} ]                    
  }} 
]
output [
 {{
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }}
]
instance_group [{{ kind: {instance_kind} }}]
parameters [
  {{
    key: "model_type"
    value: {{ string_value: "{model_format}" }}
  }},
  {{
    key: "predict_proba"
    value: {{ string_value: "{predict_proba_string}" }}
  }},
  {{
    key: "output_class"
    value: {{ string_value: "{classifier_string}" }}
  }},
  {{
    key: "threshold"
    value: {{ string_value: "0.5" }}
  }},
  {{
    key: "storage_type"
    value: {{ string_value: "AUTO" }}
  }}
]

dynamic_batching {{}}"""

config_path = os.path.join(FIL_MODEL_DIR, "config.pbtxt")
with open(config_path, "w") as file_:
    file_.write(config_text)

Set up an inference pipeline of the data preprocessing Python backend and FIL backend using ensembles

Now we’re ready to set up the inference pipeline for data preprocessing and tree-based model inference using an ensemble model. An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. Here we use the ensemble model to build a pipeline of data preprocessing in the Python backend followed by XGBoost in the FIL backend.

The expected layout for the ensemble model directory is similar to the ones we showed previously:

# create model version directory for ensemble CPU model
!mkdir -p model_cpu_repository/ensemble/1
# create model version directory for ensemble GPU model
!mkdir -p model_gpu_repository/ensemble/1

We created the ensemble model’s config.pbtxt following the guidance in Ensemble Models. Importantly, we need to set up the ensemble scheduler in config.pbtxt, which specifies the data flow between models within the ensemble. The ensemble scheduler collects the output tensors in each step, and provides them as input tensors for other steps according to the specification.
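
The following is a condensed sketch of that ensemble config.pbtxt, generated as a Python string in the same style as the earlier configuration code; the real file declares all 14 raw input columns (only two are shown here for brevity) and maps them into the preprocessing step.

ensemble_config = """name: "ensemble"
platform: "ensemble"
max_batch_size: 882352
input [
  { name: "User" data_type: TYPE_FP32 dims: [ 1 ] },
  { name: "Time" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
  { name: "output__0" data_type: TYPE_FP32 dims: [ 1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "User" value: "User" }
      input_map { key: "Time" value: "Time" }
      output_map { key: "OUTPUT" value: "preprocessed_data" }
    },
    {
      model_name: "fil"
      model_version: -1
      input_map { key: "input__0" value: "preprocessed_data" }
      output_map { key: "output__0" value: "output__0" }
    }
  ]
}"""

with open("model_cpu_repository/ensemble/config.pbtxt", "w") as f:
    f.write(ensemble_config)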

Package the model repository and upload to Amazon S3

Finally, we end up with the following model repository directory structure, containing a Python preprocessing model and its dependencies along with the XGBoost FIL model and the model ensemble.

We package the directory and its contents as model.tar.gz for uploading to Amazon Simple Storage Service (Amazon S3). We have two options in this example: using a CPU-based instance or a GPU-based instance. A GPU-based instance is more suitable when you need higher processing power and want to use CUDA cores.

Create and upload the model package for a CPU-based instance (optimized for CPU) with the following code:

!tar --exclude='.ipynb_checkpoints' -czvf model-cpu.tar.gz -C model_cpu_repository .

model_uri_cpu = sagemaker_session.upload_data(
    path="model-cpu.tar.gz", key_prefix="triton-fil-mme-ensemble"
)

Create and upload the model package for a GPU-based instance (optimized for GPU) with the following code:

!tar --exclude='.ipynb_checkpoints' -czvf model-gpu.tar.gz -C model_gpu_repository .

model_uri_gpu = sagemaker_session.upload_data(
    path="model-gpu.tar.gz", key_prefix="triton-fil-mme-ensemble"
)

Create a SageMaker endpoint

We now have the model artifacts stored in an S3 bucket. In this step, we can also provide the additional environment variable SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to Amazon S3. This variable is optional in the case of a single model. In the case of ensemble models, this key has to be specified for Triton to start up in SageMaker.

Additionally, you can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT for optimizing the thread counts.

import time

# Set the primary path for where all the models are stored on the S3 bucket
model_location = f"s3://{s3_bucket}/triton-fil-mme-ensemble/"
# user_profile is assumed to be defined earlier in the notebook
sm_model_name = f"{user_profile}" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_location,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble",
        # "SAGEMAKER_TRITON_LOG_VERBOSE": "true",
        # "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "20000000",
        # "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "1048576",
    },
}

create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

We use the preceding model to create an endpoint configuration, where we can specify the type and number of instances we want in the endpoint:

endpoint_config_name = f"{user_profile}" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

We use this endpoint configuration to create a SageMaker endpoint and wait for the deployment to finish. With SageMaker MMEs, we have the option to host multiple ensemble models by repeating this process, but we stick with one deployment for this example:

endpoint_name = f"{studio_user_profile_output}-lab1-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

The status will change to InService when the deployment is successful.
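
A small sketch using the boto3 endpoint_in_service waiter can block until that happens before you send any traffic:

# Wait for the endpoint to reach InService before invoking it
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])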

Invoke your model hosted on the SageMaker endpoint

After the endpoint is running, we can use some sample raw data to perform inference using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols. See the following code:

import json

data_infer = pd.read_csv("data_infer.csv")
STR_COLUMNS = [
    "Time",
    "Amount",
    "Zip",
    "MCC",
    "Merchant Name",
    "Use Chip",
    "Merchant City",
    "Merchant State",
    "Errors?",
]

batch_size = len(data_infer)

payload = {}
payload["inputs"] = []
data_dict = {}
for col_name in data_infer.columns:
    data_dict[col_name] = {}
    data_dict[col_name]["name"] = col_name
    if col_name in STR_COLUMNS:
        data_dict[col_name]["data"] = data_infer[col_name].astype(str).tolist()
        data_dict[col_name]["datatype"] = "BYTES"
    else:
        data_dict[col_name]["data"] = data_infer[col_name].astype("float32").tolist()
        data_dict[col_name]["datatype"] = "FP32"
    data_dict[col_name]["shape"] = [batch_size, 1]
    payload["inputs"].append(data_dict[col_name])
#Invoke the endpoint
# Change the TargetModel to either CPU or GPU
response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="model-cpu.tar.gz",
)

#Read the results
response_body = json.loads(response["Body"].read().decode("utf8"))
predictions = response_body["outputs"][0]["data"]

CLASS_LABELS = ["NOT FRAUD", "FRAUD"]
predictions = [CLASS_LABELS[int(idx)] for idx in predictions]
print(predictions)

The notebook referred to in this post can be found in the GitHub repository.

Best practices

In addition to the FIL backend tuning options we mentioned earlier, data scientists can also ensure that the input data for the backend is optimized for processing by the engine. Whenever possible, provide input data to the GPU array in row-major format. Other formats require internal conversion, which takes up cycles and decreases performance.

Due to the way FIL data structures are maintained in GPU memory, be mindful of the tree depth. The deeper the tree depth, the larger your GPU memory footprint will be.

Use the instance_group_count parameter to add worker processes and increase the throughput of the FIL backend, which will result in larger CPU and GPU memory consumption. In addition, consider SageMaker-specific variables that are available to increase the throughput, such as HTTP threads, HTTP buffer size, batch size, and max delay.

Conclusion

In this post, we dove deep into the FIL backend that Triton Inference Server supports on SageMaker. This backend provides both CPU and GPU acceleration of tree-based models, such as the popular XGBoost algorithm. There are many options to consider to get the best performance for inference, such as batch sizes, data input formats, and other factors that can be tuned to meet your needs. SageMaker allows you to use this capability with single and multi-model endpoints to balance performance and cost savings.

We encourage you to take the information in this post and see if SageMaker can meet your hosting needs to serve tree-based models, meeting your requirements for cost reduction and workload performance.

The notebook referenced in this post can be found in the SageMaker examples GitHub repository. Furthermore, you can find the latest documentation on the FIL backend on GitHub.


About the Authors

Raghu Ramesha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including NLP and computer vision domains. He helps customers achieve high-performance model inference on Amazon SageMaker.

Jiahong Liu is a Solution Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking and wildlife watching.


Bring your own ML model into Amazon SageMaker Canvas and generate accurate predictions

Machine learning (ML) helps organizations generate revenue, reduce costs, mitigate risk, drive efficiencies, and improve quality by optimizing core business functions across multiple business units such as marketing, manufacturing, operations, sales, finance, and customer service. With AWS ML, organizations can accelerate value creation from months to days. Amazon SageMaker Canvas is a visual, point-and-click service that allows business analysts to generate accurate ML predictions without writing a single line of code or requiring ML expertise. You can use models to make predictions interactively and for batch scoring on bulk datasets.

In this post, we showcase architectural patterns on how business teams can use ML models built anywhere by generating predictions in Canvas and achieve effective business outcomes.

This integration of model development and sharing creates a tighter collaboration between business and data science teams and lowers time to value. Business teams can use existing models built by their data scientists or other departments to solve a business problem instead of rebuilding new models in outside environments.

Finally, business analysts can import shared models into Canvas and generate predictions before deploying to production with just a few clicks.

Solution overview

The following figure describes three different architecture patterns to demonstrate how data scientists can share models with business analysts, who can then directly generate predictions from those models in the visual interface of Canvas:

Prerequisites

To train and build your model using SageMaker and bring your model into Canvas, complete the following prerequisites:

  1. If you don’t already have a SageMaker domain and Studio user, set up and onboard a Studio user to a SageMaker domain.
  2. Enable and set up Canvas base permissions for your users and grant users permissions to collaborate with Studio.
  3. You must have a trained model from Autopilot, JumpStart, or the model registry. For any model that you’ve built outside of SageMaker, you must register your model in the model registry before importing it into Canvas.

Now let’s assume the role of a data scientist who is looking to train, build, deploy, and share ML models with a business analyst for each of these three architectural patterns.

Use Autopilot and Canvas

Autopilot automates key tasks of an automatic ML (AutoML) process like exploring data, selecting the relevant algorithm for the problem type, and then training and tuning it. All of this can be achieved while allowing you to maintain full control and visibility on the dataset. Autopilot automatically explores different solutions to find the best model, and users can either iterate on the ML model or directly deploy the model to production with one click.

In this example, we use a customer churn synthetic dataset from the telecom domain and are tasked with identifying customers that are potentially at risk of churning. Complete the following steps to use Autopilot AutoML to build, train, deploy, and share an ML model with a business analyst:

  1. Download the dataset, upload it to an Amazon Simple Storage Service (Amazon S3) bucket, and make a note of the S3 URI.
  2. On the Studio console, choose AutoML in the navigation pane.
  3. Choose Create AutoML experiment.
  4. Specify the experiment name (for this post, Telecom-Customer-Churn-AutoPilot), S3 data input, and output location.
  5. Set the target column as churn.
  6. In the deployment settings, you can enable the auto deploy option to create an endpoint that deploys your best model and runs inference on the endpoint.

For more information, refer to Create an Amazon SageMaker Autopilot experiment.

  7. Choose your experiment, then select your best model and choose Share model.
  8. Add a Canvas user and choose Share to share the model.

(Note: You can’t share a model with the same Canvas user that you use for your Studio login. For example, Studio user-A can’t share a model with Canvas user-A, but can share one with Canvas user-B, so choose a different user for model sharing.)

For more information, refer to Studio users: Share a model to SageMaker Canvas.
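If you prefer to launch an Autopilot experiment programmatically instead of through the Studio console, the following is a minimal boto3 sketch under the same churn setup; the S3 locations and IAM role ARN are placeholders that you would replace with your own values:

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_auto_ml_job(
    AutoMLJobName="Telecom-Customer-Churn-AutoPilot",
    InputDataConfig=[
        {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<your-bucket>/churn/train/",  # placeholder training data location
                }
            },
            "TargetAttributeName": "churn",
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://<your-bucket>/churn/output/"},  # placeholder output location
    ProblemType="BinaryClassification",
    AutoMLJobObjective={"MetricName": "F1"},
    RoleArn="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",  # placeholder execution role
)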

Use JumpStart and Canvas

JumpStart is an ML hub that provides pre-trained, open-source models for a wide range of ML use cases like fraud detection, credit risk prediction, and product defect detection. You can deploy more than 300 pre-trained models for tabular, vision, text, and audio data.

For this post, we use a LightGBM regression pre-trained model from JumpStart. We train the model on a custom dataset and share the model with a Canvas user (business analyst). The pre-trained model can be deployed to an endpoint for inference. JumpStart provides an example notebook to access the model after it is deployed.

In this example, we use the abalone dataset. The dataset contains eight physical measurements, such as length, diameter, and height, that are used to predict the age of the abalone (a regression problem).

  1. Download the abalone dataset from Kaggle.
  2. Create an S3 bucket and upload the train, validation, and custom header datasets.
  3. On the Studio console, under SageMaker JumpStart in the navigation pane, choose Models, notebooks, solutions.
  4. Under Tabular Models, choose LightGBM Regression.
  5. Under Train Model, specify the S3 URIs for the training, validation, and column header datasets.
  6. Choose Train.
  7. In the navigation pane, choose Launched JumpStart assets.
  8. On the Training jobs tab, choose your training job.
  9. On the Share menu, choose Share to Canvas.
  10. Choose the Canvas users to share with, specify the model details, and choose Share.

For more information, refer to Studio users: Share a model to SageMaker Canvas.
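If you would rather script this flow than click through the console, the following is a rough sketch using the JumpStartEstimator class from the SageMaker Python SDK. The model ID, channel name, and S3 prefix are assumptions for illustration; check the JumpStart model catalog and documentation for the exact identifiers:

from sagemaker.jumpstart.estimator import JumpStartEstimator

# Assumed JumpStart model ID for LightGBM regression; verify it in the JumpStart catalog
estimator = JumpStartEstimator(model_id="lightgbm-regression-model")

# Assumed channel name and placeholder S3 prefix for the custom training data
estimator.fit({"training": "s3://<your-bucket>/abalone/train/"})

# Deploy the trained model to a real-time endpoint for inference
predictor = estimator.deploy()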

Use SageMaker model registry and Canvas

With SageMaker model registry, you can catalog models for production, manage model versions, associate metadata, manage the approval status of a model, deploy models to production, and automate model deployment with CI/CD.
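To make the registration step concrete, the following is a minimal boto3 sketch of creating a model package group and registering a model version; the group name, container image URI, and model artifact location are placeholders, and the end-to-end steps for this example live in the GitHub repo cloned in the steps that follow:

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder names and URIs, for illustration only
model_package_group = "abalone-regression"
image_uri = "<inference-container-image-uri>"
model_data_url = "s3://<your-bucket>/abalone/model.tar.gz"

# Create a model package group once per project
sm_client.create_model_package_group(
    ModelPackageGroupName=model_package_group,
    ModelPackageGroupDescription="Abalone age regression models",
)

# Register a model version that can later be shared with Canvas users
sm_client.create_model_package(
    ModelPackageGroupName=model_package_group,
    ModelPackageDescription="Scikit-learn regression model for abalone age",
    InferenceSpecification={
        "Containers": [{"Image": image_uri, "ModelDataUrl": model_data_url}],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
    ModelApprovalStatus="Approved",
)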

Let’s assume the role of a data scientist. For this example, you’re building an end-to-end ML project that includes data preparation, model training, model hosting, model registry, and model sharing with a business analyst. Optionally, for data preparation and preprocessing or postprocessing steps, you can use Amazon SageMaker Data Wrangler and an Amazon SageMaker Processing job. In this example, we use the abalone dataset downloaded from LIBSVM. The target variable is the age of abalone.

  1. In Studio, clone the GitHub repo.
  2. Complete the steps listed in the README file.
  3. On the Studio console, under Models in the navigation pane, choose Model registry.
  4. Choose the model sklearn-reg-ablone.
  5. Share model version 1 from the model registry to Canvas.
  6. Choose the Canvas users to share with, specify the model details, and choose Share.

For instructions, refer to the Model Registry section in Studio users: Share a model to SageMaker Canvas.

Manage shared models

After you share the model using any of the preceding methods, you can go to the Models section in Studio and review all shared models. In the following screenshot, we see three different models shared by a Studio user (data scientist) with different Canvas users (business teams).

Import a shared model and make predictions with Canvas

Let’s assume the role of business analyst and log in to Canvas with your Canvas user.

When a data scientist or Studio user shares a model with a Canvas user, you receive a notification within the Canvas application that a Studio user has shared a model with you. In the Canvas application, the notification is similar to the following screenshot.

You can choose View update to see the shared model, or you can go to the Models page in the Canvas application to discover all the models that have been shared with you. The model import from Studio can take up to 20 minutes.

After importing the model, you can view its metrics and generate real-time predictions with what-if analysis or batch predictions.

Considerations

Keep in mind the following when sharing models with Canvas:

  • You store training and validation datasets in Amazon S3, and the S3 URIs are passed to Canvas with AWS Identity and Access Management (IAM) permissions.
  • Provide the target column to Canvas; otherwise, Canvas uses the first column as the default.
  • For the Canvas container to parse inference data, the model endpoint must accept either the text/csv or application/json content type.
  • Canvas doesn’t support multiple container or inference pipelines.
  • If the training and validation datasets don’t include headers, a data schema is provided to Canvas. By default, the JumpStart platform doesn’t provide headers in the training and validation datasets.
  • With JumpStart, the training job must be complete before you can share the model with Canvas.

Refer to Limitations and troubleshooting to help you troubleshoot any issues you encounter when sharing models.

Clean up

To avoid incurring future charges, delete or shut down the resources you created while following this post. Refer to Logging out of Amazon SageMaker Canvas for more details. Shut down the individual resources, including notebooks, terminals, kernels, apps, and instances. For more information, refer to Shut Down Resources. Delete the model version, the SageMaker endpoint and its resources, the Autopilot experiment resources, and the S3 bucket.
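If you prefer to remove the inference resources programmatically, the following is a minimal boto3 sketch; the endpoint name is a placeholder for whichever endpoint you created while following this post:

import boto3

sm_client = boto3.client("sagemaker")

endpoint_name = "<your-endpoint-name>"  # placeholder

# Look up the endpoint config before deleting the endpoint, then remove both
endpoint_config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)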

Conclusion

Studio allows data scientists to share ML models with business analysts in a few simple steps. Business analysts can benefit from ML models already built by data scientists to solve business problems instead of creating a new model in Canvas. However, it might be difficult to use these models outside the environments in which they are built due to technical requirements and manual processes to import models. This often forces users to rebuild ML models, resulting in the duplication of effort and additional time and resources. Canvas removes these limitations so you can generate predictions in Canvas with models that you have trained anywhere. By using the three patterns illustrated in this post, you can register ML models in the SageMaker model registry, which is a metadata store for ML models, and import them into Canvas. Business analysts can then analyze and generate predictions from any model in Canvas.

To learn more about using SageMaker services, check out the following resources:

If you have questions or suggestions, leave a comment.


About the authors

Aman Sharma is a Senior Solutions Architect with AWS. He works with start-ups, small and medium businesses, and enterprise customers across the APJ region, and has more than 19 years of experience in consulting, architecting, and solutioning. He is passionate about democratizing AI and ML and helping customers design their data and ML strategies. Outside of work, he likes to explore nature and wildlife.

Zichen Nie is a Senior Software Engineer at AWS SageMaker who led the Bring Your Own Model to SageMaker Canvas project last year. She has been working at Amazon for more than 7 years and has experience in both Amazon Supply Chain Optimization and AWS AI services. She enjoys Barre workouts and music after work.
