Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart

Today, we announce the availability of sample notebooks that demonstrate question answering tasks using a Retrieval Augmented Generation (RAG)-based approach with large language models (LLMs) in Amazon SageMaker JumpStart. Text generation using RAG with LLMs enables you to generate domain-specific text outputs by supplying specific external data as part of the context fed to LLMs.

JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. JumpStart provides many pre-trained language models called foundation models that can help you perform tasks such as article summarization, question answering, conversation generation, and image generation.

In this post, we describe RAG and its advantages, and demonstrate how to quickly get started by using a sample notebook to solve a question answering task with a RAG implementation using LLMs in JumpStart. We demonstrate two approaches:

  • How to solve the problem with the open-sourced LangChain library and Amazon SageMaker endpoints in a few lines of code
  • How to use the SageMaker KNN algorithm to perform semantic searching for large-scale data using SageMaker endpoints

LLMs and their constraints

LLMs are trained on large amounts of unstructured data and are great at general text generation. LLMs can store factual knowledge by training their parameters on a large corpus of natural language data.

There are a few limitations of using off-the-shelf pre-trained LLMs:

  • They’re usually trained offline, making the model agnostic to the latest information (for example, a chatbot trained from 2011–2018 has no information about COVID-19).
  • They make predictions by only looking up information stored in their parameters, which limits interpretability.
  • They’re mostly trained on general domain corpora, making them less effective on domain-specific tasks. There are scenarios when you want models to generate text based on specific data rather than generic data. For example, a health insurance company may want their question answering bot to answer questions using the latest information stored in their enterprise document repository or database, so the answers are accurate and reflect their unique business rules.

Currently, there are two popular ways to reference specific data in LLMs:

  • Insert data as context in the model prompt as a way to provide the information that the model can use while creating the result
  • Fine-tune the model by providing a file with prompt and completion pairs

The challenge of the context-based approach is that models come with limited context size, and including all the documents as context may not fit into the allowed context size of the model. Depending on the model used, there may also be additional cost for larger context.
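For example, one rough way to check whether a static context would even fit is to count tokens with the Hugging Face tokenizer for Flan-T5. In this sketch, documents, question, and the 512-token budget are placeholder assumptions:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
MAX_INPUT_TOKENS = 512  # assumed input budget; the real limit depends on the model and its configuration

# documents and question are placeholders for your reference texts and user query
prompt = "Answer based on context:\n\n" + "\n".join(documents) + "\n\n" + question
n_tokens = len(tokenizer(prompt)["input_ids"])
if n_tokens > MAX_INPUT_TOKENS:
    print(f"{n_tokens} tokens exceed the budget; trim the context or retrieve fewer documents.")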

For the fine-tuning approach, generating the correctly formatted information is time consuming and involves cost. In addition, if the external data used for fine-tuning changes frequently, frequent fine-tuning and retraining are needed to keep results accurate. Frequent retraining impacts speed to market and adds to the overall solution cost.

To demonstrate these constraints, we used a Flan T5 XXL model and asked the following question:

question = "Which instances can I use with Managed Spot Training in SageMaker?"

We get the following response:

"""For model: huggingface-text2text-flan-t5-xxl, the generated output is: 
the Managed Spot Training is a subscriptions product available for the following instances: Data Science Virtual Machine (DSVM), DSVM High, and DSVM Low.
"""

As you can see, the response is not accurate. The correct answer is that all SageMaker instances support Managed Spot Training.

We tried the same question but with additional context passed along with the question:

question + context + prompt = """
Answer based on context:

Managed Spot Training can be used with all instances supported in Amazon SageMaker. Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.

Which instances can I use with Managed Spot Training in SageMaker?
"""

We got the following response this time:

"""For model: huggingface-text2text-flan-t5-xxl, the generated output is: 
instances supported in Amazon SageMaker
"""

The response is better but still not accurate. In real production use cases, users may send a wide variety of queries, and to answer them accurately you would need to include all or most of the available information in a static context. With that approach, you quickly hit the context size limit, because even information that isn’t relevant to the question is sent as part of the context. This is where a RAG-based approach helps you create scalable and accurate responses to user queries.

Retrieval Augmented Generation

To solve the constraints we discussed, we can use Retrieval Augmented Generation (RAG) with LLMs. RAG retrieves data from outside the language model (non-parametric) and augments the prompts by adding the relevant retrieved data in context. RAG models were introduced by Lewis et al. in 2020 as a model where parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.

In RAG, the external data can come from multiple data sources, such as a document repository, databases, or APIs. The first step is to convert the documents and the user query into a format in which they can be compared so a relevancy search can be performed. To make the formats comparable, a document collection (the knowledge library) and the user-submitted query are converted into numerical representations using embedding language models. The embeddings are numerical representations of concepts in text. Next, based on the embedding of the user query, relevant text is identified in the document collection by a similarity search in the embedding space. Then the prompt provided by the user is appended with the relevant text that was found, and this text is added to the context. The prompt is then sent to the LLM, and because the context contains relevant external data along with the original prompt, the model output is relevant and accurate.
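The following is a conceptual sketch of this flow. Here, embed() is a placeholder for any embedding model (such as the GPT-J 6B endpoint we use later in this post), and documents and user_query stand in for your knowledge library and query:

import numpy as np

doc_vectors = np.array([embed(doc) for doc in documents])   # embed the knowledge library (placeholder embed())
query_vector = np.array(embed(user_query))                  # embed the user query with the same model

# Cosine similarity between the query and every document in the embedding space
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_k = np.argsort(scores)[::-1][:3]                        # indexes of the most relevant documents

# Augment the prompt with the retrieved text before sending it to the LLM
context = "\n".join(documents[i] for i in top_k)
prompt = f"Answer based on context:\n\n{context}\n\n{user_query}"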

To maintain up-to-date information for the reference documents, you can asynchronously update the documents and update embedding representation of the documents. This way, the updated documents will be used to generate answers for future questions to provide accurate responses.

The following diagram shows the conceptual flow of using RAG with LLMs.

In this post, we demonstrate how to implement a question answering application with the following steps:

  1. Generate an embedding for each document in the knowledge library with a SageMaker GPT-J-6B embedding model.
  2. Identify the top K most relevant documents based on the user query.
    1. Generate the embedding of the query using the same embedding model.
    2. Search the indexes of the top K most relevant documents in the embedding space using an in-memory FAISS search.
    3. Use the indexes to retrieve the corresponding documents.
  3. Use the retrieved relevant documents as context with the prompt and question, and send them to the SageMaker LLM to generate the response.

We demonstrate the following approaches:

  • How to solve a question answering task with SageMaker LLMs and embedding endpoints and the open-sourced library LangChain in a few lines of code. In particular, we use two SageMaker endpoints for the LLM (Flan T5 XXL) and embedding model (GPT-J 6B), and the vector database used is in-memory FAISS. For more details, see the GitHub repo.
  • If the in-memory FAISS doesn’t fit into your large dataset, we provide you with a SageMaker KNN algorithm to perform the semantic search, which also uses FAISS as the underlying searching algorithm. For details, see the GitHub repo.

The following diagram depicts the solution architecture.

JumpStart RAG-based implementation notebook with LangChain

LangChain is an open-source framework for developing applications powered by language models. LangChain provides a generic interface for many different LLMs. It also makes it easier for developers to chain various LLMs together and build powerful applications. LangChain provides a standard interface for memory and a collection of memory implementations to persist the state between calls of agents or chains.

LangChain has many other utility features that can add to developer productivity. These features include a prompt template that helps customize prompts using variables in the prompt template, agents to build end-to-end applications, indexes for search and retrieval steps of the chain, and much more. To further explore LangChain capabilities, refer to the LangChain documentation.

Create the LLM model

As a first step, deploy the JumpStart LLM model of your choice. In this demo, we use a JumpStart Flan T5 XXL model endpoint. For deployment instructions, refer to Zero-shot prompting for the Flan-T5 foundation model in Amazon SageMaker JumpStart. Based on your use case, you can also deploy other instruction-tuned models like Flan T5 UL2 or BloomZ 7B1. For details, see the example notebook.

To use the SageMaker LLM endpoint with LangChain, we use langchain.llms.sagemaker_endpoint.SagemakerEndpoint, which abstracts the SageMaker LLM endpoint. We need to perform a transformation for the request and response payload as shown in the following code for the LangChain SageMaker integration. Note that you may need to adjust the code in ContentHandler based on the content_type and accepts format of the LLM model that you choose to use.

import json

from langchain.llms.sagemaker_endpoint import SagemakerEndpoint
# ContentHandlerBase is exposed by the same module in the LangChain version used for this post;
# newer LangChain releases name this base class LLMContentHandler.
from langchain.llms.sagemaker_endpoint import ContentHandlerBase

class ContentHandler(ContentHandlerBase):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        input_str = json.dumps({"text_inputs": prompt, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["generated_texts"][0]

content_handler = ContentHandler()

sm_llm = SagemakerEndpoint(
    endpoint_name=_MODEL_CONFIG_["huggingface-text2text-flan-t5-xxl"]["endpoint_name"],
    region_name=aws_region,
    model_kwargs=parameters,
    content_handler=content_handler,
)

Create the embedding model

Next, we need to get our embedding model ready. We deploy the GPT-J 6B model as the embedding model. If you’re using a JumpStart embedding model, you need to customize the LangChain SageMaker endpoint embedding class and transform the model request and response to integrate with LangChain. For a detailed implementation, refer to the GitHub repo.
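For reference, the following is a minimal sketch of what such a customization might look like. It assumes the SagemakerEndpointEmbeddings base class, its _embedding_func helper, and the EmbeddingsContentHandler class from LangChain (class and module names vary across LangChain versions), as well as the "embedding" response key returned by the JumpStart GPT-J endpoint; the full implementation is in the GitHub repo.

import json
from typing import List

from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler


class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    # Batch the documents so each request stays within the endpoint's payload limits
    def embed_documents(self, texts: List[str], chunk_size: int = 5) -> List[List[float]]:
        results = []
        chunk_size = max(1, min(chunk_size, len(texts)))
        for i in range(0, len(texts), chunk_size):
            results.extend(self._embedding_func(texts[i : i + chunk_size]))
        return results


class EmbeddingContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt, model_kwargs={}) -> bytes:
        return json.dumps({"text_inputs": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output) -> List[List[float]]:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json["embedding"]  # assumed response key of the GPT-J embedding endpoint


content_handler = EmbeddingContentHandler()

With those pieces in place, the embedding model is instantiated as follows: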

embeddings = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=_MODEL_CONFIG_["huggingface-textembedding-gpt-j-6b"]["endpoint_name"],
    region_name=aws_region,
    content_handler=content_handler,
)

Load domain-specific documents using the LangChain document loader and create an index

We use the CSVLoader package in LangChain to load CSV-formatted documents into the document loader:

from langchain.document_loaders import CSVLoader

loader = CSVLoader(file_path="rag_data/processed_data.csv")
documents = loader.load()

Next, we use TextSplitter to preprocess data for embedding purposes and use the SageMaker embedding model GPT-J 6B to create the embeddings. We store the embeddings in a FAISS vector store to create an index. We use this index to find relevant documents that are semantically similar to the user’s query.

The following code shows how all these steps are done by the VectorstoreIndexCreator class in just a few lines of code in LangChain to create a concise implementation of question answering with RAG:

from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=embeddings,
    text_splitter=CharacterTextSplitter(chunk_size=300, chunk_overlap=0),
)
index = index_creator.from_loaders([loader])

Use the index to search for relevant context and pass it to the LLM model

Next, use the query method on the created index and pass the user’s question and the SageMaker LLM endpoint. LangChain selects the top four closest documents (K=4) and passes the relevant context extracted from the documents to generate an accurate response. See the following code:

index.query(question=question, llm=sm_llm)

We get the following response for the query using the RAG-based approach with Flan T5 XXL:

"""For model: huggingface-text2text-flan-t5-xxl, the generated output is: 
Managed Spot Training can be used with all instances supported in Amazon SageMaker
"""

The response looks more accurate than the responses we got with the earlier approaches, which used either no context or a static context that may not always be relevant.

Alternate approach to implement RAG with more customization using SageMaker and LangChain

In this section, we show you another approach to implement RAG using SageMaker and LangChain. This approach offers the flexibility to configure the top K parameter for a relevancy search in the documents. It also allows you to use the LangChain feature of prompt templates, which let you easily parameterize prompt creation instead of hard coding the prompts.

In the following code, we explicitly use FAISS to generate an embedding for each document in the knowledge library with the SageMaker GPT-J-6B embedding model. Then we identify the top K (K=3) most relevant documents based on the user query.

docsearch = FAISS.from_documents(documents, embeddings)
docs = docsearch.similarity_search(question, k=3)

Next, we use a prompt template and chain it with the SageMaker LLM:

from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

prompt_template = """Answer based on context:\n\n{context}\n\n{question}"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=sm_llm, prompt=PROMPT)

We send the top three (K=3) relevant documents we found as context to the prompt by using a LangChain chain:

result = chain({"input_documents": docs, "question": question}, return_only_outputs=True)["output_text"]

With this approach of RAG implementation, we were able to take advantage of the additional flexibility of LangChain prompt templates and customize the number of documents searched for a relevancy match using the top K hyperparameter.

JumpStart RAG-based implementation notebook with SageMaker KNN

In this section, we implement the RAG-based approach using the KNN algorithm to find relevant documents and create an enhanced context. In this approach, we don’t use LangChain, but we use the same dataset (the Amazon SageMaker FAQs) as the knowledge documents, and the same GPT-J-6B embedding model and Flan T5 XXL LLM as in the previous LangChain approach.

If you have a large dataset, the SageMaker KNN algorithm may provide you with an effective semantic search. The SageMaker KNN algorithm also uses FAISS as the underlying search algorithm. The notebook for this solution can be found on GitHub.

First, we deploy the Flan T5 XXL LLM and the GPT-J 6B embedding model in the same way as in the previous section. For each record in the knowledge database, we generate an embedding vector using the GPT-J embedding model.
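The following is a rough sketch of this step. It reuses the query_endpoint_with_json_payload and parse_response_text_embed helper functions that appear later in this section, and it assumes the FAQ text lives in a df["Answer"] column:

import numpy as np

# Embed every record of the knowledge library with the GPT-J embedding endpoint (sketch);
# endpoint_name_embed holds the name of the deployed GPT-J 6B embedding endpoint
train_features = []
for text in df["Answer"].tolist():
    response = query_endpoint_with_json_payload(
        text, endpoint_name_embed, content_type="application/x-text"
    )
    train_features.append(parse_response_text_embed(response))

train_features = np.array(train_features)  # shape: (num_documents, embedding_dim)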

Next, we use a SageMaker KNN training job to index the embedding of the knowledge data. The underlying algorithm used to index the data is FAISS. We want to find the top five most relevant documents, so we set the TOP_K variable to 5. We create the estimator for the KNN algorithm, run the training job, and deploy the KNN model to find indexes of the top five documents matching the query. See the following code:

import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

# sess, aws_role, bucket, prefix, TOP_K, train_features, and s3_train_data are defined
# earlier in the notebook

def trained_estimator_from_hyperparams(s3_train_data, hyperparams, output_path):
    """
    Create an Estimator from the given hyperparams, fit to training data,
    and return a deployed predictor

    """
    # set up the estimator
    knn = sagemaker.estimator.Estimator(
        get_image_uri(boto3.Session().region_name, "knn"),
        aws_role,
        instance_count=1,
        instance_type="ml.m5.2xlarge",
        output_path=output_path,
        sagemaker_session=sess,
    )
    knn.set_hyperparameters(**hyperparams)

    # train a model. fit_input contains the locations of the train data
    fit_input = {"train": s3_train_data}
    knn.fit(fit_input)
    return knn

hyperparams = {"feature_dim": train_features.shape[1], "k": TOP_K, "sample_size": train_features.shape[0], "predictor_type": "classifier"}
output_path = f"s3://{bucket}/{prefix}/default_example/output"
knn_estimator = trained_estimator_from_hyperparams(
    s3_train_data, hyperparams, output_path)
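
The preceding snippet trains (indexes) the data but doesn’t show the deployment itself. A minimal sketch of deploying the trained KNN model, with an assumed instance type and the SageMaker SDK’s standard serializers, might look like the following:

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy the trained KNN index to a real-time endpoint (instance type is an assumption)
knn_predictor = knn_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
knn_predictor.serializer = CSVSerializer()
knn_predictor.deserializer = JSONDeserializer()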

Next, we create an embedding representation of the query using the GPT-J-6B embedding model that we used for creating an embedding of the knowledge library documents:

query_response = query_endpoint_with_json_payload(question, endpoint_name_embed, content_type="application/x-text")
question_embedding = parse_response_text_embed(query_response)

Then we use the KNN endpoint, passing the embedding of the query to get the indexes of the top K most relevant documents. We use the indexes to retrieve the corresponding textual documents. Next, we concatenate the documents, ensuring the maximum allowed length of the context is not exceeded.
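The following is a hedged sketch of these steps. It assumes the knn_predictor from the earlier deployment sketch, a hypothetical parse_knn_response helper (the exact response format depends on how the KNN model is configured), and a df["Answer"] column holding the document text:

# Query the KNN endpoint with the question embedding to get the nearest document indexes (sketch)
knn_response = knn_predictor.predict([question_embedding])
top_k_indexes = parse_knn_response(knn_response)  # hypothetical helper; depends on the KNN model configuration

# Retrieve the matching documents and concatenate them without exceeding the allowed context length
MAX_SECTION_LEN = 500  # maximum sequence length used in this example
context = ""
for idx in top_k_indexes:
    section = df["Answer"].iloc[idx]
    if len((context + " " + section).split()) > MAX_SECTION_LEN:  # word count as a rough length proxy
        break
    context = context + " " + section

In our run, this step produced the following context: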

"""With maximum sequence length 500, selected top 4 document sections: 
  Managed Spot Training can be used with all instances supported in Amazon SageMaker.
  Managed Spot Training is supported in all AWS Regions where Amazon SageMaker is currently available.
  The difference between Savings Plans for Amazon SageMaker and Savings Plans for EC2 is in the services they 
  include. 
  SageMaker Savings Plans apply only to SageMaker ML Instance usage.
  There are no fixed limits to the size of the dataset you can use for training models with Amazon SageMaker.
"""

Now we come to our final step, in which we combine the query, the prompt, and the context containing text from relevant documents, and pass it to the text generation Flan T5 XXL model to generate the answer.
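
The following is a minimal sketch of this step, calling the Flan T5 XXL endpoint directly through boto3. The endpoint_name_llm variable and max_length value are assumptions; the generated_texts response key matches the content handler shown earlier in this post:

import json
import boto3

prompt = f"Answer based on context:\n\n{context}\n\n{question}"

payload = json.dumps({"text_inputs": prompt, "max_length": 200}).encode("utf-8")
response = boto3.client("runtime.sagemaker").invoke_endpoint(
    EndpointName=endpoint_name_llm,  # assumed variable holding the Flan T5 XXL endpoint name
    ContentType="application/json",
    Body=payload,
)
answer = json.loads(response["Body"].read())["generated_texts"][0]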

We get the following response for the query using a RAG-based approach with Flan T5 XXL:

"""
For model: huggingface-text2text-flan-t5-xxl, the generated output is: 

Managed Spot Training can be used with all instances supported in Amazon SageMaker
"""

Clean up

Make sure to delete the endpoints created in this notebook when you’re not using them to avoid incurring recurring costs.

Conclusion

In this post, we demonstrated the implementation of a RAG-based approach with LLMs for question answering tasks using two approaches: LangChain and the built-in KNN algorithm. The RAG-based approach optimizes the accuracy of the text generation using Flan T5 XXL by dynamically providing relevant context that was created by searching a list of documents.

You can use these notebooks in SageMaker as is, or you can customize them to your needs. To customize them, you can use your own set of documents in the knowledge library, use other relevancy search implementations like OpenSearch, and use other embedding models and text generation LLMs available on JumpStart.

We look forward to seeing what you build on JumpStart using a RAG-based approach!


About the authors

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Rachna Chadha is a Principal Solution Architect AI/ML in Strategic Accounts at AWS. Rachna is an optimist who believes that ethical and responsible use of AI can improve society in future and bring economical and social prosperity. In her spare time, Rachna likes spending time with her family, hiking and listening to music.

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker built-in algorithms team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Hemant Singh is a Machine Learning Engineer with experience in Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He received his master’s degree from the Courant Institute of Mathematical Sciences and his B.Tech from IIT Delhi. He has experience working on a diverse range of machine learning problems in natural language processing, computer vision, and time series analysis.

Manas Dadarkar is a Software Development Manager owning the engineering of the Amazon Forecast service. He is passionate about the applications of machine learning and making ML technologies easily available for everyone to adopt and deploy to production. Outside of work, he has multiple interests including travelling, reading and spending time with friends and family.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker built-in algorithms and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Read More

Prepare image data with Amazon SageMaker Data Wrangler

The rapid adoption of smartphones and other mobile platforms has generated an enormous amount of image data. According to Gartner, unstructured data now represents 80–90% of all new enterprise data, but just 18% of organizations are taking advantage of it. This is mainly due to a lack of expertise and the large amount of time and effort required to sift through all that information to identify quality data and useful insights.

Before you can use image data for labeling, training, or inference, it first needs to be cleaned (deduplicate, drop corrupted images or outliers, and so on), analyzed (such as group images based on certain attributes), standardized (resize, change orientation, standardize lighting and color, and so on), and augmented for better labeling, training, or inference results (enhance contrast, blur some irrelevant objects, upscale, and so on).

Today, we are happy to announce that with Amazon SageMaker Data Wrangler, you can perform image data preparation for machine learning (ML) using little to no code.

Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface.

Data Wrangler’s image data preparation feature addresses your needs via a visual UI for image preview, import, transformation, and export. You can browse, import, and transform image data just like how you use Data Wrangler for tabular data. In this post, we show an example of how to use this feature.

Solution overview

For this post, we focus on the Data Wrangler component of image processing, which we use to help an image classification model detect crashes with better quality images. We use the following services:

The following diagram illustrates the solution architecture.

Data Wrangler is a SageMaker feature available within Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

For this use case, we use CCTV footage data of accidents and non-accidents available from Kaggle. The dataset contains frames captured from YouTube videos of accidents and non-accidents. The images are split into train, test, and validation folders.

Prerequisites

As a prerequisite, download the sample dataset and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. We use this dataset for image processing and subsequently for training a custom model.
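For example, a minimal way to upload the extracted dataset with the SageMaker Python SDK follows; the local folder name and S3 prefix are placeholders:

import sagemaker

session = sagemaker.Session()
s3_uri = session.upload_data(
    path="cctv-accident-images",        # local folder containing the extracted dataset (placeholder)
    bucket=session.default_bucket(),
    key_prefix="datasets/car-crash",    # S3 prefix to upload to (placeholder)
)
print(s3_uri)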

Process images using Data Wrangler

Start your Studio environment and create a new Data Wrangler flow called car_crash_detection_data.flow. Now let’s import the dataset to Data Wrangler from the S3 bucket where the dataset was uploaded. Data Wrangler allows you to import datasets from different data sources, including images.

Data Wrangler supports a variety of built-in transformations for image processing, including the following:

  • Blur image – Data Wrangler supports different techniques from an open-source image library (Gaussian, Average, Median, Motion, and more) for blurring images. For details of each technique, refer to augmenters.blur.
  • Corrupt image – Data Wrangler also supports different corruption techniques (Gaussian noise, Impulse noise, Speckle noise, and more). For details of each technique, refer to augmenters.imgcorruptlike.
  • Enhance image contrast – You can deploy different contrast enhancement techniques (Gamma contrast, Sigmoid contrast, Log contrast, Linear contrast, Histogram equalization, and more). For more details, refer to augmenters.contrast.
  • Resize image – Data Wrangler supports different resizing techniques (cropping, padding, thumbnail, and more). For more details, refer to augmenters.size.

In this post, we highlight a subset of these functionalities through the following steps:

  1. Upload images from the source S3 bucket and preview the image.
  2. Create quick image transformations using the built-in transformers.
  3. Write custom code like finding outliers or using the Search Example Snippets function.
  4. Export the final cleansed data to another S3 bucket.
  5. Combine images from different Amazon S3 sources into one Data Wrangler flow.
  6. Create a job to trigger the Data Wrangler flow.

Let’s look at each step in detail.

Upload images from the source S3 bucket and preview the images

To upload all the images under one folder, complete the following steps:

  1. Select the S3 folder containing the images.
  2. For File type, choose Image.
  3. Select Import nested directories.
  4. Choose Import.

You can preview the images that you’re uploading by turning the Preview option on. Note that Data Wrangler only imports 100 random images for the interactive data preparation.

The preview function allows you to view images and preview the quality of images imported.

For this post, we load the images from both the accident and non-accident folders. We create a set of transformations for each: one flow to corrupt images, resize, and remove outliers, and another flow to enhance image contrast, resize, and remove outliers.

Transform images using built-in transformations

Image data transformation is very important to regularize your model network and increase the size of your training set. These transformations can change the images’ pixel values but still keep almost all the information of the image, so that a human could hardly tell whether it was augmented or not. This forces the model to be more flexible with the wide variety of objects in the image, regarding position, orientation, size, color, and so on. Models trained with data augmentation usually generalize better.

Data Wrangler offers you built-in and custom transformations to improve the quality of images for labeling, training, or inference. We use some of these transformations to improve the image dataset fed to the model for machine learning.

Corrupt images

The first built-in transformation we use is Corrupt images.

We add a step with Corruption set to Impulse noise and set our severity.

Corrupting an image or creating any kind of noise helps make a model more robust. The model can predict with more accuracy even if it receives a corrupted image because it was trained with corrupt and non-corrupt images.

Enhance contrast

We also add a transform to enhance the Gamma contrast.

Resize images

We can use a built-in transformation to resize all the images to add symmetry. Data Wrangler offers several resize options, as shown in the following screenshot.

The following example shows images prior to resize: 720 x 1280.

The following images have been resized to 620 x 1180.

Add a custom transformation to detect and remove image outliers

With image preparation in Data Wrangler, we can also invoke another endpoint for another model. You can find some sample scripts with boilerplate code in the Search example snippets section.

For our example, we create a new transformation to remove outliers. Note that this code is for demonstration purposes only; you may need to modify it to suit your production workload needs.

This is a custom snippet written in PySpark. Before running this step, we need a deployed MobileNet image embedding endpoint (for example, jumpstart-dft-mobilenet-v2-100-224-featurevector-4). After the endpoint is deployed, we can call the JumpStart model endpoint from the transform.

JumpStart models solve common ML tasks such as image classification, object detection, text classification, sentence pair classification, and question answering, and are available for quick model creation and deployment.

To learn how to create an image embedding model in JumpStart, refer to the GitHub repo. The steps are similar to creating an image classification model with JumpStart. See the following code:

# Table is available as variable 'df'
# number of outliers to cut off
n_outliers = 1

import json
import ast
import boto3
from pyspark.sql.functions import col, udf
import numpy as np
from sklearn.neighbors import NearestNeighbors

# The UDF below queries a SageMaker endpoint to get an embedding for every image.
# This example uses the SageMaker JumpStart pretrained MobileNet model for image embedding.
# Replace the endpoint name below with your own.

def get_embedding(image):
    response = boto3.client("runtime.sagemaker").invoke_endpoint(
        EndpointName="jumpstart-dft-mobilenet-v2-130-224-featurevector-4",
        ContentType="application/x-image",
        Body=image.data,
    )
    return json.loads(response["Body"].read())["embedding"]

convertUDF = udf(lambda z: get_embedding(z))
df_pd = df.select("image_col.path", convertUDF(col("image_col")).alias("emb")).toPandas()

# parse the endpoint output into a NumPy array
df_np = []
for val in df_pd["emb"].to_list():
    df_np.append(ast.literal_eval(val))
df_np = np.array(df_np)

# set up a nearest neighbor index over the embeddings
knn_tree = NearestNeighbors(n_neighbors=10, metric="cosine")
knn_tree.fit(df_np)

# score all images by their distance to the closest 10 neighbors in the embedding space
scores = knn_tree.kneighbors(df_np, n_neighbors=10, return_distance=True)[0][:, 1:]
scores = np.mean(scores, axis=1)

# sort by the score and declare the n_outliers most abnormal images as outliers
scores_argsort = np.argsort(scores)
outliers = df_pd["path"][list(scores_argsort[-n_outliers:])].to_list()

The following screenshots show an example of our custom transform.

This identifies the outlier. Next, let’s filter out the outlier. You can use the following snippet at the end of the custom transform:

df = df.filter(~ df.path.isin(outliers))

Choose Preview and Add to save the changes.

Export the final cleansed data to another S3 bucket

After adding all the transformations, let’s define the destination in Amazon S3.

Provide the location of your S3 bucket.

Combine images from different Amazon S3 sources into one Data Wrangler flow

In the previous sections, we processed images of accidents. We can implement a similar flow for any other images (in our case, non-accident images). Choose Import and follow the same steps to create a second flow.

Now we can view the flows for each dataset.

Create a job to run the automated flow

When we create a job, we scale the recipe to the entire dataset, which could be thousands or millions of images. You can also schedule the flow to run periodically, and you can parameterize the input data source to scale the processing. For details on job scheduling and input parameterization, refer to Create a Schedule to Automatically Process New Data.

Choose Create job to run a job for the end-to-end flow.

Provide the details of the job and select both datasets.

Congratulations! You have successfully created a job to process images using Data Wrangler.

Model training and deployment

JumpStart provides one-click, end-to-end solutions for many common ML use cases. We can use the images prepared with Data Wrangler when creating a quick image classification model in JumpStart. For instructions, refer to Run image classification with Amazon SageMaker JumpStart.

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.

Data Wrangler automatically saves your data flow every 60 seconds. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

  5. Shut down the JumpStart endpoint that you created for the outlier transformation image embedding.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated using image data preparation for ML on Data Wrangler. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler, and check out the latest information on the Data Wrangler product page.


About the Authors

Deepmala Agarwal works as an AWS Data Specialist Solutions Architect. She is passionate about helping customers build out scalable, distributed, and data-driven solutions on AWS. When not at work, Deepmala likes spending time with family, walking, listening to music, watching movies, and cooking!

Meenakshisundaram Thandavarayan works for AWS as an AI/ ML Specialist. He has a passion to design, create, and promote human-centered data and analytics experiences. Meena focusses on developing sustainable systems that deliver measurable, competitive advantages for strategic customers of AWS. Meena is a connector, design thinker, and strives to drive business to new ways of working through innovation, incubation and democratization.

Lu Huang is a Senior Product Manager on Data Wrangler. She’s passionate about AI, machine learning, and big data.

Nikita Ivkin is a Senior Applied Scientist at Amazon SageMaker Data Wrangler with interests in machine learning and data cleaning algorithms.

Read More

Improve multi-hop reasoning in LLMs by learning from rich human feedback

Recent large language models (LLMs) have enabled tremendous progress in natural language understanding. However, they are prone to generating confident but nonsensical explanations, which poses a significant obstacle to establishing trust with users. In this post, we show how to incorporate human feedback on incorrect reasoning chains for multi-hop reasoning to improve performance on these tasks. Instead of collecting the reasoning chains from scratch by asking humans, we learn from rich human feedback on model-generated reasoning chains using the prompting abilities of the LLMs. We collect two such datasets of human feedback in the form of (correction, explanation, error type) for the StrategyQA and Sports Understanding datasets, and evaluate several common algorithms to learn from such feedback. Our proposed methods perform competitively with chain-of-thought prompting using the base Flan-T5, and the fine-tuned model is better at judging the correctness of its own answers.

Solution overview

With the onset of large language models, the field has seen tremendous progress on various natural language processing (NLP) benchmarks. Among them, the progress has been striking on relatively simpler tasks such as short context or factual question answering, compared to harder tasks that require reasoning such as multi-hop question answering. The performance of certain tasks using LLMs may be similar to random guessing at smaller scales, but improves significantly at larger scales. Despite this, the prompting abilities of LLMs have the potential to provide some relevant facts required to answer the question.

However, those models may not reliably generate correct reasoning chains or explanations. Those confident but nonsensical explanations are even more prevalent when LLMs are trained using Reinforcement Learning from Human Feedback (RLHF), where reward hacking may occur.

Motivated by this, we try to address the following research question: can we improve reasoning of LLMs by learning from human feedback on model-generated reasoning chains? The following figure provides an overview of our approach: we first prompt the model to generate reasoning chains for multi-hop questions, then collect diverse human feedback on these chains for diagnosis and propose training algorithms to learn from the collected data.

We collect diverse human feedback on two multi-hop reasoning datasets, StrategyQA and Sports Understanding from BigBench. For each question and model-generated reasoning chain, we collect the correct reasoning chain, the type of error in the model-generated reasoning chain, and a natural language description of why that error is present in the provided reasoning chain. The final dataset contains feedback for 1,565 samples from StrategyQA and 796 examples from Sports Understanding.

We propose multiple training algorithms to learn from the collected feedback. First, we propose a variant of self-consistency in chain-of-thought prompting by considering a weighted variant of it that can be learned from the feedback. Second, we propose iterative refinement, where we iteratively refine the model-generated reasoning chain until it’s correct. We demonstrate empirically on the two datasets that fine-tuning an LLM, namely Flan-T5 using the proposed algorithms, performs comparably to the in-context learning baseline. More importantly, we show that the fine-tuned model is better at judging if its own answer is correct compared to the base Flan-T5 model.

Data collection

In this section, we describe the details of the feedback we collected and the annotation protocol followed during data collection. We collected feedback for model generations based on two reasoning-based datasets: StrategyQA and Sports Understanding from BigBench. We used GPT-J to generate the answer for StrategyQA and Flan-T5 to generate the answer for the Sports Understanding dataset. In each case, the model was prompted with k in-context examples containing question, answer, and explanation, followed by the test question.
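The exact prompts are part of the project artifacts; the following is only an illustrative sketch of how such a k-shot chain-of-thought prompt can be assembled, with placeholder field names and answer format:

# Illustrative k-shot chain-of-thought prompt assembly (not the exact prompt used in the study)
examples = [
    {"question": "…", "explanation": "…", "answer": "yes"},
    # ... k in-context examples in total
]

def build_prompt(test_question, examples):
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}\nA: {ex['explanation']} So the answer is {ex['answer']}.")
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)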

The following figure shows the interface we used. Annotators are given the question, the model-generated answer, and the explanation split into steps.

For each question, we collected the following feedback:

  • Subquestions – The annotators decompose the original question into simpler subquestions required to answer the original question. This task was added after a pilot where we found that adding this task helps prepare the annotators and improve the quality of the rest of the tasks.
  • Correction – Annotators are provided with a free-form text box pre-filled with the model-generated answer and explanation, and asked to edit it to obtain the correct answer and explanation.
  • Error type – Among the most common types of error we found in the model generations (Factual Error, Missing Facts, Irrelevant Facts, and Logical Inconsistency), annotators were asked to pick one or more of the error types that apply to the given answer and explanation.
  • Error description – The annotators were instructed to not only classify the errors but also give a comprehensive justification for their categorization, including pinpointing the exact step where the mistake occurred and how it applies to the answer and explanation provided.

We used Amazon SageMaker Ground Truth Plus in our data collection. The data collection took place over multiple rounds. We first conducted two small pilots of 30 examples and 200 examples, respectively, after which the annotator team was given detailed feedback on the annotation. We then conducted the data collection over two batches for StrategyQA, and over one batch for Sports Understanding, giving periodic feedback throughout—a total of 10 annotators worked on the task over a period of close to 1 month.

We gathered feedback on a total of 1,565 examples for StrategyQA and 796 examples for Sports Understanding. The following table illustrates the percentage of examples that were error-free in the model generation and the proportion of examples that contained a specific error type. It’s worth noting that some examples may have more than one error type.

Error Type | StrategyQA | Sports Understanding
None | 17.6% | 31.28%
Factual Error | 27.6% | 38.1%
Missing Facts | 50.4% | 46.1%
Irrelevant Facts | 14.6% | 3.9%
Logical Inconsistency | 11.2% | 5.2%

Learning algorithms

For each question q, and model-generated answer and explanation m, we collected the following feedback: correct answer and explanation c, type of error present in m (denoted by t), and error description d, as described in the previous section.

We used the following methods:

  • Multitask learning – A simple baseline to learn from the diverse feedback available is to treat each of them as a separate task. More concretely, we fine-tune Flan-T5 (text to text) with the objective maximize p(c|q) + p(t|q, m) + p(d|q, m). For each term in the objective, we use a separate instruction appropriate for the task (for example, “Predict error in the given answer”). We also convert the categorical variable t into a natural language sentence. During inference, we use the instruction for the term p(c|q) (“Predict the correct answer for the given question”) to generate the answer for the test question.
  • Weighted self-consistency – Motivated by the success of self-consistency in chain-of-thought prompting, we propose a weighted variant of it. Instead of treating each sampled explanation from the model as correct and taking a simple majority vote, we first consider whether the explanation is correct and then aggregate accordingly. We first fine-tune Flan-T5 with the same objective as in multitask learning. During inference, given a test question q, we sample multiple possible answers with the instruction for p(c|q): a1, a2, ..., an. For each sampled answer ai, we use the instruction for the term p(t|q, m) (“Predict error in the given answer”) to identify if it contains an error: ti = argmax p(t|q, ai). Each answer ai is assigned a weight of 1 if it’s correct; otherwise, it’s assigned a weight smaller than 1 (a tunable hyperparameter). The final answer is obtained by considering a weighted vote over all the answers a1 to an (see the sketch after this list).
  • Iterative refinement – In the previous proposed methods, the model directly generates the correct answer c conditioned on the question q. Here we propose to refine the model-generated answer m to obtain the correct answer for a given question. More specifically, we first fine-tune Flan-T5 (text to text with the objective) with maximize p(t; c|q, m), where ; denotes the concatenation (error type t followed by the correct answer c). One way to view this objective is that the model is first trained to identify the error in given generation m, and then to remove that error to obtain the correct answer c. During inference, we can use the model iteratively until it generates the correct answer—given a test question q, we first obtain the initial model generation m (using pre-trained Flan-T5). We then iteratively generate the error type ti and potential correct answer ci until ti = no error (in practice, we set a maximum number of iterations to a hyperparameter), in which case the final correct answer will be ci-1 (obtained from p(ti ; ci | q, ci-1)).
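
To make the weighted self-consistency procedure concrete, the following is a simplified sketch. The sample_answer, predicts_error, and normalize_answer helpers are hypothetical wrappers around the fine-tuned model’s p(c|q) and p(t|q, m) instructions and answer post-processing:

from collections import defaultdict

def weighted_self_consistency(question, n_samples=20, error_weight=0.5):
    # Sample several candidate answers and downweight the ones the model itself flags as erroneous
    votes = defaultdict(float)
    for _ in range(n_samples):
        answer = sample_answer(question)                                     # sample from p(c|q)
        weight = error_weight if predicts_error(question, answer) else 1.0   # p(t|q, m) check
        votes[normalize_answer(answer)] += weight                            # map the explanation to a final label
    return max(votes, key=votes.get)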

Results

For both datasets, we compare all the proposed learning algorithms with the in-context learning baseline. All models are evaluated on the dev set of StrategyQA and Sports Understanding. The following table shows the results.

Method | StrategyQA | Sports Understanding
Flan-T5 4-shot Chain-of-Thought In-Context Learning | 67.39 ± 2.6% | 58.5%
Multitask Learning | 66.22 ± 0.7% | 54.3 ± 2.1%
Weighted Self Consistency | 61.13 ± 1.5% | 51.3 ± 1.9%
Iterative Refinement | 61.85 ± 3.3% | 57.0 ± 2.5%

As observed, some methods perform comparably to the in-context learning baseline (multitask learning for StrategyQA, and iterative refinement for Sports Understanding), which demonstrates the potential of gathering ongoing human feedback on model outputs and using it to improve language models. This is different from recent work such as RLHF, where the feedback is limited to categorical and usually binary labels.

As shown in the following table, we investigate how models adapted with human feedback on reasoning mistakes can help improve the calibration or the awareness of confidently wrong explanations. This is evaluated by prompting the model to predict if its generation contains any errors.

Method | Error Correction | StrategyQA
Flan-T5 4-shot Chain-of-Thought In-Context Learning | No | 30.17%
Multitask Finetuned Model | Yes | 73.98%

In more detail, we prompt the language model with its own generated answer and reasoning chain (for which we collected feedback), and then prompt it again to predict the error in the generation. We use the appropriate instruction for the task (“Identify error in the answer”). The model is scored correctly if it predicts “no error” or “correct” in the generation if the annotators labeled the example as having no error, or if it predicts any of the error types in the generation (along with “incorrect” or “wrong”) when the annotators labeled it as having an error. Note that we don’t evaluate the model’s ability to correctly identify the error type, but rather if an error is present. The evaluation is done on a set of 173 additional examples from the StrategyQA dev set that were collected, which aren’t seen during fine-tuning. Four examples out of these are reserved for prompting the language model (first row in the preceding table).

Note that we do not show the 0-shot baseline result because the model is unable to generate useful responses. We observe that using human feedback for error correction on reasoning chains can improve the model’s prediction of whether it makes errors or not, which can improve the awareness or calibration of the wrong explanations.

Conclusion

In this post, we showed how to curate human feedback datasets with fine-grained error corrections, which is an alternative way to improve the reasoning abilities of LLMs. Experimental results corroborate that human feedback on reasoning errors can improve performance and calibration on challenging multi-hop questions.

If you’re looking for human feedback to improve your large language models, visit Amazon SageMaker Data Labeling and the Ground Truth Plus console.


About the Authors

Erran Li is the applied science manager at human-in-the-loop services, AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously he was a senior scientist at Alexa AI, the head of machine learning at Scale AI, and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber working on machine learning for autonomous driving, machine learning systems, and strategic initiatives of AI. He started his career at Bell Labs and was adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, and ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems, and adversarial machine learning. He has a PhD in computer science from Cornell University. He is an ACM Fellow and IEEE Fellow.

Nitish Joshi was an applied science intern at AWS AI, Amazon. He is a PhD student in computer science at New York University’s Courant Institute of Mathematical Sciences advised by Prof. He He. He works on Machine Learning and Natural Language Processing, and he was affiliated with the Machine Learning for Language (ML2) research group. He was broadly interested in robust language understanding: both in building models which are robust to distribution shifts (e.g. through human-in-the-loop data augmentation) and also in designing better ways to evaluate/measure the robustness of models. He has also been curious about the recent developments in in-context learning and understanding how it works.

Kumar Chellapilla is a General Manager and Director at Amazon Web Services and leads the development of ML/AI Services such as human-in-loop systems, AI DevOps, Geospatial ML, and ADAS/Autonomous Vehicle development. Prior to AWS, Kumar was a Director of Engineering at Uber ATG and Lyft Level 5 and led teams using machine learning to develop self-driving capabilities such as perception and mapping. He also worked on applying machine learning techniques to improve search, recommendations, and advertising products at LinkedIn, Twitter, Bing, and Microsoft Research.

Read More

How to extend the functionality of AWS Trainium with custom operators

Deep learning (DL) is a fast-evolving field, and practitioners are constantly innovating DL models and inventing ways to speed them up. Custom operators are one of the mechanisms developers use to push the boundaries of DL innovation by extending the functionality of existing machine learning (ML) frameworks such as PyTorch. In general, an operator describes the mathematical function of a layer in a deep learning model. A custom operator allows developers to build their own mathematical functions for a layer in the deep learning model.

AWS Trainium and AWS Inferentia2, which are purpose built for DL training and inference, extend their functionality and performance by supporting custom operators (or CustomOps, for short). AWS Neuron, the SDK that supports these accelerators, uses the standard PyTorch interface for CustomOps. Developers can easily get started with their existing code when using Trainium-based Amazon EC2 Trn1 instances or Inferentia2-based Amazon EC2 Inf2 instances. In this post, we cover the benefits of CustomOps, their efficient implementation on Trainium, and examples to get you started with CustomOps on Trainium-powered Trn1 instances.

To follow along, familiarity with core AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) is implied, and basic familiarity with deep learning, PyTorch, and C++ would be helpful.

Custom operators in PyTorch and their benefits

CustomOps for PyTorch originated in version 1.10, called PyTorch C++ Frontend, and provided an easy-to-use mechanism to register CustomOps written in C++. The following are some of the benefits that CustomOps provide:

  • Performance optimization – CustomOps can be optimized for specific use cases, leading to faster model runs and improved performance.
  • Improved model expressiveness – With CustomOps, you can express complex computations that aren’t easily expressible using the built-in operators provided by PyTorch.
  • Increased modularity – You can use CustomOps as building blocks to create more complex models by creating C++ libraries of reusable components. This makes the development process easier and more modular, and facilitates rapid experimentation.
  • Increased flexibility – CustomOps enables operations beyond the built-in operators—that is, they provide a flexible way to define complex operations that aren’t implemented using the standard ones.

Trainium support for custom operators

Trainium (and AWS Inferentia2) supports CustomOps in software through the Neuron SDK and accelerates them in hardware using the GPSIMD engine (General Purpose Single Instruction Multiple Data engine). Let’s look at how these enable efficient CustomOps implementation and provide increased flexibility and performance when developing and innovating DL models.

Neuron SDK

The Neuron SDK helps developers train models on Trainium and deploy models on the AWS Inferentia accelerators. It integrates natively with frameworks, such as PyTorch and TensorFlow, so you can continue using your existing workflows and application code to train models on Trn1 instances.

The Neuron SDK uses the standard PyTorch interface for CustomOps. Developers can use the standard programming interface in PyTorch to write CustomOps in C++ and extend Neuron’s official operator support. Neuron then compiles these CustomOps to run efficiently on the GPSIMD engine, which is described in more detail in the following section. This makes it easy to implement new experimental CustomOps and accelerate them on purpose-built hardware, without any intimate knowledge of this underlying hardware.

General Purpose Single Instruction Multiple Data engine

At the core of Trainium optimizations resides the NeuronCore architecture, a fully independent, heterogeneous compute unit with four main engines: tensor, vector, scalar, and the GPSIMD engine. The scalar and vector engines are highly parallelized and optimized for floating-point operations. The tensor engine is based on a power-optimized systolic array that supports mixed precision computation.

The GPSIMD engine is a general-purpose Single Instruction Multiple Data (SIMD) engine designed for running and accelerating CustomOps. This engine consists of eight fully programmable 512-bit wide general-purpose processors, which can run straight-line C-code and have direct inline access to the other NeuronCore-v2 engines, as well as the embedded SRAM and HBM memories. Together, these capabilities help run CustomOps efficiently on Trainium.

Take, for example, operators such as TopK, LayerNorm, or ZeroCompression, which read data from memory and use it for only a minimal number of ALU calculations. Regular CPU systems are completely memory bound for these calculations, and performance is limited by the time required to move the data into the CPU. In Trainium, the GPSIMD engines are tightly coupled with the on-chip caches using a high-bandwidth streaming interface, which can sustain 2 TB/sec of memory bandwidth. Therefore, CustomOps like these can run very fast on Trainium.

Neuron SDK custom operators in practice

For this post, we assume a DLAMI (refer to instructions for either Ubuntu or Amazon Linux) is being used to instantiate an EC2 Trn1 instance (either trn1.2xlarge or trn1.32xlarge). Note that all necessary software, drivers, and tools have already been installed on the DLAMIs, and only the activation of the Python environment is needed to start working with the tutorial. We refer to the CustomOps functionality available in Neuron as “Neuron CustomOps.”

Similar to the process of PyTorch integration with C++ code, Neuron CustomOps requires a C++ implementation of an operator via a NeuronCore-ported subset of the Torch C++ API. The C++ implementation of the operator is called the kernel function, and the port of the C++ API contains everything required for CustomOps development and model integration, specifically tensor and scalar classes in c10 (a namespace used for low-level C++ code across different PyTorch libraries) and a subset of ATen operators (the C++ tensor library that provides the core tensor operations used in PyTorch).

The torch.h header needs to be included when defining the kernel for you to have access to the NeuronCore-ported subset of the PyTorch C++ API:

#include <torch/torch.h>

Neuron CustomOps also require a shape function. The shape function has the same function signature as the kernel function, but doesn’t perform any computations. It only defines the shape of the output tensor but not the actual values.

Neuron CustomOps are grouped into libraries, and macros are used to register them with the NEURON_LIBRARY scope from within the shape function. The function will be run on the host at compilation time and will require the register.h header from the torchneuron library:

#include "torchneuron/register.h"

Finally, the custom library is built by calling the load API. If supplying the build_directory parameter, the library file will be stored in the indicated directory:

import os
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name=name,  # the name for the library (for example, 'relu')
    compute_srcs=['CustomOP.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd()
)

To use the CustomOp from a PyTorch model, simply load the library by calling the load_library API and call the Neuron CustomOp in the same manner that CustomOps are called in PyTorch via the torch.ops namespace. The format is usually torch.ops.<library_name>.<operator_name>. See the following code:

import torch
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load_library('/home/user/libmy_ops.so')
out_tensor = torch.ops.my_lib.my_op(in_tensor)

Note that the custom_op.load API builds the C++ library, whereas the custom_op.load_library API loads an already-built library file.

Example: Neuron CustomOps in MLP training

To get started, perform the following steps:

  1. Create and launch your EC2 Trn1 instance. Be sure that you use a DLAMI image (either Ubuntu or Amazon Linux, pre-installed with all necessary Neuron software) and that you have specified a root volume size of 512 GB.
  2. After your instance is up and running, SSH to your instance.
  3. Install PyTorch Neuron (torch-neuronx) on your running Trn1 instance. For instructions, refer to Neuron Custom C++ Operators in MLP Training.
  4. Download the sample code from the GitHub repository.

Now that your environment is set up, continue through this post as we describe the implementation of a typical C++ CustomOp in Neuron in the form of Relu forward and backward functions to be used on a simple multilayer perceptron (MLP) model. The steps are described in the AWS Neuron Documentation.

The example code from the repository shows two folders:

  • ./customop_mlp/PyTorch – Contains the Relu code that will be compiled for a CPU
  • ./customop_mlp/neuron – Contains the Relu code that will be compiled for Trainium

Develop a Neuron CustomOp: The kernel function

The host or dev environment for the development of the kernel function (the Neuron CustomOp) can run PyTorch 1.13 and a C++17 compatible compiler in a Linux environment. This is the same as developing any C++ function for PyTorch, and the only libraries that need to be present in the development environment are those for PyTorch and C++. In the following example, we create a relu.cpp file with the custom Relu forward and backward functions:

#include <stdint.h>
#include <stdlib.h>
#include <torch/torch.h>

torch::Tensor relu_forward(const torch::Tensor& t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    auto t_in_acc = t_in.accessor<float, 2>();
    auto t_out_acc = t_out.accessor<float, 2>();
    auto shape = t_in.sizes();
    for (int i = 0; i < shape[0]; i++) {
        for (int j = 0; j < shape[1]; j++) {
            t_out_acc[i][j] = t_in_acc[i][j] > 0.0 ? t_in_acc[i][j] : 0.0;
        }
    }
    return t_out;
}

torch::Tensor relu_backward(const torch::Tensor& t_grad, const torch::Tensor& t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    auto t_in_acc = t_in.accessor<float, 2>();
    auto t_grad_acc = t_grad.accessor<float, 2>();
    auto t_out_acc = t_out.accessor<float, 2>();
    auto shape = t_in.sizes();
    for (int i = 0; i < shape[0]; i++) {
        for (int j = 0; j < shape[1]; j++) {
            t_out_acc[i][j] = t_in_acc[i][j] > 0.0 ? t_grad_acc[i][j] : 0.0;
        }
    }
    return t_out;
}

When developing a Neuron CustomOp for Neuron, make sure you take into account the currently supported features and APIs. For more information, refer to Custom Operators API Reference Guide [Experimental].

Build and register the Neuron CustomOp: The shape function

The build and runtime environment for the Neuron CustomOp is the Trn1 instance where the training takes place. There, the Neuron CustomOp is compiled and registered as a neuronx-cc library and interpreted by the Neuron runtime to run on the highly optimized GP-SIMD engine.

To build and register the Neuron CustomOp, we need to create a shape function (shape.cpp) that will define the input and output tensors and register the operators: the relu_fwd_shape and relu_bwd_shape functions. See the following code:

#include <stdint.h>
#include <stdlib.h>
#include <torch/torch.h>
#include "torchneuron/register.h"

torch::Tensor relu_fwd_shape(torch::Tensor t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    return t_out;
}

torch::Tensor relu_bwd_shape(torch::Tensor t_grad, torch::Tensor t_in) {
    torch::Tensor t_out = torch::zeros(t_in.sizes(), torch::kFloat);
    return t_out;
}

NEURON_LIBRARY(my_ops, m) {
    m.def("relu_forward", &relu_fwd_shape, "relu_forward");
    m.def("relu_backward", &relu_bwd_shape, "relu_backward");
}

The relu_fwd_shape and relu_bwd_shape functions define the shape of the output tensor (to be the same size as the input tensor). Then we register the functions in the NEURON_LIBRARY scope.

In the ./customop_mlp/neuron repository example, we have a build.py script to run the build and registration of the CustomOp by simply calling the load function from the torch_neuronx.xla_impl package:

import os
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load(
    name='relu',
    compute_srcs=['relu.cpp'],
    shape_srcs=['shape.cpp'],
    build_directory=os.getcwd()
)

In the build_directory, we should find the librelu.so library ready to be loaded and used in training our model.

Build the MLP model with the Neuron CustomOp

In this section, we go through the steps to build the MLP model with the Neuron CustomOp.

Define the Relu class

For a detailed explanation of how to train an MLP model, refer to Multi-Layer Perceptron Training Tutorial.

After we build the CustomOp, we create a Python module called my_ops.py, where we define a Relu PyTorch class that inherits from torch.autograd.Function. The autograd function implements automatic differentiation, so that it can be used in a training loop.

First we load the librelu.so library, then we define the new class with the forward and backward functions defined with static method decorators. In this way, the methods can be called directly when we define the model. See the following code:

import torch
import torch_neuronx
from torch_neuronx.xla_impl import custom_op

custom_op.load_library('librelu.so')

class Relu(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return torch.ops.my_ops.relu_forward(input)

    @staticmethod
    def backward(ctx, grad):
        input, = ctx.saved_tensors
        return torch.ops.my_ops.relu_backward(grad, input), None

Examine the MLP model

Now we’re ready to write our multilayer perceptron model with our Neuron CustomOp by importing the my_ops package where we have defined the Relu class:

import torch
import torch.nn as nn
from torch.nn import functional as F
import my_ops

# Declare 3-layer MLP for MNIST dataset
class MLP(nn.Module):
    def __init__(self, input_size=28 * 28, output_size=10, layers=[120, 84]):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, layers[0])
        self.fc2 = nn.Linear(layers[0], layers[1])
        self.fc3 = nn.Linear(layers[1], output_size)

    def forward(self, x):
        f1 = self.fc1(x)
        r1 = my_ops.Relu.apply(f1)
        f2 = self.fc2(r1)
        r2 = my_ops.Relu.apply(f2)
        f3 = self.fc3(r2)
        return torch.log_softmax(f3, dim=1)

Run the training script

Now we can train our model by using the provided train.py script:

import os
import time
import torch
from model import MLP

from torchvision.datasets import mnist
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

# XLA imports
import torch_xla.core.xla_model as xm

# Global constants
EPOCHS = 4
WARMUP_STEPS = 2
BATCH_SIZE = 32

# Load MNIST train dataset
train_dataset = mnist.MNIST(root='./MNIST_DATA_train',
                            train=True, download=True, transform=ToTensor())

def main():
    # Prepare data loader
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE)

    # Fix the random number generator seeds for reproducibility
    torch.manual_seed(0)

    # XLA: Specify XLA device (defaults to a NeuronCore on Trn1 instance)
    device = 'xla'

    # Move model to device and declare optimizer and loss function
    model = MLP().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.NLLLoss()

    # Run the training loop
    print('----------Training ---------------')
    model.train()
    for epoch in range(EPOCHS):
        start = time.time()
        for idx, (train_x, train_label) in enumerate(train_loader):
            optimizer.zero_grad()
            train_x = train_x.view(train_x.size(0), -1)
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            output = model(train_x)
            loss = loss_fn(output, train_label)
            loss.backward()
            optimizer.step()
            xm.mark_step()  # XLA: collect ops and run them in XLA runtime
            if idx < WARMUP_STEPS:  # skip warmup iterations
                start = time.time()

    # Compute statistics for the last epoch
    interval = idx - WARMUP_STEPS  # skip warmup iterations
    throughput = interval / (time.time() - start)
    print("Train throughput (iter/sec): {}".format(throughput))
    print("Final loss is {:0.4f}".format(loss.detach().to('cpu')))

    # Save checkpoint for evaluation
    os.makedirs("checkpoints", exist_ok=True)
    checkpoint = {'state_dict': model.state_dict()}
    # XLA: use xm.save instead of torch.save to ensure states are moved back to cpu
    # This can prevent "XRT memory handle not found" at end of test.py execution
    xm.save(checkpoint, 'checkpoints/checkpoint.pt')

    print('----------End Training ---------------')

if __name__ == '__main__':
    main()

By sending the model to the XLA device, the model and the Relu custom operator are compiled to be run by the Neuron runtime using the optimized Trainium hardware.

In this example, we showed how to create a custom Relu operator that takes advantage of the hardware engine (GP-SIMD) available on the Trainium ML accelerator chip. The result is a trained PyTorch model that can now be deployed for inferencing.
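To sanity-check the saved model, you can reload the checkpoint and run a forward pass on the XLA device. The following is a minimal sketch, not the test.py script from the repository; it assumes you run it from the example’s build directory so that model.py, my_ops.py, and librelu.so from the preceding steps can be found.

import torch
import torch_xla.core.xla_model as xm
from model import MLP  # the MLP definition shown earlier

# Load the checkpoint written by train.py (xm.save stores CPU tensors via torch.save)
checkpoint = torch.load('checkpoints/checkpoint.pt')

model = MLP()
model.load_state_dict(checkpoint['state_dict'])
model.eval()

# Move the model to the XLA device (a NeuronCore on Trn1) and run a single forward pass
device = 'xla'
model = model.to(device)
with torch.no_grad():
    sample = torch.rand(1, 28 * 28).to(device)  # one flattened MNIST-sized input
    log_probs = model(sample)
    xm.mark_step()  # XLA: run the collected ops on Neuron

print(log_probs.to('cpu'))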

Conclusion

Modern state-of-the-art model architectures require an increasing number of resources from engineering staff (data scientists, ML engineers, MLOps engineers, and others) to actual infrastructure including storage, compute, memory, and accelerators. These requirements increase the cost and complexity of developing and deploying deep learning models. Trainium accelerators deliver a high-performance, low-cost solution for DL training in the cloud. The use of Trainium is facilitated by the Neuron SDK, which includes a deep learning compiler, runtime, and tools that are natively integrated into popular frameworks such as PyTorch and TensorFlow. (Note that at the time of writing, the Neuron SDK 2.9 only supports PyTorch for the development of custom operators.)

As demonstrated in this post, Trainium not only provides the means to train your models performantly and efficiently, but also offers the ability to customize your operators to add flexibility and expressiveness to both training and experimentation.

For more information, refer to the GitHub repo.


About the Authors

Lorea Arrizabalaga is a Solutions Architect aligned to the UK Public Sector, where she helps customers design ML solutions with Amazon SageMaker. She is also part of the Technical Field Community dedicated to hardware acceleration and helps with testing and benchmarking AWS Inferentia and AWS Trainium workloads.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.

Read More

Deliver your first ML use case in 8–12 weeks

Deliver your first ML use case in 8–12 weeks

Do you need help to move your organization’s Machine Learning (ML) journey from pilot to production? You’re not alone. Most executives think ML can apply to any business decision, but on average only half of the ML projects make it to production.

This post describes how to implement your first ML use case using Amazon SageMaker in just 8–12 weeks by leveraging a methodology called Experience-based Acceleration (EBA).

Challenges

Customers may face several challenges when implementing machine learning (ML) solutions.

  • You may struggle to connect your ML technology efforts to your business value proposition, making it difficult for IT and business leadership to justify the investment it requires to operationalize models.
  • You may often select low-value use cases as proof of concept rather than solving a meaningful business or customer problem.
  • You may have gaps in skills and technologies, including operationalizing ML solutions, implementing ML services, and managing ML projects for rapid iterations.
  • Ensuring data quality, governance, and security may slow down or stall ML projects.

Solution overview: Machine Learning Experience-based Acceleration (ML EBA)

Machine learning EBA is a 3-day, sprint-based, interactive workshop (called a party) that uses SageMaker to accelerate business outcomes by guiding you through an accelerated and prescriptive ML lifecycle. It starts with identifying business goals and ML problem framing, and takes you through data processing, model development, production deployment, and monitoring.

The following visual illustrates a sample ML lifecycle.

Sample Machine Learning Lifecycle

Two primary customer scenarios apply. The first is using low-code or no-code ML services such as Amazon SageMaker Canvas, Amazon SageMaker Data Wrangler, Amazon SageMaker Autopilot, and Amazon SageMaker JumpStart to help data analysts prepare data, build models, and generate predictions. The second is using SageMaker to help data scientists and ML engineers build, train, and deploy custom ML models.

We recognize that customers have different starting points. If you’re starting from scratch, it’s often simpler to begin with low-code or no-code solutions and gradually transition to developing custom models. In contrast, if you have an existing on-premises ML infrastructure, you can begin directly by using SageMaker to alleviate challenges with your current solution.

Through ML EBA, experienced AWS ML subject matter experts work side by side with your cross-functional team to provide prescriptive guidance, remove blockers, and build organizational capability for a continued ML adoption. This party steers you to solve a compelling business problem as opposed to thinking in terms of data and ML technology environments. Additionally, the party gets you started on driving material business value from untapped data.

ML EBA helps you to think big, start small, and scale fast. Although it creates a minimum viable ML model in 3 days, there are 4–6 weeks of preparation leading up to the EBA. Furthermore, you spend 4–6 weeks post-EBA to fine-tune the model with additional feature engineering and hyperparameter optimization before production deployment.

Let’s dive into what the whole process looks like and how you can use the ML EBA methodology to address the common blockers.

EBA prep (4–6 weeks)

In this section, we detail the 4–6 weeks of preparation leading up to the EBA.

6 weeks before the party: Problem framing and qualification

The first step is to frame and qualify the ML problem, which includes the following:

  • Identify the right business outcome – You must have a clear understanding of the problem you are trying to solve and the desired outcome you hope to achieve through the use of ML. You must be able to measure the business value gained against specific objectives and success criteria. Furthermore, you must be able to identify what should be observed, and what should be predicted. AWS works with you to help answer the following important questions before embarking on the ML EBA:
    • Does the ML use case solve a meaningful business problem?
    • Is it important enough to get the attention of business leadership?
    • Do you already have data to solve the ML use case?
    • Can the use case eventually be operationalized into production?
    • Does it really require ML?
    • Are there organizational processes in place for the business to use the model’s output?

The AI Use Case Explorer is a good starting point to explore the right use cases by industry, business function, or desired business outcome and discover relevant customer success stories.

  • Executive sponsorship – To help you move faster than you would have organically, AWS meets with the executive sponsor to confirm buy-in, remove internal obstacles, and commit resources. Additionally, AWS can offer financial incentives to help offset the costs for your first ML use case.
  • Meeting you where you are at in your ML journey – AWS assesses your current state—people, process, and technology. We help you detail requirements and dependencies; specifically, what teams and data are required to begin the journey successfully. Additionally, we provide recommendations on the technical path: starting with low-code or no-code services, or building a custom model using SageMaker.

5 weeks before the party: Workstream configuration and transition into action

The next step is to identify the teams needed to support the EBA effort. Commonly, the work is split up between the following workstreams:

  • Cloud engineering (infrastructure and security) – Focuses on verifying that the AWS accounts and infrastructure are set up and secure ahead of EBA. This includes AWS Identity and Access Management (IAM) or single sign-on (SSO) access, security guardrails, Amazon SageMaker Studio provisioning, automated stop/start to save costs, and Amazon Simple Storage Service (Amazon S3) set up.
  • Data engineering – Identifies the data sources, sets up data ingestion and pipelines, and prepares data using Data Wrangler.
  • Data science – The heart of ML EBA and focuses on feature engineering, model training, hyperparameter tuning, and model validation.
  • MLOps engineering – Focuses on automating the DevOps pipelines for operationalizing the ML use case. This may often be the same team as cloud engineering.
  • Leadership team – Responsible for orchestrating the effort, removing blockers, aligning with the executive sponsors, and is ultimately accountable for delivering the expected outcomes.

After these efforts have been completed, we must transition into action. A standard baseline 4-week timeline should be strictly adhered to in order to make sure the EBA stays on track. Experienced AWS subject matter experts will guide and coach you through this preparation leading up to the EBA party.

4 weeks before the party: Inspire builders and curate a technical plan

Every customer is different; AWS helps you curate a technical plan of activities to be completed in the next 4 weeks leading up to the party.

AWS conducts Immersion Days to inspire your builders and build momentum for the party. An Immersion Day is a half or full day workshop with the right mix of presentation, hands-on labs, and Q&A to introduce AWS services or solutions. AWS will help you select the right Immersion Days from the AI/ML Workshops catalog.

We recognize that every builder in your organization is at a different level. We recommend that your builders use the ML ramp-up guide resources or digital or classroom training to start where they are at and build the necessary skills for the party.

3 weeks before the party: Tech prep focused on cloud and data engineering

Your cloud and data engineering teams should work on the following with guidance from AWS:

  • Create AWS accounts with network and security set up
  • Set up Amazon SageMaker Studio
  • Create Amazon S3 buckets to store data (a minimal setup sketch follows this list)
  • Identify data sources (or producers)
  • Integrate external sources to dump data into S3 buckets
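The following is a minimal sketch of the Amazon S3 portion of this preparation using Boto3. The bucket name and Region are placeholders for illustration; align them with your organization’s naming and security standards.

import boto3

region = "us-east-1"                        # placeholder Region
bucket_name = "my-company-ml-eba-raw-data"  # placeholder bucket name

s3 = boto3.client("s3", region_name=region)

# us-east-1 does not accept a LocationConstraint; other Regions require one
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Block public access and enable default encryption for the training data bucket
s3.put_public_access_block(
    Bucket=bucket_name,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
s3.put_bucket_encryption(
    Bucket=bucket_name,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)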

2 weeks before the party: Tech prep focused on data science

Your data science team should carry out its technical preparation with guidance from AWS.

1 week before the party: Assess readiness (go/no-go)

AWS works with you to assess go/no-go readiness for technical activities, skills, and momentum for the party. Then we solidify the scope for the 3-day party, prioritizing progress over perfection.

EBA (3-day party)

Although the EBA party itself is customized for your organization, the recommended agenda for the 3 days, organized by workstream, is as follows. You will learn by doing during the EBA with guidance from AWS subject matter experts.

  • Data Science:
    • Day 1 – AM: Try Autopilot or JumpStart models. PM: Pick 1–2 models based on the Autopilot outcomes to experiment with further.
    • Days 2–3 – Improve model accuracy through in-depth feature engineering (for example, PCA) and hyperparameter optimization (HPO), perform quality assurance and validation with test data, deploy to production (inference endpoint), and set up monitoring (model and data drift).
  • Data Engineering:
    • Explore using a feature store for future ML use cases, and create a backlog of items for data governance and associated guardrails.
  • Cloud/MLOps Engineering:
    • Day 1 – Evaluate the MLOps framework solution library and assess whether it can be used for a repeatable MLOps framework. Identify gaps and create a backlog of items to enhance the solution library or create your own MLOps framework.
    • Day 2 – Implement backlog items to create a repeatable MLOps framework.
    • Day 3 – Continue implementing backlog items to create a repeatable MLOps framework.

Post-EBA

ML involves extensive experimentation, and it’s common to not reach your desired model accuracy during the 3-day EBA. Therefore, creating a well-defined backlog or path to production is essential, including improving model accuracy through experimentation, feature engineering, hyperparameter optimization, and production deployment. AWS will continue to assist you through production deployment.

Conclusion

By complementing ML EBA methodology with SageMaker, you can achieve the following results:

  • Move from pilot to production value in 8-12 weeks – Bring together business and technology teams to deploy the first ML use case to production in 8-12 weeks.
  • Build the organizational capability to speed up and scale ML across lines of business – The ML EBA inspires and up-skills builders with real work experience. It establishes a successful working model (a collaboration and iteration model) to sustain and scale ML initiatives across lines of business. It also creates reusable assets to speed up and scale ML in a repeatable way.
  • Reduce technical debt, pain points, and cost from existing on-premises ML models – The on-premises solutions may have challenges related to higher costs, inability to scale infrastructure, undifferentiated infrastructure management, and lack of advanced feature sets such as hyperparameter optimization, explainability for predictions, and more. Adoption of AWS ML services such as SageMaker reduces these issues.

Contact your AWS account team (Account Manager or Customer Solutions Manager) to learn more and get started.


About the Authors

Ritesh Shah is Senior Customer Solutions Manager at Amazon Web Services. He helps large US-Central enterprises accelerate their cloud-enabled transformation and build modern cloud-native solutions. He is passionate about accelerating customers’ ML journeys. In his free time, Ritesh enjoys spending time with his daughter, cooking, and learning something new, while also evangelizing cloud and ML. Connect with him on LinkedIn.

Nicholaus Lawson is a Solution Architect at AWS and part of the AIML specialty group. He has a background in software engineering and AI research. Outside of work, Nicholaus is often coding, learning something new, or woodworking. Connect with him on LinkedIn.

Read More

Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes

Run your local machine learning code as Amazon SageMaker Training jobs with minimal code changes

We recently introduced a new capability in the Amazon SageMaker Python SDK that lets data scientists run their machine learning (ML) code authored in their preferred integrated development environment (IDE) and notebooks along with the associated runtime dependencies as Amazon SageMaker training jobs with minimal code changes to the experimentation done locally. Data scientists typically carry out several iterations of experimentation in data processing and training models while working on any ML problem. They want to run this ML code and carry out the experimentation with ease of use and minimal code change. Amazon SageMaker Model Training helps data scientists run fully managed large-scale training jobs on AWS’s compute infrastructure. SageMaker Training also helps data scientists with advanced tools such as Amazon SageMaker Debugger and Profiler to debug and analyze their large-scale training jobs.

For customers with small budgets, small teams, and tight timelines, every single new concept and line of code rewritten to run on SageMaker makes them less productive towards their core tasks, namely data processing and training ML models. They want to write code once in the framework of their choice and be able to move seamlessly from running code in their notebooks or laptops to running code at scale using SageMaker capabilities.

With this new capability of the SageMaker Python SDK, data scientists can onboard their ML code to the SageMaker Training platform in a few minutes. You just need to add a single line of code to your ML code, and SageMaker intelligently comprehends your code along with the datasets and workspace environment setup and runs it as a SageMaker Training job. You can then take advantage of the key capabilities of the SageMaker Training platform, like the ability to scale jobs easily, and other associated tools like Debugger and Profiler. In this release, you can run your local ML Python code as a single-node Amazon SageMaker training job or as multiple parallel jobs. Distributed training jobs (across multiple nodes) are not supported by remote functions.

In this post, we show you how to use this new capability to run local ML code as a SageMaker Training job.

Solution overview

You can now run your ML code written in your IDE or notebook as a SageMaker Training job by annotating the function, which acts as an entry point to the user’s code base, with a simple decorator. Upon invocation, this capability automatically takes a snapshot of all the associated variables, functions, packages, environment variables, and other runtime requirements from your ML code, serializes them, and submits them as a SageMaker Training job. It integrates with the recently announced SageMaker Python SDK feature for setting default values for parameters. This capability simplifies the SageMaker constructs that you need to learn to be able to run code using SageMaker Training. Data scientists can write, debug, and iterate their code in any preferred IDE (such as Amazon SageMaker Studio, notebooks, VS Code, or PyCharm). When ready, you can annotate your Python function with the @remote decorator and run it as a SageMaker job at scale.

This capability takes familiar open-source Python objects as arguments and outputs. Furthermore, you don’t need to understand container lifecycle management and can simply run your workloads across different compute contexts (such as a local IDE, Studio, or training jobs) with minimal configuration overheads. To run any local code as a SageMaker Training job, this capability infers the configurations required to run jobs, such as the AWS Identity and Access Management (IAM) role, encryption key, and network configuration, from the Studio or IDE settings (which can be the default settings) and passes them to the platform by default. You have the flexibility to customize your runtime in the SageMaker managed infrastructure using the inferred configuration or override them at the SDK-level by passing them as arguments to the decorator.

This new capability of the SageMaker Python SDK transforms your ML code in an existing workspace environment, along with any associated data processing code and datasets, into a SageMaker Training job. This capability looks for ML code wrapped inside a @remote decorator, whether it was written in Studio or in a local IDE such as PyCharm, and automatically translates it into a SageMaker Training job.

In the following sections, we walk through the features of this new capability and how to launch Python functions as SageMaker Training jobs.

Prerequisites

To use this new SageMaker Python SDK capability and run the code associated with this post, you need the following prerequisites:

  • An AWS account that will contain all your AWS resources
  • An IAM role to access SageMaker
  • Access to Studio or a SageMaker notebook instance or an IDE such as PyCharm

Use the SDK from Studio and SageMaker notebooks

You can use this capability from Studio by launching a notebook and wrapping your code with a @remote decorator inside the notebook. You first need to import the remote function using the following code:

from sagemaker.remote_function import remote

When you use the decorator, this capability automatically interprets the decorated function in your code and runs it as a SageMaker Training job.

You can also use this capability from a SageMaker notebook instance. You first need to start a notebook instance, open Jupyter or Jupyter Lab on it, and launch a notebook. Then import the remote function as shown in the preceding code and wrap your code with the @remote decorator. We include an example of how to use the decorator function and the associated settings later in this post.

Use the SDK from your local environment

You can also use this capability from your local IDE. As a prerequisite, you must have the AWS Command Line Interface (AWS CLI), SageMaker Python SDK, and AWS SDK for Python (Boto3) installed in your local environment. You need to import these libraries in your code, set the SageMaker session, specify settings, and decorate your function with the @remote decorator. In the following example code, we run a simple divide function as a SageMaker Training job:

import boto3
import sagemaker
from sagemaker.remote_function import remote

sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
settings = dict(
    sagemaker_session=sm_session,
    role=<IAM_ROLE_NAME>,  # replace with the name of your SageMaker execution role
    instance_type="ml.m5.xlarge",
)
@remote(**settings)
def divide(x, y):
    return x / y
if __name__ == "__main__":
    print(divide(2, 3.0))

We can use a similar methodology to run advanced functions as training jobs, as shown in the next section.

Launch Python functions as SageMaker jobs

The new SageMaker Python SDK feature allows you to run Python functions as SageMaker Training jobs. Any Python code, including ML training code developed by data scientists using their preferred local IDEs (PyCharm, VS Code), SageMaker notebooks, or Studio notebooks, can be launched as a managed SageMaker job.

In ML workloads using this capability, associated datasets, dependencies, and workspace environment setups are serialized along with the ML code and run as a SageMaker job, either synchronously or asynchronously.

You can add a @remote decorator annotation to any Python code including a local ML processing or training function to launch it as a managed SageMaker Training job, thereby taking advantage of the scale, performance, and cost benefits of SageMaker. This can be achieved with minimal code changes by adding a decorator to the Python function code. Invocation to the decorated function is run synchronously, and the function run waits until the SageMaker job is complete.

In the following example, we use the @remote decorator to launch SageMaker jobs in decorator mode using an ml.m5.large instance. SageMaker uses training jobs to launch this function as a managed job.

import numpy as np
from sagemaker.remote_function import remote

@remote(instance_type="ml.m5.large")
def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0], [0, 1]])
b = np.array([1, 2])

assert (matrix_multiply(a, b) == np.array([1, 2])).all()

You can also use decorator mode to launch SageMaker jobs with Python packages and dependencies. You can include environment variables such as VPC, subnets, and security groups to launch SageMaker training jobs in the environment.yml file. This allows ML engineers and admins to configure these environment variables so data scientists can focus on ML model building and iterate faster. See the following code:

from sagemaker.remote_function import remote

@remote(instance_type="ml.g4dn.xlarge", dependencies="./environment.yml")
def train_hf_model(
    train_input_path, test_input_path, s3_output_path=None,
    *, epochs=1, train_batch_size=32, eval_batch_size=64,
    warmup_steps=500, learning_rate=5e-5
):
    model_name = "distilbert-base-uncased"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    ... <TRUNCATED>
    return os.path.join(s3_output_path, model_dir), eval_result

You can use RemoteExecutor to launch Python functions as SageMaker jobs asynchronously. The executor asynchronously polls SageMaker Training jobs to update the status of the job. The RemoteExecutor class is an implementation of the concurrent.futures.Executor, which is used to submit SageMaker Training jobs asynchronously. See the following code:

from sagemaker.remote_function import RemoteExecutor

def train_hf_model(
    train_input_path,test_input_path,s3_output_path = None,
    *,epochs = 1, train_batch_size = 32, eval_batch_size = 64,
    warmup_steps = 500,learning_rate = 5e-5
    ):  
    model_name = "distilbert-base-uncased"
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    ...<TRUNCATED>
    return os.path.join(s3_output_path, model_dir), eval_result


with RemoteExecutor(instance_type="ml.g4dn.xlarge", dependencies='./requirements.txt') as e:
    future = e.submit(train_hf_model, train_input_path, test_input_path, s3_output_path,
                      epochs, train_batch_size, eval_batch_size, warmup_steps, learning_rate)

Customize the runtime environment

Decorator mode and RemoteExecutor allow you to define and customize the runtime environments for the SageMaker job. The runtime dependencies, including Python packages and environment variables for SageMaker jobs, can be specified to customize the runtime. In order to run local Python code as SageMaker managed jobs, the Python package and dependencies need to be made available to SageMaker. ML engineers or data science administrators can configure networking and security configurations such as VPC, subnets, and security groups for SageMaker jobs, so data scientists can use these centrally managed configurations while launching SageMaker jobs. You can use either a requirements.txt file or a Conda environment.yaml file.

When dependencies are defined with requirements.txt, the packages will be installed using pip in the job runtime. If the image used for running the job comes with Conda environments, packages will be installed in the Conda environment declared for use with jobs. The following code shows an example requirements.txt file:

datasets
transformers
torch
scikit-learn
s3fs==0.4.2
sagemaker>=2.148.0

You can pass your Conda environment.yaml file to create the Conda environment you would like your code to run in during the training job. If the image used for running the job declares a Conda environment to run the code under, we will update the declared Conda environment with the given specification. The following code is an example of a Conda environment.yaml file:

name: sagemaker_example
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - pip:
      - sagemaker

Alternatively, you can set dependencies="auto_capture" to let the SageMaker Python SDK capture the installed dependencies in the active Conda environment. You must have an active Conda environment for auto_capture to work. Note that there are prerequisites for auto_capture to work; we recommend that you pass in your dependencies as a requirements.txt or Conda environment.yml file as described earlier in this section.
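As an illustration, the following sketch shows what an auto_capture-based decorator might look like. The function itself is hypothetical, and it assumes you invoke it from a client with an active Conda environment that already contains the packages the function needs.

from sagemaker.remote_function import remote

# dependencies="auto_capture" tells the SDK to replicate the packages installed
# in the currently active Conda environment inside the training job's runtime
@remote(instance_type="ml.m5.xlarge", dependencies="auto_capture")
def standardize(values):
    import numpy as np  # resolved in the job runtime from the captured environment
    arr = np.array(values, dtype=float)
    return ((arr - arr.mean()) / arr.std()).tolist()

print(standardize([1, 2, 3, 4]))  # runs as a SageMaker Training job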

For more details, refer to Run your local code as a SageMaker Training job.

Configurations for SageMaker jobs

Infrastructure-related settings can be offloaded to a configuration file that admin users could help set up; you only need to set it up one time. Infrastructure settings cover the network configuration, IAM roles, the Amazon Simple Storage Service (Amazon S3) folder for input and output data, and tags. Refer to Configuring and using defaults with the SageMaker Python SDK for more details. The following is a sample configuration file:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: path/to/requirements.txt
        EnvironmentVariables: {"EnvVarKey": "EnvVarValue"}
        ImageUri: 366666666666.dkr.ecr.us-west-2.amazonaws.com/my-image:latest
        InstanceType: ml.m5.large
        RoleArn: arn:aws:iam::366666666666:role/MyRole
        S3KmsKeyId: somekmskeyid
        S3RootUri: s3://my-bucket/my-project
        SecurityGroupIds:
          - sg123
        Subnets:
          - subnet-1234
        Tags:
          - {"Key": "someTagKey", "Value": "someTagValue"}
        VolumeKmsKeyId: somekmskeyid

Implementation

Deep learning models like PyTorch or TensorFlow can also be run within Studio by running the code as a training job within the notebook. To showcase this capability in Studio, you can clone this repo into your Studio and run the notebook located in the GitHub repository.

This example demonstrates an end-to-end binary text classification use case. We are using the Hugging Face transformers and datasets library to fine-tune a pre-trained transformer on binary text classification. In particular, the pre-trained model will be fine-tuned using the IMDb dataset.

When you clone the repository, you should locate the following files:

  • config.yaml – Most of the decorator arguments can be offloaded to the configuration file in order to separate out the infrastructure-related settings from the code base
  • huggingface.ipynb – This contains the code to train a pre-trained HuggingFace model, which will be fine-tuned using the IMDB dataset
  • requirements.txt – This file contains all the dependencies to run the function that will be used in this notebook for running the code and running the training remotely in a GPU instance as a training job

When you open the notebook, you will be prompted to set up the notebook environment. You can select the Data Science 3.0 image with the Python 3 kernel and ml.m5.large as the fast launch instance type for running the notebook code. This instance type is significantly faster in spinning up an environment.

The training job will be run in an ml.g4dn.xlarge instance as defined in the config.yaml file:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        # role arn is not required if in SageMaker Notebook instance or SageMaker Studio
        # Uncomment the following line and replace with the right execution role if in a local IDE
        # RoleArn: <IAM_ROLE_ARN>
        InstanceType: ml.g4dn.xlarge
        Dependencies: ./requirements.txt

The requirements.txt file dependencies to run the function for training the Hugging Face model include the following:

datasets
transformers
torch
scikit-learn
# lock s3fs to this specific version as more recent ones introduce dependency on aiobotocore, which is not compatible with botocore
s3fs==0.4.2
sagemaker>=2.148.0,<3

The Hugging Face notebook showcases how to run the training remotely via the @remote function, which is run synchronously. Therefore, the function run for training the model will wait until the SageMaker Training job is complete. The training will be run remotely with a GPU instance wherein the instance type is defined in the preceding configuration file.

from sagemaker.remote_function import remote

@remote(s3_root_uri=s3_root_folder, keep_alive_period_in_seconds=600)
def train_hf_model(
    train_input_path,
    test_input_path,
    s3_output_path = None,
    *,
    epochs = 1,
    train_batch_size = 32,
    eval_batch_size = 64,
    warmup_steps = 500,
    learning_rate = 5e-5
):  
    model_dir = 'model'

    train_dataset = load_from_disk(train_input_path, keep_in_memory=True)
    test_dataset = load_from_disk(test_input_path, keep_in_memory=True)
    
    model_name = 'distilbert-base-uncased'
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    
    training_args = TrainingArguments(
        output_dir=model_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
        warmup_steps=warmup_steps,
        evaluation_strategy="epoch",
        logging_dir="logs/",
        learning_rate=float(learning_rate),
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )
    
    print("Starting model training..")
    trainer.train()
        
    trainer.save_model(model_dir)

After you run the training job, you can run the rest of the cells in the notebook to inspect the evaluation metrics and classify the text on our trained model.
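For reference, launching the training job from the notebook is just a normal function call. The following is a hedged sketch of such a cell; the S3 path values are placeholders for the locations produced by the earlier data preparation cells in huggingface.ipynb.

# Placeholder S3 locations; in the notebook these come from earlier cells
train_input_path = "s3://<bucket>/imdb/train"
test_input_path = "s3://<bucket>/imdb/test"
s3_output_path = "s3://<bucket>/output"

# Calling the decorated function launches a SageMaker Training job on the
# ml.g4dn.xlarge instance defined in config.yaml and blocks until it completes
train_hf_model(
    train_input_path,
    test_input_path,
    s3_output_path,
    epochs=1,
    train_batch_size=32,
)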

You can also view the training job status that got remotely triggered in the GPU instance on the SageMaker dashboard by navigating back to the SageMaker console.

As soon as the training job is complete, it continues to run the instructions in the notebook for evaluation and classification. Similar jobs can be trained and run via the remote executor function embedded within Studio notebooks to carry out the runs asynchronously.

Integration with SageMaker experiments inside a @remote function

You can pass your experiment name, run name, and other parameters into your remote function to create a SageMaker experiments run. The following code example imports the experiment name, the name of the run, and the parameters to log for each run:

from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run

# Define your remote function
@remote
def train(value_1, value_2, exp_name, run_name):
    ...
    # Create the experiment
    with Run(
        experiment_name=exp_name,
        run_name=run_name,
        sagemaker_session=sagemaker_session
    ) as run:
        ...
        # Define values for the parameters to log
        run.log_parameter("param_1", value_1)
        run.log_parameter("param_2", value_2)
        ...
        # Define metrics to log
        run.log_metric("metric_a", 0.5)
        run.log_metric("metric_b", 0.1)

# Invoke your remote function
train(1.0, 2.0, "my-exp-name", "my-run-name")

In the preceding example, the parameters param_1 and param_2 are logged for the run; common parameters may include batch size or epochs. The metrics metric_a and metric_b are logged over time inside a training loop; common metrics may include accuracy or loss. For more information, see Create an Amazon SageMaker Experiment.

Conclusion

In this post, we introduced a new SageMaker Python SDK capability that enables data scientists to run their ML code in their preferred IDE as SageMaker Training jobs. We discussed the prerequisites needed to use this capability along with its features. We also showed how to use this capability in Studio, SageMaker notebook instances, and your local IDE. In addition, we provided sample code examples to demonstrate how to use this capability. As a next step, we recommend trying this capability in your IDE or SageMaker by following the code examples referenced in this post.


About the Authors

Dipankar Patro is a Software Development Engineer at AWS SageMaker, innovating and building MLOps solutions to help customers adopt AI/ML solutions at scale. He has an MS in Computer Science and his areas of interest are Computer Security, Distributed Systems and AI/ML.

Farooq Sabir is a Senior Artificial Intelligence and Machine Learning Specialist Solutions Architect at AWS. He holds PhD and MS degrees in Electrical Engineering from the University of Texas at Austin and an MS in Computer Science from Georgia Institute of Technology. He has over 15 years of work experience and also likes to teach and mentor college students. At AWS, he helps customers formulate and solve their business problems in data science, machine learning, computer vision, artificial intelligence, numerical optimization, and related domains. Based in Dallas, Texas, he and his family love to travel and go on long road trips.

Manoj Ravi is a Senior Product Manager for Amazon SageMaker. He is passionate about building next-gen AI products and works on software and tools to make large-scale machine learning easier for customers. He holds an MBA from Haas School of Business and a Masters in Information Systems Management from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.

Shikhar Kwatra is an AI/ML Specialist Solutions Architect at Amazon Web Services, working with a leading Global System Integrator. He has earned the title of one of the Youngest Indian Master Inventors with over 500 patents in the AI/ML and IoT domains. Shikhar aids in architecting, building, and maintaining cost-efficient, scalable cloud environments for the organization, and supports the GSI partner in building strategic industry solutions on AWS. Shikhar enjoys playing guitar, composing music, and practicing mindfulness in his spare time.

Vikram Elango is a Sr. AI/ML Specialist Solutions Architect at AWS, based in Virginia, US. He is currently focused on generative AI, LLMs, prompt engineering, large model inference optimization, and scaling ML across enterprises. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. In his spare time, he enjoys traveling, hiking, cooking, and camping.

Read More

Perform intelligent search across emails in your Google workspace using the Gmail connector for Amazon Kendra

Perform intelligent search across emails in your Google workspace using the Gmail connector for Amazon Kendra

Many organizations use Gmail for their business email needs. Gmail for Business is part of Google Workspace, which provides a set of productivity and collaboration tools like Google Drive, Google Docs, Google Sheets, and more. For any organization, emails contain a wealth of information, which could be within the subject of an email, the message content, or even email attachments. Performing an intelligent search on email interactions with coworkers can help find answers to questions, thereby improving employee productivity and enhancing the overall customer experience for the organization.

Amazon Kendra is a highly accurate and intelligent search service that allows your users to search unstructured and structured data using natural language processing (NLP) and advanced search algorithms. You can now use the Gmail connector for Amazon Kendra to index emails and email attachments in Gmail, and search for answers to your questions on this content using intelligent search in Amazon Kendra, powered by machine learning (ML).

This post walks you through the process of configuring the Gmail connector for Amazon Kendra for your organization’s Google Workspace, allowing you to index emails based on a defined scope and take advantage of the intelligent search capabilities of Amazon Kendra.

Solution overview

A data source is a data repository or location that Amazon Kendra connects to and indexes your documents or content. After you create an Amazon Kendra index, you can create one or many data sources and configure them to start ingesting documents from the data source. In our solution, we ingest emails and attachments from Gmail by configuring the new Gmail data source connector to filter for emails that meet a certain filter criterion. After the configuration is complete, we can synchronize the data source to index the documents, allowing you to perform intelligent search on the Amazon Kendra index.

Prerequisites

To enable the Gmail connector for Amazon Kendra, you need the following:

  • An AWS account
  • A Google Workspace account and an organization for your business with one or many users that have access to Gmail
  • Administrator account credentials to Google Workspace and the Google Cloud console

Configure Google Workspace

To enable Amazon Kendra to access and index emails from Gmail accounts within the organization and perform intelligent search on them, it’s essential to configure your organization’s Google Workspace. In the steps that follow, we create a service account that the Gmail connector uses to index emails. The service account is provided with authorization scopes to allow access to certain Gmail APIs. The authorization scopes express the permissions you request users to authorize for your app and are applicable for all emails within your organization’s Google Workspace.

  1. Log in to your organization’s Google Cloud account.
  2. Create a new project with an appropriate name and assign it to your organization. In our example, we name the project KendraGmailConnector.
  3. Choose Create.

  1. Monitor the progress of creation of the new project on the Notifications menu on the top right of the Google Cloud console.

  1. After the project is created, choose the options menu, choose API & Services, and choose Library to view the API Library.

  1. On the API Library page, search for Admin SDK API and choose Enable. The Admin SDK API enables managing Google Workspace account resources and auditing usage.

  1. Similarly, search for Gmail API on the API Library page and choose Enable. The Gmail API can help in viewing and managing Gmail mailbox data like threads, messages, and labels.

We now create a service account, which the Gmail connector for Amazon Kendra uses to access your organization’s emails based on the allowed API scope.

  1. On the options menu, choose IAM & Admin, then choose Service Accounts.

  1. Choose Create service account.

  1. Enter a name for your service account. For this post, we name our service account AmazonKendraGmailConnector.
  2. Enter your service account ID and account description.
  3. Skip the optional steps Grant this service account access to project and Grant users access to this service account and choose Done.

  1. Choose the service account you created to open the service account details page.
  2. Note the unique ID for the service account (also known as a client ID), to use in a later step.

Next, we create keys for the service account, which allows it to be used by the Gmail connector for Amazon Kendra.

  1. On the Keys tab, choose Add key.

  1. For Key type, select JSON.
  2. Choose Create.

This step downloads the private key to your computer, which must be kept safe to allow configuration on the Amazon Kendra console.

  1. Choose Close.

The following screenshot shows an example of the credentials JSON file.

  1. On the Details tab, expand the Advanced settings section.
  2. Under Domain-wide delegation, choose View Google Workspace admin console.

Granting access to the service account via a domain-wide delegation to your organization’s data must be done with caution, and can be reversed by disabling or deleting the service account or removing access through the Google Workspace admin console.

  1. Log in to the admin console using your Google Workspace admin credentials.
  2. In the navigation pane, under Security, choose Access and data control, then choose API controls.
  3. In the Domain-wide delegation section, choose Manage domain-wide delegation.

  1. Choose Add new.

This brings up the Add a new client ID dialog.

  1. Enter the unique ID for the service account you created earlier, and enter the following scopes to allow the service account to access the emails from Gmail:
    1. https://www.googleapis.com/auth/gmail.readonly
    2. https://www.googleapis.com/auth/admin.directory.user.readonly
  2. Choose Authorize.

This concludes the configuration within the Google Cloud console and Google Workspace admin console.

Configure the Gmail connector for Amazon Kendra

In this section, we walk through the configuration steps for the Gmail connector for Amazon Kendra:

  1. On the Amazon Kendra console, create a new index or open an existing index. For this post, we use the existing index EnterpriseKendraIndex.

  1. Under Data management in the navigation pane, choose Data sources.
  2. Choose Add data source.

  1. On the list of data sources, find the Gmail connector and choose Add connector.

  1. On the Specify data source details page, complete the following steps:
    1. For Data source name, enter a name.
    2. For Description, enter an optional description.
    3. Leave the language as the default setting, English (en).

    Amazon Kendra supports a select set of languages with full semantic search. These languages include Spanish, Japanese, French, and others. For more information, see Adding documents in languages other than English.

    1. Add any tags to the index, then choose Next.

Next, we create an AWS Secrets Manager secret to store the Gmail authentication details, and use the values in the credentials JSON file that we downloaded earlier.

  6. On the Define access and security page, complete the following steps:
    1. In the Authentication section, choose Create and add new secret, which opens the Create an AWS Secrets Manager secret dialog.
    2. For Secret name, enter a name.
    3. For Client email, enter the client email ID from the credentials JSON file.
    4. For Admin account email, enter the admin email for the Google Cloud console.
    5. For Private key, enter the private key from the credentials JSON file.
    6. Choose Save to return to the Define access and security page.
    7. In the Configure VPC and security group section, you can choose a VPC and the subnets that will contain the data source, as well as the security group that will grant access to the host. For our configuration, we choose No VPC.
    8. In the IAM role section, choose Create a new role and enter a role name.
    9. Choose Next.
  7. On the Configure sync settings page, set the following parameters to sync all emails and email attachments sent from the admin email address:
    1. In the Sync scope section, select Message attachments.
    2. Under Additional configuration, configure filters for the emails to ingest into the Amazon Kendra index:
      1. For Date range, enter the start and end dates for emails to be crawled. Emails received on or after the start date and before the end date are included in the sync scope.
      2. For Email domains, enter the email from domains, email to domains, subject, CC, and BCC emails you wish to include or exclude in your index. For this post, we set the email from domain as the admin email address.
      3. For Keywords in subjects, include or exclude any documents with at least one keyword mentioned in their subjects.
      4. For Labels, add regular expression patterns to include or exclude certain labels or attachment types (up to 100 patterns).
      5. For Attachments, add regular expression patterns to include or exclude certain attachments (up to 100 patterns).
    3. In the Sync mode section, you can either specify a full sync to sync and index all contents in all entities regardless of the previous sync status, or sync only new, modified, or deleted content. For this post, we select Full sync.
    4. Lastly, we set an appropriate frequency for the sync. For this post, we choose Run on demand.
    5. Choose Next.
  8. On the Set field mappings page, you associate or create a mapping of the required data source fields with fields in your index. You can also create mappings for custom index fields. You can specify mappings for both messages and message attachments. For this post, we add field mappings in the Message section:
    1. Select the Gmail field mappings subject, from, and to.
    2. Choose Next.
  9. On the Review and create page, review all the steps and choose Add data source to create your Gmail connector data source.
  10. After the data source is created, on the Data sources page, select the data source (kendra-gmail-connector) and choose Sync now.

The amount of time the sync takes depends on the number of emails that match the sync scope and the size of the attachments that need to be indexed. You can check the status of the sync operation for the Gmail data source by choosing the data source and scrolling down to the Sync run history section. Choose the status of an individual sync to view more details.

This section shows the start and end times of the sync and also the number of documents that were added, deleted, failed, or modified during the sync. A status of Completed denotes a sync where there are no failures. In cases where a document being ingested is blank, the sync status is set to Completed with Errors with the number of failed documents listed as Failed, as shown in the following screenshot. In case of a sync failure, you can investigate the reason by either choosing the number of failed documents or by choosing the entry in the Details column, which brings up the Amazon CloudWatch logs. In the following example, two documents failed ingestion because they were blank.
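
You can also trigger and monitor the sync programmatically with the Amazon Kendra StartDataSourceSyncJob and ListDataSourceSyncJobs APIs. The following boto3 sketch uses placeholder index and data source IDs.

import boto3

kendra = boto3.client("kendra")

index_id = "<your-index-id>"              # placeholder
data_source_id = "<your-data-source-id>"  # placeholder

# Start a sync of the Gmail data source (equivalent to choosing Sync now).
sync = kendra.start_data_source_sync_job(Id=data_source_id, IndexId=index_id)
print("Started sync:", sync["ExecutionId"])

# Inspect the sync run history, including status and document counts.
history = kendra.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id)
for job in history["History"]:
    print(job["ExecutionId"], job["Status"], job.get("Metrics", {}))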

After the sync is successful, you can perform a search on the Amazon Kendra index.

Search indexed content

To search on the indexed content, choose Search indexed content in the navigation pane on the Amazon Kendra console.

On the search console, enter any natural language question. In our example, we ask “What is SageMaker?” Amazon Kendra performs an intelligent search on the emails ingested into the index, based on the scope of the sync, and finds an answer, as shown in the following screenshot.

In this example, the Document fields section shows the field mappings that we specified while configuring our data source connector.
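
You can run the same natural language query programmatically with the Amazon Kendra Query API. The following boto3 sketch uses a placeholder index ID and prints the top results.

import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<your-index-id>",  # placeholder
    QueryText="What is SageMaker?",
)

# Print the type, title, and excerpt of the top three results.
for item in response["ResultItems"][:3]:
    title = item.get("DocumentTitle", {}).get("Text", "")
    excerpt = item.get("DocumentExcerpt", {}).get("Text", "")
    print(f"[{item['Type']}] {title}")
    print(excerpt)
    print()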

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution. If you created a new Amazon Kendra index while testing this solution, delete it. If you only added a new data source using the Gmail connector, delete the added data source.

Conclusion

In this post, we showed how organizations can now use the Gmail connector for Amazon Kendra to allow users to perform intelligent search on emails and email attachments, thereby improving employee productivity and customer satisfaction.

Additionally, we walked through how to define field mappings to the Amazon Kendra data source, allowing users to refine their search results.

To learn more about the Gmail connector for Amazon Kendra, refer to Gmail data source connector for Amazon Kendra.


About the Author

Roshan Thomas is a Senior Solutions Architect at Amazon Web Services. He is based in Melbourne, Australia, and works closely with power and utilities customers to accelerate their journey in the cloud. He is passionate about technology and helping customers architect and build solutions on AWS.


Amazon SageMaker Data Wrangler for dimensionality reduction

Amazon SageMaker Data Wrangler for dimensionality reduction

In the world of machine learning (ML), the quality of the dataset is of significant importance to model predictability. Although more data is usually better, large datasets with a high number of features can sometimes lead to non-optimal model performance due to the curse of dimensionality. Analysts can spend a significant amount of time transforming data to improve model performance. Additionally, large datasets are costlier and take longer to train. If time is a constraint, model performance may be limited as a result.

Dimensionality reduction techniques can help reduce the size of your data while retaining most of its information, resulting in quicker training times, lower cost, and potentially higher-performing models.

Amazon SageMaker Data Wrangler is a purpose-built data aggregation and preparation tool for ML. Data Wrangler simplifies the process of data preparation and feature engineering, such as data selection, cleansing, exploration, and visualization, from a single visual interface. Data Wrangler provides more than 300 preconfigured data transformations that you can apply to your data. In addition, you can write custom transformations in PySpark, SQL, and pandas.

Today, we’re excited to add a new transformation technique that is commonly used in the ML world to the list of Data Wrangler pre-built transformations: dimensionality reduction using Principal Component Analysis. With this new feature, you can reduce the high number of dimensions in your datasets to a smaller number that works well with popular ML algorithms, with just a few clicks on the Data Wrangler console. This can significantly improve your model performance with minimal effort.

In this post, we provide an overview of this new feature and show how to use it in your data transformations. We demonstrate dimensionality reduction on a large, sparse dataset.

Overview of Principal Component Analysis

Principal Component Analysis (PCA) is a method that transforms a dataset with many numerical features into one with fewer features while retaining as much information as possible from the original dataset. It works by finding a new set of features, called components, that are composites of the original features and are uncorrelated with one another. Many features in a dataset have little impact on the final result yet increase the processing time of ML models, and high-dimensional problems are difficult for humans to understand and reason about. Dimensionality reduction techniques like PCA can help address both issues.
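
To make this concrete, the following short scikit-learn sketch (an illustration, not the Data Wrangler implementation) reduces a small synthetic dataset to the components that explain 95% of its variance:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 500 rows, 10 correlated numeric features built from 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("original features:", X.shape[1])
print("principal components kept:", X_reduced.shape[1])
print("variance explained per component:", pca.explained_variance_ratio_.round(3))

Because the 10 features are driven by only 3 underlying factors, a handful of components captures almost all of the information.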

Solution overview

In this post, we show how you can use the dimensionality reduction transform in Data Wrangler on the MNIST dataset to reduce the number of features by 85% and still achieve similar accuracy to the original dataset. The MNIST (Modified National Institute of Standards and Technology) dataset, which is the de facto “hello world” dataset in computer vision, is a dataset of handwritten digit images. Each row of the dataset corresponds to a single image that is 28 x 28 pixels, for a total of 784 pixels. Each pixel is represented by a single feature in the dataset, with a pixel value ranging from 0–255.

To learn more about the new dimensionality reduction feature, refer to Reduce Dimensionality within a Dataset.

Prerequisites

This post assumes that you have an Amazon SageMaker Studio domain set up. For details on how to set it up, refer to Onboard to Amazon SageMaker Domain Using Quick setup.

To get started with the new capabilities of Data Wrangler, open Studio after upgrading to the latest release and choose the File menu, New, and Flow, or choose New data flow from the Studio launcher.

Perform a Quick Model analysis

The dataset we use in this post contains 60,000 training examples and labels. Each row consists of 785 values: the first value is the label (a number from 0–9) and the remaining 784 values are the pixel values (a number from 0–255). First, we perform a Quick Model analysis on the raw data to get performance metrics and compare them with the model metrics post-PCA transformations for evaluation. Complete the following steps:

  1. Download the MNIST dataset training dataset.
  2. Extract the data from the .zip file and upload it to an Amazon Simple Storage Service (Amazon S3) bucket (see the upload sketch after these steps).
  3. In Studio, choose New and Data Wrangler Flow to create a new Data Wrangler flow.
    Data Wrangler Flow
  4. Choose Import data to load the data from Amazon S3.
    Import Data
  5. Choose Amazon S3 as your data source.
    Choose S3 data connection
  6. Select the dataset uploaded to your S3 bucket.
  7. Leave the default settings and choose Import.
    Import from S3
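
As referenced in step 2, the following is a minimal boto3 sketch for uploading the extracted training CSV to Amazon S3; the local file name, bucket, and key are placeholders.

import boto3

s3 = boto3.client("s3")

# Placeholders: adjust to your local path and bucket.
local_file = "mnist_train.csv"
bucket = "your-data-wrangler-bucket"
key = "mnist/mnist_train.csv"

s3.upload_file(local_file, bucket, key)
print(f"Uploaded to s3://{bucket}/{key}")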

After the data is imported, Data Wrangler automatically validates the dataset and detects the data types for all the columns based on a sample of the data. In the MNIST dataset, because all the columns are detected as the long data type, we leave this step as is and return to the data flow.

  8. Choose Data flow at the top of the Data types page to return to the main data flow.
    Dataset Data Types

The flow editor now shows two blocks, indicating that the data was imported from a source and that the data types were recognized. You can also edit the data types if needed.

After confirming that the data quality is acceptable, we go back to the data flow and use Data Wrangler’s Data Quality and Insights Report. This report performs an analysis on the imported dataset and provides information about missing values, outliers, target leakage, imbalanced data, and a Quick Model analysis. Refer to Get Insights On Data and Data Quality for more information.

For this analysis, we only focus on the Quick Model part of the Data Quality report.

  1. Choose the plus sign next to Data types, then choose Add analysis.
  2. For Analysis type, choose Data Quality And Insights Report.
  3. For Target column, choose label.
  4. For Problem type, select Classification (this step is optional).
  5. Choose Create.
    Create Analysis

For this post, we use the Quick Model score in the Data Quality and Insights Report only to show that model performance is largely preserved after applying PCA. For the best absolute performance on image data such as MNIST, we recommend a deep learning-based approach.

The following screenshot shows a summary of the dataset from the report. Fortunately, we don’t have any missing values. The time taken for the report to generate depends on the size of the dataset, number of features, and the instance size used by Data Wrangler.

Data Quality Summary

The following screenshot shows how the model performed on the raw dataset. Here we notice that the model achieves an accuracy of 93.7% using all 784 features.

Quick Model Before PCA

Use the Data Wrangler dimensionality reduction transform

Now let’s use the Data Wrangler dimensionality reduction transform to reduce the number of features in this dataset.

  1. On the data flow page, choose the plus sign next to Data types, then choose Add transform.
  2. Choose Add step.
    Add Step
  3. Choose Dimensionality Reduction.
    Dimensionality Reduction

If you don’t see the dimensionality reduction option listed, you need to update Data Wrangler. For instructions, refer to Update Data Wrangler.

  4. Configure the key variables that go into PCA:
    1. For Transform, choose the dimensionality reduction technique that you want to use. For this post, we choose Principal component analysis.
    2. For Input Columns, choose the columns that you want to include in the PCA analysis. For this example, we choose all the features except the target column label (you can also use the Select all feature to select all features and deselect features not needed). These columns need to be of numeric data type.
    3. For Number of principal components, specify the number of target dimensions.
    4. For Variance threshold percentage, specify the percentage of variation in the data that you want to explain by the principal components. The default value is 95; for this post, we use 80.
    5. Select Center to center the data with the mean before scaling.
    6. Select Scale to scale the data with the unit standard deviation.
      PCA gives more emphasis to variables with high variance. Therefore, if the features are on very different scales, the components are dominated by the features with the largest ranges. For example, the values of one variable might lie in the range of 50–100 while another variable lies in the range of 5–10; in this case, PCA gives more weight to the first variable. Such issues can be resolved by scaling the dataset before applying PCA.
    7. For Output Format, specify if you want to output components into separate columns or vectors. For this post, we choose Columns.
    8. For Output column, enter a prefix for column names generated by PCA. For this post, we enter PCA80_.
  5. Choose Preview to preview the data, then choose Update.

After applying PCA, the number of columns will be reduced from 784 to 115—this is an 85% reduction in the number of features.
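
The following scikit-learn sketch is not what Data Wrangler runs internally, but it reproduces the same configuration (center, scale, and keep components until 80% of the variance is explained) and writes the output as PCA80_-prefixed columns, which can be useful for sanity-checking the transform outside the console. The file path and the label column name are assumptions.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumes a local copy of the training CSV with a 'label' column plus 784 pixel columns.
df = pd.read_csv("mnist_train.csv")
features = df.drop(columns=["label"])

# Center and scale each pixel column, then keep enough components for 80% of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.80))
components = pipeline.fit_transform(features)

# One output column per component, using the same PCA80_ prefix as in the console.
reduced = pd.DataFrame(
    components,
    columns=[f"PCA80_{i}" for i in range(components.shape[1])],
)
reduced["label"] = df["label"].values
print(f"Reduced from {features.shape[1]} to {components.shape[1]} feature columns")

The exact number of components you get may differ slightly from the Data Wrangler output depending on preprocessing details.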

We can now use the transformed dataset to generate another Data Quality and Insights Report, as shown in the following screenshot, and observe the model performance.

Quick Model After PCA

We can see in the second analysis that, although the number of features dropped by 85%, the Quick Model accuracy is 91.8%, close to the 93.7% achieved on the raw dataset. In other words, PCA reduced the number of features in our dataset by 85% while maintaining the model accuracy at a similar level. For even better results, you can try deep learning models.

We also compared training times using Amazon SageMaker Autopilot with and without PCA dimensionality reduction:

  • With PCA dimensionality reduction – 25 minutes
  • Without PCA dimensionality reduction – 45 minutes

Operationalizing PCA

As data changes over time, it’s often desirable to retrain the transform’s parameters on new, unseen data. Data Wrangler offers this capability through refitting of trained parameters. For more information on refitting trained parameters, refer to Refit trained parameters on large datasets using Amazon SageMaker Data Wrangler.

Previously, we applied PCA to a sample of the MNIST dataset containing 50,000 rows. Consequently, our flow file contains a model that was trained on this sample and is used for all created jobs unless we specify that we want to relearn those parameters.
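
To illustrate the difference between reusing parameters learned on the sample and refitting them, here is a plain scikit-learn analogy (not the Data Wrangler mechanism itself):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: a "full" dataset of 60,000 rows and a 50,000-row sample of it.
rng = np.random.default_rng(42)
latent = rng.normal(size=(60_000, 5))
full_data = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(60_000, 50))
sample = full_data[:50_000]

# Without refit: keep using the components learned on the interactive-mode sample.
pca_sample = PCA(n_components=5).fit(sample)

# With refit: relearn the components on the full dataset before running the job.
pca_refit = PCA(n_components=5).fit(full_data)

# The learned components generally differ between the two fits (component signs can also flip).
print(np.abs(pca_sample.components_ - pca_refit.components_).max())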

To refit your model parameters on the MNIST training dataset, complete the following steps:

  1. Create a destination for our flow file in Amazon S3 so we can create a Data Wrangler processing job.
    Data Wrangler Processing Job
  2. Create a job and select Refit to learn new training parameters.

The Trained parameters section shows that there are 784 parameters. That is one parameter for each column because we excluded the label column in our PCA reduction.

Note that if we don’t select Refit in this step, the trained parameters learned during interactive mode will be used.

Create Job

  3. Create the job.
    Job Created
  4. Choose the processing job link to monitor the job and find the location of the resulting flow file on Amazon S3.
    Processing Job Flow

This flow file contains the model learned on the entire MNIST train dataset.

  5. Load this file into Data Wrangler.

Clean up

To clean up the environment so you don’t incur additional charges, delete the datasets and artifacts in Amazon S3. Additionally, delete the data flow file in Studio and shut down the instance it runs on. Refer to Shut Down Data Wrangler for more information.

Conclusion

Dimensionality reduction is a great technique for removing unwanted variables from a model. It can be used to reduce model complexity and noise in the data, thereby mitigating the common problem of overfitting in machine learning and deep learning models. In this post, we demonstrated that by reducing the number of features, we were still able to achieve similar accuracy for our models.

For more information about using PCA, refer to Principal Component Analysis (PCA) Algorithm. To learn more about the dimensionality reduction transform, refer to Reduce Dimensionality within a Dataset.


About the authors

Adeleke Coker is a Global Solutions Architect with AWS. He works with customers globally to provide guidance and technical assistance in deploying production workloads at scale on AWS. In his spare time, he enjoys learning, reading, gaming, and watching sports events.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.

Vishaal Kapoor is a Senior Applied Scientist with AWS AI. He is passionate about helping customers understand their data in Data Wrangler. In his spare time, he mountain bikes, snowboards, and spends time with his family.

Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings over 11 years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.
