Content moderation design patterns with AWS managed AI services

User-generated content (UGC) is growing exponentially, and so are the requirements and cost of keeping content and online communities safe and compliant. From startups to large organizations, modern web and mobile platforms fuel businesses and drive user engagement through social features. Online community members expect safe and inclusive experiences where they can freely consume and contribute images, videos, text, and audio. The ever-increasing volume, variety, and complexity of UGC make traditional human moderation workflows difficult to scale to protect users. These limitations force customers into inefficient, expensive, and reactive mitigation processes that carry unnecessary risk for users and the business. The result is a poor, harmful, and non-inclusive community experience that disengages users and negatively impacts community and business objectives.

The solution is scalable content moderation workflows that rely on artificial intelligence (AI), machine learning (ML), deep learning (DL), and natural language processing (NLP) technologies. These workflows translate, transcribe, recognize, detect, mask, and redact content, and strategically bring human talent into the loop, so you can take the actions needed to keep users safe and engaged while increasing accuracy, improving process efficiency, and lowering operational costs.

This post reviews how to build content moderation workflows using AWS AI services. To learn more about business needs, impact, and cost reductions that automated content moderation brings to social media, gaming, e-commerce, and advertising industries, see Utilize AWS AI services to automate content moderation and compliance.

Solution overview

You don’t need ML expertise to implement these workflows, and you can tailor these patterns to your specific business needs. AWS delivers these capabilities through fully managed services that remove the operational complexity and undifferentiated heavy lifting, without requiring a data science team.

In this post, we demonstrate how to efficiently moderate spaces where customers discuss and review products using text, audio, images, video, and even PDF files. The following diagram illustrates the solution architecture.

Abstract diagram showing how AWS AI services come together.

Prerequisites

By default, these patterns follow a serverless methodology, where you only pay for what you use. You do continue paying for provisioned compute resources, such as AWS Fargate containers, and storage, such as Amazon Simple Storage Service (Amazon S3), until you delete those resources. The AWS AI services discussed here also follow a consumption-based, per-operation pricing model.

Non-production environments can test each of these patterns within the AWS Free Tier, assuming your account is eligible.

Moderate plain text

First, you need to implement content moderation for plain text. This procedure serves as the foundation for more sophisticated media types and entails two high-level steps:

  1. Translate the text.
  2. Analyze the text.

Global customers want to engage with social platforms in their native language. Meeting this expectation can add complexity because design teams must otherwise construct a workflow for each language. Instead, you can use Amazon Translate, which supports translation across more than 70 languages and variants and is available in over 15 AWS Regions. This capability enables you to write analysis rules for a single language and apply those rules across the global online community.

Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. You can integrate it into your workflows to detect the dominant language and translate the text. The following diagram illustrates the workflow.

State machine for normalizing text

The APIs operate as follows:

  • The DetectDominantLanguage API (Amazon Comprehend) examines the text and returns the dominant language with a confidence score.
  • The TranslateText API converts the input text from the source language to the target language.
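
The following is a minimal sketch of these two calls using the AWS SDK for Python (Boto3); the helper name and target language are illustrative, and in the state machine these calls run inside Lambda task states.

import boto3

comprehend = boto3.client("comprehend")
translate = boto3.client("translate")

def normalize_text(text, target_language="en"):
    """Detect the dominant language, then translate into the language the analysis rules expect."""
    languages = comprehend.detect_dominant_language(Text=text)["Languages"]
    source_language = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]
    result = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_language,
        TargetLanguageCode=target_language,
    )
    return result["TranslatedText"]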

Next, you can use NLP to uncover connections in text, like discovering key phrases, analyzing sentiment, and detecting personally identifiable information (PII). Amazon Comprehend APIs extract those valuable insights and pass them into custom function handlers.

Running those handlers inside AWS Lambda functions elastically scales your code without thinking about servers or clusters. Alternatively, you can process insights from Amazon Comprehend with microservices architecture patterns. Regardless of the runtime, your code focuses on using the results, not parsing text.

The following diagram illustrates the workflow.

State machine for moderating text

Lambda functions interact with the following APIs:

  • The DetectEntities API discovers and groups the names of real-world objects such as people and places in the text. You can use a custom vocabulary to redact inappropriate and business-specific entity types.
  • The DetectSentiment API identifies the overall sentiment of the text as positive, negative, or neutral. You can train custom classifiers to recognize the industry-specific situations of interest and extract the text’s conceptual meaning.
  • The DetectPIIEntities API identifies PII in your text, such as address, bank account number, or phone number. The output contains the type of PII entity and its corresponding location.
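
As a minimal sketch (not the post's Lambda code), the following Boto3 calls gather the same signals so a custom handler can apply business rules; the function name and return shape are illustrative.

import boto3

comprehend = boto3.client("comprehend")

def analyze_text(text, language_code="en"):
    """Collect entity, sentiment, and PII signals for a moderation decision."""
    entities = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    pii = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    return {
        "entities": [(e["Type"], e["Text"]) for e in entities["Entities"]],
        "sentiment": sentiment["Sentiment"],  # for example, POSITIVE, NEGATIVE, or NEUTRAL
        # Character offsets let a handler mask or redact each PII span in place.
        "pii_spans": [(e["Type"], e["BeginOffset"], e["EndOffset"]) for e in pii["Entities"]],
    }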

Moderate audio files

To moderate audio files, you must transcribe the file to text and then analyze it. This process has two variants depending on whether you’re processing individual files (batch transcription jobs) or live audio streams (real-time transcription). Batch workflows are ideal for stored files, with the caller receiving one complete transcript. In contrast, audio streams are sampled continuously and return multiple partial transcription results.

Amazon Transcribe is an automatic speech recognition service that uses ML models to convert audio to text. You can integrate it into synchronous workflows by starting a transcription job and periodically querying the job’s status. After the job is complete, you can analyze the output using the plain text moderation workflow from the previous step.

The following diagram illustrates the workflow.

State machine for transcribing audio files

The APIs operate as follows:

  • The StartTranscriptionJob API starts an asynchronous job to transcribe speech to text.
  • The GetTranscriptionJob API returns information about a transcription job. To see the status of the job, check the TranscriptionJobStatus field. If the status property is COMPLETED, you can find the results at the location specified in the TranscriptFileUri field. If you enable content redaction, the redacted transcript appears in RedactedTranscriptFileUri.
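
The following is a minimal Boto3 sketch of the batch pattern; the job name, bucket, and media format are illustrative, and in the state machine a wait/choice loop replaces the polling loop.

import time
import boto3

transcribe = boto3.client("transcribe")

job_name = "moderate-review-audio-001"  # illustrative
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://my-ugc-bucket/audio/review-001.mp3"},  # illustrative
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Poll until the asynchronous job finishes.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]
    if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(15)

if job["TranscriptionJobStatus"] == "COMPLETED":
    # Download this transcript and send the text through the plain text moderation workflow.
    transcript_uri = job["Transcript"]["TranscriptFileUri"]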

Live audio streams need a different pattern that supports a real-time delivery model. Streaming can include pre-recorded media, such as movies, music, and podcasts, and real-time media, such as live news broadcasts. You can transcribe audio chunks instantaneously using Amazon Transcribe streaming over HTTP/2 and WebSockets protocols. After posting a chunk to the service, you receive one or more transcription result objects describing the partial and complete transcription segments. Segments that require moderation can reuse the plain text workflow from the previous section. The following diagram illustrates this process.

Flow diagram for moderating real-time audio streams

The StartStreamTranscription API opens a bidirectional HTTP/2 stream: your application streams audio to Amazon Transcribe, and Amazon Transcribe streams the transcription results back to your application.

Moderate images and photos

Moderating images requires detecting inappropriate, unwanted, or offensive content, such as nudity, suggestiveness, violence, and other categories, in images and photos.

Amazon Rekognition enables you to streamline or automate your image and video moderation workflows without requiring ML expertise. Amazon Rekognition returns a hierarchical taxonomy of moderation-related labels, which makes it easy to define granular business rules according to your standards and practices, user safety requirements, and compliance guidelines. Amazon Rekognition can also detect and read the text in an image and return bounding boxes for each word found. Text detection supports English, Arabic, Russian, German, French, Italian, Portuguese, and Spanish.

You can use the machine predictions to automate specific moderation tasks entirely. This capability enables human moderators to focus on higher-order work. In addition, Amazon Rekognition can quickly review millions of images or thousands of videos using ML and flag the subset of assets requiring further action. Prefiltering helps provide comprehensive yet cost-effective moderation coverage while reducing the amount of content that human teams moderate.

The following diagram illustrates the workflow.

State machine for moderating images

The APIs operate as follows:

  • The DetectModerationLabels API detects unsafe content in specified JPEG or PNG formatted images. Use DetectModerationLabels to moderate pictures depending on your requirements. For example, you might want to filter images that contain nudity but not images containing suggestive content.
  • The DetectText API detects text in the input image and converts it into machine-readable text.
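
A minimal Boto3 sketch of both calls follows; the bucket, object key, and confidence threshold are illustrative.

import boto3

rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-ugc-bucket", "Name": "uploads/photo-123.jpg"}}  # illustrative

# Hierarchical moderation labels, for example "Explicit Nudity" with a "Nudity" parent label.
moderation = rekognition.detect_moderation_labels(Image=image, MinConfidence=80)
flagged = [(l["Name"], l["ParentName"], round(l["Confidence"], 1)) for l in moderation["ModerationLabels"]]

# Words embedded in the image can be sent through the plain text moderation workflow.
text = rekognition.detect_text(Image=image)
words = [d["DetectedText"] for d in text["TextDetections"] if d["Type"] == "WORD"]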

Moderate rich text documents

Next, you can use Amazon Textract to extract printed and handwritten text and structured data from scanned documents. This process begins with invoking the StartDocumentAnalysis action to parse Adobe PDF files and scanned document images. You can monitor the job’s progress with the GetDocumentAnalysis action.

The analysis result specifies each uncovered page, paragraph, table, and key-value pair in the document. For example, suppose a health provider must mask patient names in only the claim description field. In that case, the analysis report can power intelligent document processing pipelines that moderate and redact the specific data field. The following diagram illustrates the pipeline.

State machine for moderating rich text documents

The APIs operate as follows:

  • The StartDocumentAnalysis API starts the asynchronous analysis of an input document for relationships between detected items such as key-value pairs, tables, and selection elements.
  • The GetDocumentAnalysis API gets the results for an Amazon Textract asynchronous operation that analyzes text in a document.
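
The following is a minimal Boto3 sketch of the asynchronous Amazon Textract flow; the bucket and document name are illustrative, and result pagination (NextToken) is omitted for brevity.

import time
import boto3

textract = boto3.client("textract")

job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-claims-bucket", "Name": "claims/claim-123.pdf"}},  # illustrative
    FeatureTypes=["FORMS", "TABLES"],
)

# Poll until the analysis finishes; results arrive as Block objects (pages, lines, key-value pairs, tables).
while True:
    result = textract.get_document_analysis(JobId=job["JobId"])
    if result["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(10)

# Key-value blocks identify fields (such as a claim description) that a downstream step can redact.
key_value_blocks = [b for b in result.get("Blocks", []) if b["BlockType"] == "KEY_VALUE_SET"]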

Moderate videos

A standard approach to video content moderation is a frame sampling procedure. Many use cases don’t need to check every frame, and sampling one frame every 15–30 seconds is sufficient. Sampled video frames can reuse the state machine for moderating images from the previous section. Similarly, the existing audio moderation process can handle the file’s audible content. The following diagram illustrates this workflow.

State machine for moderating video files

The Invoke API runs a Lambda function and synchronously waits for the response.
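
The following is a minimal sketch of the frame-sampling step that such a Lambda function could run, assuming OpenCV (opencv-python) is packaged with the function; the function name, sampling interval, and confidence threshold are illustrative.

import boto3
import cv2  # assumes opencv-python is bundled in the Lambda deployment package or container image

rekognition = boto3.client("rekognition")

def sample_and_moderate(video_path, interval_seconds=15, min_confidence=80):
    """Send one frame every interval_seconds to Amazon Rekognition and collect the findings."""
    findings = []
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 30
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    for frame_index in range(0, total_frames, int(fps * interval_seconds)):
        capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
        ok, frame = capture.read()
        if not ok:
            break
        _, jpeg = cv2.imencode(".jpg", frame)
        labels = rekognition.detect_moderation_labels(
            Image={"Bytes": jpeg.tobytes()}, MinConfidence=min_confidence
        )["ModerationLabels"]
        if labels:
            findings.append({"timestamp_seconds": frame_index / fps, "labels": labels})
    capture.release()
    return findings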

Suppose the media file is an entire movie with multiple scenes. In that case, you can use the Amazon Rekognition segment APIs, composite APIs for detecting technical cues and shot changes. Next, you can use these time offsets to process each segment in parallel with the previous video moderation pattern, as shown in the following diagram.

State machine for moderating video segments

The APIs operate as follows:

  • The StartSegmentDetection API starts asynchronous detection of segments in a stored video.
  • The GetSegmentDetection API gets the segment detection results of an Amazon Rekognition Video analysis started by StartSegmentDetection.
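
A minimal Boto3 sketch of segment detection follows; the bucket, object key, and polling interval are illustrative, and in the state machine the segments fan out to parallel branches instead of a Python loop.

import time
import boto3

rekognition = boto3.client("rekognition")

job = rekognition.start_segment_detection(
    Video={"S3Object": {"Bucket": "my-media-bucket", "Name": "movies/feature.mp4"}},  # illustrative
    SegmentTypes=["SHOT", "TECHNICAL_CUE"],
)

while True:
    result = rekognition.get_segment_detection(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)

# Each segment's start and end timestamps (in milliseconds) feed the per-segment video moderation pattern.
segments = [
    (s["Type"], s["StartTimestampMillis"], s["EndTimestampMillis"])
    for s in result.get("Segments", [])
]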

Extracting individual frames from the movie doesn’t require fetching the object from Amazon S3 multiple times. A naïve solution reads the entire video into memory and seeks through it to the end. This pattern is suitable for short clips and for assessments that aren’t time-sensitive.

Another strategy entails moving the file once to Amazon Elastic File System (Amazon EFS), a fully managed, scalable, shared file system that other AWS services, such as Lambda, can mount. With Amazon EFS for Lambda, you can efficiently share data across function invocations. Each invocation handles a small chunk, unlocking massively parallel processing and faster processing times.

Clean up

After you experiment with the methods in this post, you should delete any content in S3 buckets to avoid future costs. If you implemented these patterns with provisioned compute resources like Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS), you should stop those instances to avoid further charges.

Conclusion

User-generated content and its value to gaming, social media, e-commerce, and financial and health services organizations will continue to grow. Still, startups and large organizations alike need efficient moderation processes to protect users, information, and the business while lowering operational costs. This solution demonstrates how AI, ML, and NLP technologies can help you moderate content efficiently at scale. You can customize AWS AI services to address your specific moderation needs, and these fully managed capabilities remove operational complexity while strategically integrating contextual insights and human talent into your moderation processes.

For additional information, resources, and to get started for free today, visit the AWS content moderation homepage.


About the Authors

Nate Bachmeier is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing applications. Besides this, Nate is a full-time student and has two kids.

Ram Pathangi is a Solutions Architect at Amazon Web Services in the San Francisco Bay Area. He has helped customers in agriculture, insurance, banking, retail, healthcare and life sciences, hospitality, and hi-tech verticals run their businesses successfully on the AWS Cloud. He specializes in databases, analytics, and machine learning.

Roop Bains is a Solutions Architect at AWS focusing on AI/ML. He is passionate about helping customers innovate and achieve their business objectives using artificial intelligence and machine learning. In his spare time, Roop enjoys reading and hiking.

Read More

Process larger and wider datasets with Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Data Wrangler can simplify your data preparation and feature engineering processes and help you with data selection, cleaning, exploration, and visualization. Data Wrangler has over 300 built-in transforms written in PySpark, so you can process datasets up to hundreds of gigabytes efficiently on the default instance, ml.m5.4xlarge.

However, when you work with datasets up to terabytes of data using built-in transforms, you might experience longer processing time or potential out-of-memory errors. Based on your data requirements, you can now use additional Amazon Elastic Compute Cloud (Amazon EC2) M5 instances and R5 instances. For example, you can start with a default instance (ml.m5.4xlarge) and then switch to ml.m5.24xlarge or ml.r5.24xlarge. You have the option of picking different instance types and finding the best trade-off of running cost and processing times. The next time you’re working on time series transformation and running heavy transformers to balance your data, you can right-size your Data Wrangler instance to run these processes faster.

When processing tens of gigabytes or even more with a custom Pandas transform, you might experience out-of-memory errors. You can switch from the default instance (ml.m5.4xlarge) to ml.m5.24xlarge, and the transform will finish without any errors. We thoroughly benchmarked and observed linear speedup as we increased instance size across a portfolio of datasets.

In this post, we share our findings from two benchmark tests to demonstrate how you can process larger and wider datasets with Data Wrangler.

Data Wrangler benchmark tests

Let’s review two tests we ran, aggregation queries and one-hot encoding, with different instance types using PySpark built-in transformers and custom Pandas transforms. Transformations that don’t require aggregation finish quickly and work well with the default instance type, so we focused on aggregation queries and transformations with aggregation. We stored our test dataset on Amazon Simple Storage Service (Amazon S3). This dataset’s expanded size is around 100 GB, with 80 million rows and 300 columns. We used UI metrics to time the benchmark tests and measure end-to-end customer-facing latency. When importing our test dataset, we disabled sampling. Sampling is enabled by default, and when it is enabled, Data Wrangler only processes the first 100 rows.

As we increased the Data Wrangler instance size, we observed a roughly linear speedup of Data Wrangler built-in transforms and custom Spark SQL. The Pandas aggregation query tests only finished when we used ml.m5.16xl or larger instances, because Pandas needed 180 GB of memory to process aggregation queries for this dataset.

The following table summarizes the aggregation query test results.

Instance     vCPU   Memory (GiB)   Data Wrangler built-in Spark transform time   Pandas (custom transform) time
ml.m5.4xl    16     64             229 seconds                                   Out of memory
ml.m5.8xl    32     128            130 seconds                                   Out of memory
ml.m5.16xl   64     256            52 seconds                                    30 minutes

The following table summarizes the one-hot encoding test results.

Instance     vCPU   Memory (GiB)   Data Wrangler built-in Spark transform time   Pandas (custom transform) time
ml.m5.4xl    16     64             228 seconds                                   Out of memory
ml.m5.8xl    32     128            130 seconds                                   Out of memory
ml.m5.16xl   64     256            52 seconds                                    Out of memory

Switch the instance type of a data flow

To switch the instance type of your flow, complete the following steps:

  1. On the Amazon SageMaker Data Wrangler console, navigate to the data flow that you’re currently using.
  2. Choose the instance type on the navigation bar.
  3. Select the instance type that you want to use.
  4. Choose Save.

A progress message appears.

When the switch is complete, a success message appears.

Data Wrangler uses the selected instance type for data analysis and data transformations. The default instance and the instance you switched to (in this example, ml.m5.16xlarge) are both running. You can change the instance type or switch back to the default instance before running a specific transformation.

Shut down unused instances

You are charged for all running instances. To avoid incurring additional charges, manually shut down the instances that you aren’t using. To shut down a running instance, complete the following steps:

  1. On your data flow page, choose the instance icon in the left pane of the UI under Running instances.
  2. Choose Shut down.

If you shut down an instance used to run a flow, you temporarily can’t access the flow. If you get an error when opening the flow whose instance you previously shut down, wait approximately 5 minutes and try opening it again.

Conclusion

In this post, we demonstrated how to process larger and wider datasets with Data Wrangler by switching instances to larger M5 or R5 instance types. M5 instances offer a balance of compute, memory, and networking resources. R5 instances are memory-optimized instances. Both M5 and R5 provide instance types to optimize cost and performance for your workloads.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.


About the Authors

Haider Naqvi is a Solutions Architect at AWS. He has extensive software development and enterprise architecture experience. He focuses on enabling customers to achieve business outcomes with AWS. He is based out of New York.

Huong Nguyen is a Sr. Product Manager at AWS. She is leading the data ecosystem integration for SageMaker, with 14 years of experience building customer-centric and data-driven products for both enterprise and consumer spaces.

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Sriharsha M Sr is an AI/ML Specialist Solutions Architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.

Nikita Ivkin is an Applied Scientist, Amazon SageMaker Data Wrangler.

Read More

Fine-tune transformer language models for linguistic diversity with Hugging Face on Amazon SageMaker

Approximately 7,000 languages are in use today. Despite attempts in the late 19th century to invent constructed languages such as Volapük or Esperanto, there is no sign of unification. People still choose to create new languages (think about your favorite movie character who speaks Klingon, Dothraki, or Elvish).

Today, natural language processing (NLP) examples are dominated by the English language, the native language for only 5% of the human population and spoken by only 17%.

The digital divide is defined as the gap between those who can access digital technologies and those who can’t. Lack of access to knowledge or education due to language barriers also contributes to the digital divide, not only between people who don’t speak English, but also for the English-speaking people who don’t have access to non-English content, which reduces diversity of thought and knowledge. There is so much to learn mutually.

In this post, we summarize the challenges of low-resource languages and experiment with different solution approaches covering over 100 languages using Hugging Face transformers on Amazon SageMaker.

We fine-tune various pre-trained transformer-based language models for a question answering task. We use Turkish in our example, but you could apply this approach to other supported languages. Our focus is on BERT [1] variants, because a great feature of BERT is its unified architecture across different tasks.

We demonstrate several benefits of using Hugging Face transformers on Amazon SageMaker, such as training and experimentation at scale, and increased productivity and cost-efficiency.

Overview of NLP

There have been several major developments in NLP since 2017. The emergence of deep learning architectures such as transformers [2], the unsupervised learning techniques to train such models on extremely large datasets, and transfer learning have significantly improved the state-of-the-art in natural language understanding. The arrival of pre-trained model hubs has further democratized access to the collective knowledge of the NLP community, removing the need to start from scratch.

A language model is an NLP model that learns to predict the next word (or any masked word) in a sequence. The genuine beauty of language models as a starting point is threefold: First, research has shown that language models trained on a large text corpus learn more complex meanings of words than previous methods. For instance, to be able to predict the next word in a sentence, the language model has to be good at understanding the context, the semantics, and also the grammar. Second, to train a language model, labeled data (which is scarce and expensive) is not required during pre-training. This is important because an enormous amount of unlabeled text data is publicly available on the web in many languages. Third, it has been demonstrated that once the language model is smart enough to predict the next word for any given sentence, it’s relatively easy to perform other NLP tasks such as sentiment analysis or question answering with very little labeled data, because fine-tuning reuses representations from a pre-trained language model [3].

Fully managed NLP services have also accelerated the adoption of NLP. Amazon Comprehend is a fully managed service that enables text analytics to extract insights from the content of documents, and it supports a variety of languages. Amazon Comprehend supports custom classification and custom entity recognition and enables you to build custom NLP models that are specific to your requirements, without the need for any ML expertise.

Challenges and solutions for low-resource languages

The main challenge for a large number of languages is that they have relatively less data available for training. These are called low-resource languages. The m-BERT paper [4] and XLM-R paper [7] refer to Urdu and Swahili as low-resource languages.

The following figure specifies the ISO codes of over 80 languages, and the difference in size (in log-scale) between the two major pre-trainings [7]. In Wikipedia (orange), there are only 18 languages with over 1 million articles and 52 languages with over 1,000 articles, but 164 languages with only 1–10,000 articles [9]. The CommonCrawl corpus (blue) increases the amount of data for low-resource languages by two orders of magnitude. Nevertheless, they are still relatively small compared to high-resource languages such as English, Russian, or German.

In terms of Wikipedia article numbers, Turkish is another language in the same group of over 100,000 articles (28th), together with Urdu (54th). Compared with Urdu, Turkish would be regarded as a mid-resource language. Turkish has some interesting characteristics that create certain challenges for language models in linguistics and tokenization. It’s an agglutinative language with a very free word order, a complex morphology, and tenses without English equivalents. Phrases formed of several words in languages like English can be expressed with a single word form, as shown in the following example.

Turkish                                      English
Kedi                                         Cat
Kediler                                      Cats
Kedigiller                                   Family of cats
Kedigillerden                                Belonging to the family of cats
Kedileştirebileceklerimizdenmişçesineyken    As if it were one of those we could turn into a cat

Two main solution approaches are language-specific models or multilingual models (with or without cross-language supervision):

  • Monolingual language models – The first approach is to apply a BERT variant to a specific target language. The more the training data, the better the model performance.
  • Multilingual masked language models – The other approach is to pre-train large transformer models on many languages. Multilingual language modeling aims to solve the lack of data challenge for low-resource languages by pre-training on a large number of languages so that NLP tasks learned from one language can be transferred to other languages. Multilingual masked language models (MLMs) have pushed the state-of-the-art on cross-lingual understanding tasks. Two examples are:

    • Multilingual BERT – The multilingual BERT model was trained in 104 different languages using the Wikipedia corpus. However, it has been shown that it only generalizes well across similar linguistic structures and typological features (for example, languages with similar word order). Its multilinguality is diminished especially for languages with different word orders (for example, subject/object/verb) [4].
    • XLM-R – Cross-lingual language models (XLMs) are trained with a cross-lingual objective using parallel datasets (the same text in two different languages) or without a cross-lingual objective using monolingual datasets [6]. Research shows that low-resource languages benefit from scaling to more languages. XLM-RoBERTa is a transformer-based model inspired by RoBERTa [5], and its starting point is the proposition that multilingual BERT and XLM are under-tuned. It’s trained on 100 languages using both the Wikipedia and CommonCrawl corpus, so the amount of training data for low-resource languages is approximately two orders of magnitude larger compared to m-BERT [7].

Another challenge of multilingual language models for low-resource languages is vocabulary size and tokenization. Because all languages use the same shared vocabulary in multilingual language models, there is a trade-off between increasing vocabulary size (which increases the compute requirements) and decreasing it (words not present in the vocabulary would be marked as unknown, or using characters instead of words as tokens would ignore any structure). The word-piece tokenization algorithm combines the benefits of both approaches. For instance, it effectively handles out-of-vocabulary words by splitting the word into subwords until it is present in the vocabulary or until the individual character is reached. Character-based tokenization isn’t very useful except for certain languages, such as Chinese. Techniques exist to address challenges for low-resource languages, such as sampling with certain distributions [6].

The following table depicts how three different tokenizers behave for the word “kedileri” (meaning “its cats”). For certain languages and NLP tasks, this would make a difference. For instance, for the question answering task, the model returns a span defined by a start token index and an end token index; returning “kediler” (“cats”) rather than “kedileri” (“its cats”) loses some context and leads to different evaluation results for certain metrics.

Pretrained Model                   Vocabulary size   Tokenization for “Kedileri”*
dbmdz/bert-base-turkish-uncased    32,000            Tokens:    [CLS] kediler ##i [SEP]
                                                     Input IDs: 2 23714 1023 3
bert-base-multilingual-uncased     105,879           Tokens:    [CLS] ked ##iler ##i [SEP]
                                                     Input IDs: 101 30210 33719 10116 102
deepset/xlm-roberta-base-squad2    250,002           Tokens:    <s> ▁Ke di leri </s>
                                                     Input IDs: 0 1345 428 1341 .
*In English: (Its) cats

Therefore, although low-resource languages benefit from multilingual language models, performing tokenization across a shared vocabulary may ignore some linguistic features for certain languages.

In the next section, we compare three approaches by fine-tuning them for a question answering task using a QA dataset for Turkish: BERTurk [8], multilingual BERT [4], and XLM-R [7].

Solution overview

Our workflow is as follows:

  1. Prepare the dataset in an Amazon SageMaker Studio notebook environment and upload it to Amazon Simple Storage Service (Amazon S3).
  2. Launch parallel training jobs on SageMaker training deep learning containers by providing the fine-tuning script.
  3. Collect metadata from each experiment.
  4. Compare results and identify the most appropriate model.

The following diagram illustrates the solution architecture.

For more information on Studio notebooks, refer to Dive deep into Amazon SageMaker Studio Notebooks architecture. For more information on how Hugging Face is integrated with SageMaker, refer to AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models.

Prepare the dataset

The Hugging Face Datasets library provides powerful data processing methods to quickly get a dataset ready for training in a deep learning model. The following code loads the Turkish QA dataset and explores what’s inside:

data_files = {}
data_files["train"] = 'data/train.json'
data_files["validation"] = 'data/val.json'

ds = load_dataset("json", data_files=data_files)

print("Number of features in dataset: n Train = {}, n Validation = {}".format(len(ds['train']), len(ds['validation'])))

There are about 9,000 samples.

The input dataset is slightly transformed into a format expected by the pre-trained models and contains the following columns:

df = pd.DataFrame(ds['train'])
df.sample(1)


The English translation of the output is as follows:

  • context – Resit Emre Kongar (b. 13 October 1941, Istanbul), Turkish sociologist, professor.
  • question – What is the academic title of Emre Kongar?
  • answer – Professor

Fine-tuning script

The Hugging Face Transformers library provides an example code to fine-tune a model for a question answering task, called run_qa.py. The following code initializes the trainer:

# Initialize our Trainer
trainer = QuestionAnsweringTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    eval_examples=eval_examples,
    tokenizer=tokenizer,
    data_collator=data_collator,
    post_process_function=post_processing_function,
    compute_metrics=compute_metrics,
)

Let’s review the building blocks on a high level.

Tokenizer

The script loads a tokenizer using the AutoTokenizer class. The AutoTokenizer class takes care of returning the correct tokenizer that corresponds to the model:

tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=True,
        revision=model_args.model_revision,
        use_auth_token=None,
    )

The following is an example of how the tokenizer works:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/xlm-roberta-base-squad2")

input_ids = tokenizer.encode('İstanbulun en popüler hayvanı hangisidir? Kedileri', return_tensors="pt")
tokens = tokenizer('İstanbulun en popüler hayvanı hangisidir? Kedileri').tokens()

Model

The script loads a model. AutoModel classes (for example, AutoModelForQuestionAnswering) directly create a class with weights, configuration, and vocabulary of the relevant architecture given the name and path to the pre-trained model. Thanks to the abstraction by Hugging Face, you can easily switch to a different model using the same code, just by providing the model’s name. See the following example code:

    model = AutoModelForQuestionAnswering.from_pretrained(
        model_args.model_name_or_path,
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
    )

Preprocessing and training

The prepare_train_features() and prepare_validation_features() methods preprocess the training dataset and validation datasets, respectively. The code iterates over the input dataset and builds a sequence from the context and the current question, with the correct model-specific token type IDs (numerical representations of tokens) and attention masks. The sequence is then passed through the model. This outputs a range of scores, for both the start and end positions, as shown in the following table.

Input Dataset Fields                     Preprocessed Training Dataset Fields for QuestionAnsweringTrainer
id                                       input_ids
title                                    attention_mask
context                                  start_positions
question                                 end_positions
answers { answer_start, answer_text }

Evaluation

The compute_metrics() method takes care of calculating metrics. We use the following popular metrics for question answering tasks:

  • Exact match – Measures the percentage of predictions that match any one of the ground truth answers exactly.
  • F1 score – Measures the average overlap between the prediction and ground truth answer. The F1 score is the harmonic mean of precision and recall:

    • Precision – The ratio of the number of shared words to the total number of words in the prediction.
    • Recall – The ratio of the number of shared words to the total number of words in the ground truth.
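
As a minimal illustration of how these two ratios combine (evaluation scripts such as the official SQuAD script additionally normalize case, punctuation, and articles), the token-overlap F1 can be computed as follows; the example strings are illustrative.

from collections import Counter

def f1_score(prediction, ground_truth):
    """Token-overlap F1 with simple whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)   # shared words / words in the prediction
    recall = num_shared / len(truth_tokens)     # shared words / words in the ground truth
    return 2 * precision * recall / (precision + recall)

# The prediction contains the answer plus one extra word: precision 0.5, recall 1.0, F1 ≈ 0.67.
print(f1_score("professor emeritus", "professor"))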

Managed training on SageMaker

Setting up and managing custom machine learning (ML) environments can be time-consuming and cumbersome. With AWS Deep Learning Containers (DLCs) for Hugging Face Transformers libraries, we have access to prepackaged and optimized deep learning frameworks, which makes it easy to run our script across multiple training jobs with minimal additional code.

We just need to use the Hugging Face Estimator available in the SageMaker Python SDK with the following inputs:

# Trial configuration. The `config` dict, `trial_configs` list, `role`, `sagemaker_session_bucket`,
# `s3_prefix_qa`, `metric_definitions`, the Experiments objects (`nlp_experiment`, `nlp_trial`),
# and the `model` and `instance` strings used in the job name are defined in earlier notebook cells.
config['model'] = 'deepset/xlm-roberta-base-squad2'
config['instance_type'] = 'ml.p3.16xlarge'
config['instance_count'] = 2

# Define the distribution parameters in the HuggingFace Estimator
config['distribution'] = {'smdistributed': {'dataparallel': {'enabled': True}}}
trial_configs.append(config)

# We can specify a training script that is stored in a GitHub repository as the entry point
# for our Estimator, so we don't have to download the scripts locally.
git_config = {'repo': 'https://github.com/huggingface/transformers.git'}

hyperparameters_qa = {
    'model_name_or_path': config['model'],
    'train_file': '/opt/ml/input/data/train/train.json',
    'validation_file': '/opt/ml/input/data/val/val.json',
    'do_train': True,
    'do_eval': True,
    'fp16': True,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16,
    'num_train_epochs': 2,
    'max_seq_length': 384,
    'pad_to_max_length': True,
    'doc_stride': 128,
    'output_dir': '/opt/ml/model'
}

huggingface_estimator = HuggingFace(entry_point='run_qa.py',
                                    source_dir='./examples/pytorch/question-answering',
                                    git_config=git_config,
                                    instance_type=config['instance_type'],
                                    instance_count=config['instance_count'],
                                    role=role,
                                    transformers_version='4.12.3',
                                    pytorch_version='1.9.1',
                                    py_version='py38',
                                    distribution=config['distribution'],
                                    hyperparameters=hyperparameters_qa,
                                    metric_definitions=metric_definitions,
                                    enable_sagemaker_metrics=True,)

nlp_training_job_name = f"NLPjob-{model}-{instance}-{int(time.time())}"

training_input_path = f's3://{sagemaker_session_bucket}/{s3_prefix_qa}/'
test_input_path = f's3://{sagemaker_session_bucket}/{s3_prefix_qa}/'

huggingface_estimator.fit(
    inputs={'train': training_input_path, 'val': test_input_path},
    job_name=nlp_training_job_name,
    experiment_config={
        "ExperimentName": nlp_experiment.experiment_name,
        "TrialName": nlp_trial.trial_name,
        "TrialComponentDisplayName": nlp_trial.trial_name,
    },
    wait=False,
)

Evaluate the results

When the fine-tuning jobs for the Turkish question answering task are complete, we compare the model performance of the three approaches:

  • Monolingual language model – The pre-trained model fine-tuned on the Turkish question answering text is called bert-base-turkish-uncased [8]. It achieves an F1 score of 75.63 and an exact match score of 56.17 in only two epochs and with 9,000 labeled items. However, this approach is not suitable for a low-resource language when a pre-trained language model doesn’t exist, or there is little data available for training from scratch.
  • Multilingual language model with multilingual BERT – The pre-trained model is called bert-base-multilingual-uncased. The multilingual BERT paper [4] has shown that it generalizes well across languages. Compared with the monolingual model, it performs worse (F1 score 71.73, exact match 50.45), but note that this model handles over 100 other languages, leaving less room for representing the Turkish language.
  • Multilingual language model with XLM-R – The pre-trained model is called xlm-roberta-base-squad2. The XLM-R paper shows that it is possible to have a single large model for over 100 languages without sacrificing per-language performance [7]. For the Turkish question answering task, it outperforms the multilingual BERT and monolingual BERT F1 scores by 5% and 2%, respectively (F1 score 77.14, exact match 56.39).

Our comparison doesn’t take into consideration other differences between models such as the model capacity, training datasets used, NLP tasks pre-trained on, vocabulary size, or tokenization.

Additional experiments

The provided notebook contains additional experiment examples.

SageMaker provides a wide range of training instance types. We fine-tuned the XLM-R model on p3.2xlarge (one Nvidia V100 GPU, Volta architecture, 2017), p3.16xlarge (eight Nvidia V100 GPUs), and g4dn.xlarge (one Nvidia T4 GPU, Turing architecture, 2018), and observed the following:

  • Training duration – According to our experiment, the XLM-R model took approximately 24 minutes to train on p3.2xlarge and 30 minutes on g4dn.xlarge (about 23% longer). We also performed distributed fine-tuning on two p3.16xlarge instances, and the training time decreased to 10 minutes. For more information on distributed training of a transformer-based model on SageMaker, refer to Distributed fine-tuning of a BERT Large model for a Question-Answering Task using Hugging Face Transformers on Amazon SageMaker.
  • Training costs – We used the AWS Pricing API to fetch SageMaker on-demand prices to calculate it on the fly. According to our experiment, training cost approximately $1.58 on p3.2xlarge, and about four times less on g4dn.xlarge ($0.37). Distributed training on two p3.16xlarge instances using 16 GPUs cost $9.68.

To summarize, although the g4dn.xlarge was the least expensive machine, it also took about three times longer to train than the most powerful instance type we experimented with (two p3.16xlarge). Depending on your project priorities, you could choose from a wide variety of SageMaker training instance types.

Conclusion

In this post, we explored fine-tuning pre-trained transformer-based language models for a question answering task for a mid-resource language (in this case, Turkish). You can apply this approach to over 100 other languages using a single model. As of writing, scaling up a model to cover all of the world’s 7,000 languages is still prohibitive, but the field of NLP provides an opportunity to widen our horizons.

Language is the principal method of human communication and a means of conveying values and sharing the beauty of a cultural heritage. Linguistic diversity strengthens intercultural dialogue and builds inclusive societies.

ML is a highly iterative process; over the course of a single project, data scientists train hundreds of different models with different datasets and parameters in search of maximum accuracy. SageMaker offers a comprehensive set of tools to harness the power of ML and deep learning, and it lets you organize, track, compare, and evaluate ML experiments at scale.

Hugging Face is integrated with SageMaker to help data scientists develop, train, and tune state-of-the-art NLP models more quickly and easily. We demonstrated several benefits of using Hugging Face transformers on Amazon SageMaker, such as training and experimentation at scale, and increased productivity and cost-efficiency.

You can experiment with NLP tasks on your preferred language in SageMaker in all AWS Regions where SageMaker is available. The example notebook code is available in GitHub.

To learn how Amazon SageMaker Training Compiler can accelerate the training of deep learning models by up to 50%, see New – Introducing SageMaker Training Compiler.

The authors would like to express their deepest appreciation to Mariano Kamp and Emily Webber for reviewing drafts and providing advice.

References

  1. J. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”, (2018).
  2. A. Vaswani et al., “Attention Is All You Need”, (2017).
  3. J. Howard and S. Ruder, “Universal Language Model Fine-Tuning for Text Classification”, (2018).
  4. T. Pires et al., “How multilingual is Multilingual BERT?”, (2019).
  5. Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, (2019).
  6. G. Lample, and A. Conneau, “Cross-Lingual Language Model Pretraining”, (2019).
  7. A. Conneau et al., “Unsupervised Cross-Lingual Representation Learning at Scale”, (2019).
  8. Stefan Schweter. BERTurk – BERT models for Turkish (2020).
  9. Multilingual Wiki Statistics https://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics

About the Authors

Arnav Khare is a Principal Solutions Architect for Global Financial Services at AWS. His primary focus is helping Financial Services Institutions build and design Analytics and Machine Learning applications in the cloud. Arnav holds an MSc in Artificial Intelligence from Edinburgh University and has 18 years of industry experience ranging from small startups he founded to large enterprises like Nokia and Bank of America. Outside of work, Arnav loves spending time with his two daughters, finding new independent coffee shops, reading, and traveling. You can find him on LinkedIn, and in Surrey, UK, in real life.

Hasan-Basri AKIRMAK (BSc and MSc in Computer Engineering and Executive MBA in Graduate School of Business) is a Senior Solutions Architect at Amazon Web Services. He is a business technologist advising enterprise segment clients. His area of specialty is designing architectures and business cases on large scale data processing systems and Machine Learning solutions. Hasan has delivered Business development, Systems Integration, Program Management for clients in Europe, Middle East and Africa. Since 2016 he mentored hundreds of entrepreneurs at startup incubation programs pro-bono.

Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning and leads the Natural Language Processing (NLP) community within AWS. Prior to this role, he was the Head of Data Science for Amazon’s EU Customer Service. Heiko helps customers succeed in their AI/ML journey on AWS and has worked with organizations in many industries, including Insurance, Financial Services, Media and Entertainment, Healthcare, Utilities, and Manufacturing. In his spare time, Heiko travels as much as possible.

Read More

Build a custom Q&A dataset using Amazon SageMaker Ground Truth to train a Hugging Face Q&A NLU model

In recent years, natural language understanding (NLU) has increasingly found business value, fueled by model improvements as well as the scalability and cost-efficiency of cloud-based infrastructure. Specifically, the Transformer deep learning architecture, often implemented in the form of BERT models, has been highly successful, but training, fine-tuning, and optimizing these models has proven to be a challenging problem. Thanks to the AWS and Hugging Face collaboration, it’s now simpler to train and optimize NLU models on Amazon SageMaker using the SageMaker Python SDK, but sourcing labeled data for these models is still difficult and time-consuming.

One NLU problem of particular business interest is the task of question answering. In this post, we demonstrate how to build a custom question answering dataset using Amazon SageMaker Ground Truth to train a Hugging Face question answering NLU model.

Question answering challenges

Question answering entails a model automatically producing an answer to a query given some body of text that may or may not contain the answer. For example, given the following question, “What workflows does SageMaker Ground Truth support?” a model should be able to identify the segment “annotation consolidation and audit” in the following paragraph:

SageMaker Ground Truth helps improve the quality of labels through annotation consolidation and audit workflows. Annotation consolidation is the process of collecting label inputs from two or more data labelers and combining them to create a single data label for your machine learning model. With built-in audit and review workflows, workers can perform label verification and make adjustments to improve accuracy.

This problem is challenging because it requires a model to comprehend the meaning of a question, rather than simply perform keyword search. Accurate models in this area can reduce customer support costs through powering intelligent chatbots, delivering high-quality voice assistant products, and driving online store revenue through personalized product question answering. One large dataset in this area is the Stanford Question Answering Dataset (SQuAD), a diverse question answering dataset that presents a model with short text passages and requires the model to predict the location of the answering text span in the passage. SQuAD is a reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is either a span of text from the corresponding passage, or otherwise marked impossible to answer.

One challenge in adapting SQuAD for business use cases is generating domain-specific custom datasets. This process of creating new question and answer datasets requires a specialized user interface that allows annotators to highlight spans and add questions to those spans. It must also support the addition of impossible questions, in line with the SQuAD 2.0 format, which includes non-answerable questions. These impossible questions help models gain additional understanding around which queries can’t be answered using the given passage. The custom worker templates in Ground Truth simplify the generation of these datasets by providing workers with a tailored annotation experience for creating question and answer datasets.

Solution overview

This solution creates and manages Ground Truth labeling jobs to label a domain-specific custom question-answer dataset using a custom annotation user interface. We use SageMaker to train, fine-tune, optimize, and deploy a Hugging Face BERT model built with PyTorch on a custom question answering dataset.

You can implement the solution by deploying the provided AWS CloudFormation template in your AWS account. AWS CloudFormation handles deploying the AWS Lambda functions that support pre-annotation and annotation consolidation for the annotation user interface. It also creates an Amazon Simple Storage Service (Amazon S3) bucket and the AWS Identity and Access Management (IAM) roles to use when creating a labeling job.

This post walks you through how to do the following:

  • Create your own question answering dataset, or augment an existing one using Ground Truth
  • Use Hugging Face datasets to combine and tokenize text
  • Fine-tune a BERT model on your question answering data using SageMaker training
  • Deploy your model to a SageMaker endpoint and visualize your results

Annotation user interface

We use a new custom worker task template with Ground Truth to add new annotations to the existing SQuAD dataset. This solution offers a worker task template as well as a pre-annotation Lambda function (which handles putting data into the user interface) and post-annotation Lambda function (which extracts results from the user interface after labeling is complete).

This custom worker task template gives you the ability to highlight text in the right pane, then add a corresponding question in the left pane that relates to the highlighted text. Highlighted text on the right pane can also be added to any previously created question. Moreover, you can add impossible questions according to SQuAD 2.0 format. Impossible questions allow models to reduce the number of unreliable false positive guesses when the passage is unable to answer a query.

This user interface uses the same JSON schema as the SQuAD 2.0 dataset, which means it can operate over multiple articles and paragraphs, displaying one paragraph at a time using the Previous and Next buttons. The user interface makes it easy to monitor and determine the labeling work each annotator needs to complete during the task submission step.

Because the annotation UI is contained in a single Liquid HTML file, you can customize the labeling experience with knowledge of basic JavaScript. You can also modify Liquid tags to pass additional information into the labeling UI, and you can modify the template itself to include more detailed worker instructions.

Estimated costs

Deploying this solution can incur a maximum cost of around $20, not accounting for human labeling costs. Amazon S3, Lambda, SageMaker, and Ground Truth all offer the AWS Free Tier, with charges for additional usage. For more information, see the pricing pages for Amazon S3, Lambda, SageMaker, and Ground Truth.

Prerequisites

To implement this solution, you need an AWS account and a private labeling workforce in Ground Truth.

The following GIF demonstrates how to create a private workforce. For instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.

Launch the CloudFormation Stack

Now that you’ve seen the structure of the solution, you deploy it into your account so you can run an example workflow. All the deployment steps related to the labeling pipeline are managed by AWS CloudFormation. This means AWS CloudFormation creates your pre-annotation and annotation consolidation Lambda functions, as well as an S3 bucket to store input and output data.

You can launch the stack in AWS Region us-east-1 on the AWS CloudFormation console using the Launch Stack button. To launch the stack in a different Region, use the instructions found in the README of the GitHub repository.

Operate the notebook

After the solution has been deployed to your account, a notebook instance named gt-hf-squad-notebook is available in your account. To start operating the notebook, complete the following steps:

  1. On the Amazon SageMaker console, navigate to the notebook instance page.
  2. Choose Open JupyterLab to open the instance.
  3. Inside the instance, browse to the repository hf-gt-custom-qa and open the notebook hf_squad_finetuning.ipynb.
  4. Choose conda_pytorch_p38 as your kernel.

Now that you’ve created a notebook instance and opened the notebook, you can run cells in the notebook to operate the solution. The remainder of this post provides additional details to each section in the notebook as you go along.

Download and inspect the data

The SQuAD dataset contains a training dataset as well as test and development datasets. The notebook downloads the SQuAD2.0 dataset for you, but you can choose which version of SQuAD to use by modifying the notebook cell under Download and inspect the data.

SQuAD was created by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. For more information, refer to the original paper and dataset. SQuAD has been licensed by the authors under the Creative Commons Attribution-ShareAlike 4.0 International Public License.

Let’s look at an example question and answer pair from SQuAD:

Paragraph title: Immune_system

The immune system is a system of many biological structures and processes within an organism that protects against disease. To function properly, an immune system must detect a wide variety of agents, known as pathogens, from viruses to parasitic worms, and distinguish them from the organism’s own healthy tissue. In many species, the immune system can be classified into subsystems, such as the innate immune system versus the adaptive immune system, or humoral immunity versus cell-mediated immunity. In humans, the blood–brain barrier, blood–cerebrospinal fluid barrier, and similar fluid–brain barriers separate the peripheral immune system from the neuroimmune system which protects the brain.

Question: The immune system protects organisms against what?

Answer: disease

Load model

Now that you’ve viewed an example question and answer pair in SQuAD, you can download a model that you can fine-tune for question answering. Hugging Face allows you to easily download a base model that has undergone large-scale pre-training and reinitialize it for a different downstream task. In this case, you download the distilbert-base-uncased model and repurpose it for question answering using the AutoModelForQuestionAnswering class from Hugging Face. You also utilize the AutoTokenizer class to retrieve the model’s pre-trained tokenizer. We dive deeper into the model we use later in the post.
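
The notebook cell may differ slightly, but the loading step is essentially the following sketch:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "distilbert-base-uncased"

# The question answering head on top of the pre-trained encoder is newly initialized
# and gets learned during fine-tuning.
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)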

View BERT input

BERT requires you to transform text data into a numeric representation known as tokens. There are a variety of tokenizers available; the following tokens were created by a tokenizer specifically designed for BERT that you instantiate with a set vocabulary. Each token maps to a word in the vocabulary. Let’s look at the transformed immune system question and context you supply BERT for inference.

{'input_ids': tensor([[    0,   133,  9161,   467, 15899, 28340,   136,    99,   116,     2,
             2,   133,  9161,   467,    16,    10,   467,     9,   171, 12243,
          6609,     8,  5588,   624,    41, 33993,    14, 15899,   136,  2199,
             4,   598,  5043,  5083,     6,    41,  9161,   467,   531, 10933,
            10,  1810,  3143,     9,  3525,     6,   684,    25, 35904,     6,
            31, 21717,     7, 43108, 31483,     6,     8, 22929,   106,    31,
             5, 33993,    18,   308,  2245, 11576,     4,    96,   171,  4707,
             6,     5,  9161,   467,    64,    28,  8967,    88, 44890,    29,
             6,   215,    25,     5, 36154,  9161,   467,  4411,     5, 28760,
          9161,   467,     6,    50, 10080, 15010, 17381,  4411,  3551,    12,
         43728, 17381,     4,    96,  5868,     6,     5,  1925,  2383, 36436,
          9639,     6,  1925,  2383,  1755,   241,  7450,  4182,  6204, 12293,
          9639,     6,     8,  1122, 12293,  2383, 36436,  7926,  2559,     5,
         27727,  9161,   467,    31,     5, 14913, 42866,   467,    61, 15899,
             5,  2900,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Model inference

Now that you’ve seen what BERT takes as input, let’s look at how you can get inference results from the model. The following code demonstrates how to use the previously generated tokenized input and return inference results from the model. Similar to how BERT can’t accept raw text as input, it doesn’t generate raw text as output either. You translate BERT’s output by identifying the start and end points in the paragraph that BERT identified as the answer. Then you map that output back to your tokens, and from the tokens back to English text.

import torch

# Passing start_positions and end_positions is optional; when provided, the outputs also include the loss
outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)

answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Get the most likely beginning and end of the answer with the argmax of the scores
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1

answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)
print(f"Question: {sq['paragraphs'][0]['qas'][0]['question']}")
print(f"Answer: {answer}")

The decoded results are as follows:

Question: The immune system protects organisms against what?

Answer: disease

Augment SQuAD

Next, to obtain additional labeled data, we use a custom worker task template in Ground Truth. We can first create a new article in SQuAD format. The notebook copies this file from the repo to Amazon S3, but feel free to make any edits before running the Augment SQuAD cell. The format of SQuAD is shown in the following code. Each SQuAD JSON file contains multiple articles stored in the data key. Each article has a title field and one or more paragraphs. These paragraphs contain segments of text called context and any associated questions in the qas list. Because we’re annotating from scratch, we can leave the qas list empty and just provide context. The user interface is able to loop across both paragraphs and articles, allowing you to make each worker task as large or small as desired.

s3://<my-bucket-name>/custom_squad.json:

{
  "version": "v2.0",
  "data": [
    {
      "title": "Ground Truth Marketing",
      "paragraphs": [
        {
          "qas": [],
          "context": "SageMaker Ground Truth helps improve the quality of labels through annotation consolidation and audit workflows. Annotation consolidation is the process of collecting label inputs from two or more data labelers and combining them to create a single data label for your machine learning model. With built-in audit and review workflows, workers can perform label verification and make adjustments to improve accuracy."
        },
        {
          "qas": [],
          "context": "SageMaker Ground Truth provides automated labeling features such as ‘auto-segment’, ‘automatic 3D cuboid snapping’, and ‘sensor fusion with 2D video frames’ through an intuitive user interface in order to reduce the time needed for data labeling tasks while also improving quality. For semantic segmentation, workers must label objects in an image. Using the auto-segment feature, workers can capture the object with 4 clicks vs. hundreds."
        },
        {
          "qas": [],
          "context": "SageMaker Ground Truth offers automatic data labeling. Using an active learning model, data is labeled and only routed to humans if the model cannot confidently label it. The human-labeled data is then used to train the machine learning model to improve its' accuracy. As a result, less data is then sent to humans in the next round of labeling which lowers data labeling costs by up to 70%."
        },
        {
          "qas": [],
          "context": "SageMaker Ground Truth provides options to work with labelers inside and outside of your organization. Using SageMaker Ground Truth, you can easily send labeling jobs to your own labelers or you can access a workforce of over 500,000 independent contractors who are already performing machine learning related tasks through Amazon Mechanical Turk. If your data requires confidentiality or special skills, you can use vendors pre-screened by AWS for quality and security procedures, including iVision, CapeStart Inc., Cogito, and iMerit."
        }
      ]
    }
  ]
}

After we generate a sample SQuAD data file, we need to create a Ground Truth augmented manifest file that refers to our input data. We do this by generating a JSON Lines-formatted file with a source key corresponding to the location in Amazon S3 where we stored our input SQuAD data (a sketch for generating this file follows the example):

s3://<my-bucket-name>/input.manifest

{"source": "s3://<my-bucket-name>/custom_squad.json"}
{"source": "s3://<my-bucket-name>/custom_squad_2.json"}
{"source": "s3://<my-bucket-name>/custom_squad_3.json"}

Access labeling portal

After you send the job to Ground Truth, you can view the generated labeling job on the Ground Truth console.

To perform labeling, you need to log in to the worker portal account you created as a part of the prerequisite steps. Your job is available in the worker portal after a few minutes of pre-processing. After opening the task, you’re presented with the custom worker template for Q&A annotation. You can add questions by highlighting sections of text in the context, then choosing Add Question.

Check labeling job status

After submission, you can run the Check labeling job status cell to see if your labeling job is complete. Wait for completion before proceeding to further cells.
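If you prefer to poll programmatically instead of rerunning the cell, the following sketch checks the status with Boto3; job_name is assumed to hold the name of the labeling job created by the notebook:

import boto3

sm = boto3.client("sagemaker")

# Returns InProgress, Completed, Failed, Stopping, or Stopped
status = sm.describe_labeling_job(LabelingJobName=job_name)["LabelingJobStatus"]
print(status)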

Load labeled data

After labeling, the output manifest contains an entry with your label attribute name (in this case squad-1626282229) containing an S3 URI to SQuAD-formatted data that you can use during training. See the following output manifest contents:

{
    "source": "s3://<my-bucket-name>/custom_squad.json",
    "squad-1626282229": {
        "s3Uri": "s3://<my-bucket-name>/.../annotations/responses/0/squad.json"
    },
    "squad-1626282229-metadata": {
        "type": "groundtruth/custom",
        "job-name": "squad-1626282229",
        "human-annotated": "yes",
        "creation-date": "2021-07-14T17:39:24.910000"
    }
}
{
    "source": "s3://<my-bucket-name>/custom_squad_2.json",
    "squad-1626282229": {
        "s3Uri": "s3://<my-bucket-name>/.../annotations/responses/0/squad.json"
    },
    "squad-1626282229-metadata": {
        "type": "groundtruth/custom",
        "job-name": "squad-1626282229",
        "human-annotated": "yes",
        "creation-date": "2021-07-14T17:39:24.910000"
    }
}
{
    "source": "s3://<my-bucket-name>/custom_squad_3.json",
    "squad-1626282229": {
        "s3Uri": "s3://<my-bucket-name>/.../annotations/responses/0/squad.json"
    },
    "squad-1626282229-metadata": {
        "type": "groundtruth/custom",
        "job-name": "squad-1626282229",
        "human-annotated": "yes",
        "creation-date": "2021-07-14T17:39:24.910000"
    }
}

Each line in the manifest corresponds to a single worker task.
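The following is a minimal sketch of reading the output manifest and downloading each referenced SQuAD file; the bucket, manifest key, and label attribute name are placeholders for your job’s values:

import json
import boto3

s3 = boto3.client("s3")

bucket = "<my-bucket-name>"                       # placeholder
output_manifest_key = "<prefix>/output.manifest"  # placeholder: the manifest written by Ground Truth
label_attribute = "squad-1626282229"              # your label attribute name

manifest = s3.get_object(Bucket=bucket, Key=output_manifest_key)["Body"].read().decode("utf-8")

annotations = []
for line in manifest.splitlines():
    record = json.loads(line)
    ann_bucket, ann_key = record[label_attribute]["s3Uri"].replace("s3://", "").split("/", 1)
    body = s3.get_object(Bucket=ann_bucket, Key=ann_key)["Body"].read()
    annotations.append(json.loads(body))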

Load SQuAD train set

Hugging Face has a dataset package that provides you with the ability to download and preprocess SQuAD, but to add our custom questions and answers, we need to do a bit of processing. SQuAD is structured around sets of topics. Each topic has a variety of different context statements and each context statement has question and answer pairs. Because we want to create our own questions for training, we need to combine our questions with SQuAD. Luckily for us, our annotations are already in SQuAD format, so we can take our example labels and append them as a new topic to the existing SQuAD data.
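The following sketch shows the append step, assuming the annotation files follow the same SQuAD structure as the input file shown earlier and are already loaded into the annotations list from the previous step:

import json

with open("train-v2.0.json") as f:
    squad = json.load(f)

# Append each annotated article as a new topic in the SQuAD training data
for annotation in annotations:
    squad["data"].extend(annotation["data"])

with open("train-v2.0-augmented.json", "w") as f:
    json.dump(squad, f)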

Create a Hugging Face Dataset object

To get our data into Hugging Face’s dataset format, we have several options. We can use the load_dataset function, in which case we can supply a CSV, JSON, or text file that is loaded as a dataset object. We can also supply load_dataset with a processing script to convert our file into the desired format. For this post, we instead use the Dataset.from_dict() method, which allows us to supply an in-memory dictionary to create a dataset object. We also define our dataset features. We can view the features by using Hugging Face’s dataset viewer, as shown in the following screenshot.

Our features are as follows:

  • ID – The ID of the text
  • title – The associated title for the topic
  • context – The context statement the model must search to find an answer
  • question – The question the model is being asked
  • answer – The accepted answer text and location in the context statement

Hugging Face datasets easily allow us to define this schema:

squad_dataset = Dataset.from_dict(
    dataset_dict,
    features=datasets.Features(
        {
            "id": datasets.Value("string"),
            "title": datasets.Value("string"),
            "context": datasets.Value("string"),
            "question": datasets.Value("string"),
            "answers": datasets.features.Sequence(
                {
                    "text": datasets.Value("string"),
                    "answer_start": datasets.Value("int32"),
                }
            ),
        }
    ),
)

After we create our dataset object, we have to tokenize the text. Because models can’t accept raw text as input, we need to convert our text into a numeric representation it can understand, otherwise known as tokenization. Tokenization is model specific, so let’s understand the model we’re going to fine-tune. We’re using a distilbert-base-uncased model. It looks very similar to BERT: it uses input embeddings, multi-head attention (for more information about this operation, refer to The Illustrated Transformer), and feed-forward layers, but has half the number of layers of the original BERT base model and roughly 40% fewer parameters. See the following initial model layers:

Let’s break down each component of the model’s name. The name distilbert denotes that this is a distilled version of the BERT base model, obtained through a process called knowledge distillation. Knowledge distillation allows us to train a smaller student model not only on the training data but also on the responses of a larger pre-trained teacher model to the same training set. base refers to the size of the model; in this case, the model was distilled from a BERT base model (as opposed to a BERT large model). uncased refers to the text it was trained on; the training text didn’t account for case, so all of it was lowercase. The uncased aspect directly affects the way we tokenize our text. Thankfully, in addition to providing easy access to downloading transformer models, Hugging Face also provides each model’s accompanying tokenizer. We downloaded the tokenizer for our distilbert-base-uncased model, which we now use to transform our text:

# Model to fine-tune
model_name = "distilbert-base-uncased"

# Load the pre-trained model and its accompanying tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set the model to evaluation mode for inference
model.eval()

Another feature of the dataset class is that it allows us to run preprocessing and tokenization in parallel with its map function. We define a processing function and then pass it to the map method; a sketch of such a function follows the list of required components below.

For question answering, Hugging Face needs several components (which are also defined in the glossary):

  • attention_mask – A mask indicating to the model which tokens to pay attention to, used primarily for differentiating between actual text and padding tokens
  • start_positions – The start position of the answer in the text
  • end_positions – The end position of the answer in the text
  • input_ids – The token indices mapping the tokens to the vocabulary
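The following is a simplified sketch of such a preprocessing function, modeled on the standard Hugging Face question answering example; the exact implementation in the notebook may differ:

def prepare_train_features(examples, tokenizer, max_length, doc_stride):
    # Tokenize question/context pairs, splitting long contexts into overlapping spans
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    tokenized["start_positions"] = []
    tokenized["end_positions"] = []
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized.sequence_ids(i)
        answers = examples["answers"][sample_mapping[i]]

        if len(answers["answer_start"]) == 0:
            # Unanswerable question: point both positions at the CLS token
            tokenized["start_positions"].append(cls_index)
            tokenized["end_positions"].append(cls_index)
            continue

        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])

        # Find the first and last context token in this span
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
            # The answer is outside this span
            tokenized["start_positions"].append(cls_index)
            tokenized["end_positions"].append(cls_index)
        else:
            # Move the indices to the token boundaries of the answer
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                token_start_index += 1
            tokenized["start_positions"].append(token_start_index - 1)
            while offsets[token_end_index][1] >= end_char:
                token_end_index -= 1
            tokenized["end_positions"].append(token_end_index + 1)
    return tokenized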

Our tokenizer will tokenize the text, but we need to explicitly capture the start and end positions of our answer, which is why we define a custom preprocessing function like the one sketched above. Now that we have our inputs ready, let’s start training!

Launch training job

We can run training in our notebook, but the types of instances we need to train our Q&A model in a reasonable amount of time, p3 and p4 instances, are rather powerful. These instances tend to be overkill for running a notebook or as a persistent Amazon Elastic Compute Cloud (Amazon EC2) instance. This is where SageMaker training comes in. SageMaker training allows you to launch a training job on a specified instance or instances that are only up for the duration of the training job. This allows us to run on larger instances like the p4d.24xlarge, with 8 NVIDIA A100 GPUs, but without worrying about running up a huge bill in case we forget to turn it off. It also gives us easy access to other SageMaker functionalities, like SageMaker Experiments for tracking your ML training runs and SageMaker Debugger for understanding and profiling your training jobs.

Local training

Let’s start by understanding how training a model in Hugging Face works locally, then go over the adjustments we make to run it in SageMaker.

Hugging Face makes training easy through the use of their trainer class. The trainer class allows us to pass in our model, our train and validation datasets, our hyperparameters, and even our tokenizer. Because we already have our model as well as our training and validation sets, we only need to define our hyperparameters. We can do this through the TrainingArguments class. This allows us to specify things like the learning rate, batch size, number of epochs, and more in-depth parameters like weight decay or a learning rate scheduling strategy. After we define our TrainingArguments, we can pass in our model, training set, validation set, and arguments to instantiate our trainer class. Then we can simply call trainer.train() to start training our model. The following code block demonstrates how to run local training:

doc_stride = 128
max_length = 512

tokenized_train = squad_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=squad_dataset.column_names,
    fn_kwargs={"tokenizer": tokenizer, "max_length": max_length, "doc_stride": doc_stride},
)
tokenized_test = squad_test.map(
    prepare_train_features,
    batched=True,
    remove_columns=squad_test.column_names,
    fn_kwargs={"tokenizer": tokenizer, "max_length": max_length, "doc_stride": doc_stride},
)

hf_args = TrainingArguments(
    'test_local',
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.0001,
)

trainer = Trainer(
    model,
    hf_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)

trainer.train()

Send data to S3

Doing the same thing in SageMaker training is straightforward. The first step is putting our data in Amazon S3 so that our model can access it. SageMaker training allows you to specify a data source; you can use sources like Amazon S3, Amazon Elastic File System (Amazon EFS), or Amazon FSx for Lustre for high-performance data ingestion. In our case, our augmented SQuAD dataset isn’t particularly large, so Amazon S3 is a good choice. We upload our training data to a folder in Amazon S3 and when SageMaker spins up our training instance, it downloads the data from our specified location.
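The following is a minimal sketch of the upload using the SageMaker Python SDK; the bucket, file names, and key prefixes are assumptions you should adjust to your notebook:

import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()  # assumption: the notebook may use a different bucket

# Upload the augmented training data and the dev set; SageMaker downloads them onto the training instance
train_s3_uri = sess.upload_data("train-v2.0-augmented.json", bucket=bucket, key_prefix="squad/train")
test_s3_uri = sess.upload_data("dev-v2.0.json", bucket=bucket, key_prefix="squad/test")
print(train_s3_uri, test_s3_uri)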

Instantiate the model

To launch our training job, we can use the built-in Hugging Face estimator in the SageMaker SDK. SageMaker uses the estimator class to define the parameters for a training job as well as the number and type of instances to use for training. SageMaker training is built around the use of Docker containers. You can use the default containers in SageMaker or supply your own custom container for training. In the case of Hugging Face models, SageMaker has built-in Hugging Face containers with all the dependencies you need to run Hugging Face training jobs. All we need to do is define our training script, which our Hugging Face container uses as its entry point.

In this training script, we define our arguments, which we pass to our entry point in the form of a set of hyperparameters, as well as our training code. Our training code is the same as if we were running it locally; we can simply use the TrainingArguments and then pass them to a trainer object. The only difference is we need to specify the output location for our model to be in /opt/ml/model so that SageMaker training can take it, package it, and send it to Amazon S3. The following code block shows how to instantiate our Hugging Face estimator:

# hyperparameters, which are passed into the training job
hyperparameters={
    'model_name': model_name,
    'dataset_name':'squad',
    'do_train': True,
    'do_eval': True,
    'fp16': True,
    'train_batch_size': 32,
    'eval_batch_size': 32,
    'weight_decay':0.01,
    'warmup_steps':500,
    'learning_rate':5e-5,
    'epochs': 2,
    'max_length': 384,
    'max_steps': 100,
    'pad_to_max_length': True,
    'doc_stride': 128,
    'output_dir': '/opt/ml/model'
}

# estimator
huggingface_estimator = HuggingFace(entry_point='run_qa.py',
    source_dir='container_training',
    metric_definitions=metric_definitions,
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    volume_size=100,
    role=role,
    transformers_version='4.4.2',
    pytorch_version='1.6.0',
    py_version='py36',
    hyperparameters = hyperparameters)

Fine-tune the model

For our specific training job, we use a p3.8xlarge instance consisting of 4 V100 GPUs. The trainer class automatically supports training on multi-GPU instances, so we don’t need any additional setup to account for this. We train our model for two epochs, with a batch size of 16 and a learning rate of 4e-5. We also enable mixed precision training, which reduces numerical precision in areas where it doesn’t impact our model’s accuracy. This increases our available memory and training speed. To launch the training job, we call the fit method from our huggingface_estimator class.
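The fit call below expects a mapping of channel names to the S3 locations uploaded earlier; a minimal sketch follows, where the channel names are assumptions that must match what run_qa.py expects:

data_channels = {
    "train": train_s3_uri,  # S3 URI from the earlier upload step
    "test": test_s3_uri,
}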

huggingface_estimator.fit(data_channels, wait=False, job_name=f'hf-distilbert-squad-{int(time.time())}')

When our model is done training, we can download the model locally and load it into our notebook’s memory to test it, which is demonstrated in the notebook. We will focus on another option, deploying it as a SageMaker endpoint!

Deploy trained model

In addition to providing utilities for training, SageMaker can also allow data scientists and ML engineers to easily deploy REST endpoints for their trained models. You can deploy models trained in or outside of SageMaker. For more information, refer to Deploy a Model in Amazon SageMaker.

Because our model was trained in SageMaker, it’s already in the correct format to deploy as an endpoint. Similar to training, we define a SageMaker model class that defines the model, serving code, and the number and type of instances we want to deploy as endpoints. Also similar to training, serving is based on Docker containers, and we can use either of the built-in SageMaker containers or supply our own. For this post, we use a built-in PyTorch serving container, so we simply need to define a few things to get our endpoint up and running. Our serving code needs four functions:

  • model_fn – Defines how the endpoint loads the model (it only does this once, and then keeps it in memory for subsequent predictions)
  • input_fn – Defines how the input is deserialized and processed
  • predict_fn – Defines how our model makes predictions on our input
  • output_fn – Defines how the endpoint formats and sends back the output data to the client making the request
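The following is a hedged sketch of what container_serving/transform_script.py could look like; it assumes a JSON payload with question and context keys, whereas the actual script in the repository may use a different input format:

# container_serving/transform_script.py -- a sketch; payload format and paths are assumptions
import json

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer


def model_fn(model_dir):
    # Load the fine-tuned model and tokenizer once; SageMaker keeps them in memory
    model = AutoModelForQuestionAnswering.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model.eval()
    return {"model": model, "tokenizer": tokenizer}


def input_fn(request_body, request_content_type):
    # Assumes a JSON document with "question" and "context" keys
    assert request_content_type == "application/json"
    return json.loads(request_body)


def predict_fn(input_data, model_artifacts):
    model, tokenizer = model_artifacts["model"], model_artifacts["tokenizer"]
    inputs = tokenizer(input_data["question"], input_data["context"], return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits) + 1
    return tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start:end])
    )


def output_fn(prediction, accept):
    # Return the predicted answer as a plain string
    return str(prediction)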

After we define these functions, we can deploy our endpoint and pass it context statements and questions and return its predicted answer:

endpoint_name = 'hf-distilbert-QA-string-endpoint4-185'
model_data = f"{huggingface_estimator.output_path}{huggingface_estimator.jobs[0].job_name}/output/model.tar.gz"

# We are going to use a SageMaker serving container
torch_model = PyTorchModel(
    model_data=model_data,
    source_dir='container_serving',
    entry_point='transform_script.py',
    role=role,
    framework_version='1.8.1',
    py_version='py3',
    predictor_cls=StringPredictor,
)

bert_end = torch_model.deploy(
    instance_type='ml.m5.2xlarge',  # a GPU instance such as 'ml.g4dn.xlarge' also works
    initial_instance_count=1,
    endpoint_name=endpoint_name,
)

Visualize model results

Because we deployed a SageMaker endpoint that accepts context statements and questions and returns answers, we can go back and visualize the resulting inferences within the original SQuAD viewer to better see what our model found in the passage context. We do this by reformatting the inference results back into SQuAD format, then replacing the Liquid tags in the worker template with the SQuAD-formatted JSON. We can then render the resulting worker template in an iframe to iteratively review results within the context of a single notebook, as shown in the following screenshot. You can choose each question on the left to highlight the matching spans of text on the right. With no question selected, all text spans on the right are highlighted, as shown below.

Clean up

To avoid incurring future charges, run the Clean up section of the notebook to delete all the resources, including the SageMaker endpoints, the S3 objects that contain the raw and processed datasets, and the CloudFormation stack. When the deletion is complete, make sure to stop and delete the notebook instance that is hosting the current notebook script.

Conclusion

In this post, you learned how to create your own question answering dataset using Ground Truth and combine it with SQuAD to train and deploy your own question answering model using SageMaker. After you complete the notebook, you have a deployed SageMaker endpoint that was trained on your custom Q&A dataset. This endpoint is ready for integration into your production NLU workflows, because SageMaker endpoints are available through standard REST APIs. You also have an annotated custom dataset in SQuAD 2.0 format, which allows you to retrain your existing model or try training other question answering model architectures. Finally, you have a mechanism to quickly visualize the results from your inference by loading the worker template in your local notebook.

Try out the notebook, augment it with your own questions, and train and deploy your own custom question answering model for your NLU use cases!

Happy building!


About the Authors

Jeremy Feltracco is a Software Development Engineer with the Amazon ML Solutions Lab at Amazon Web Services. He uses his background in computer vision, robotics, and machine learning to help AWS customers accelerate their AI adoption.

Vidya Sagar Ravipati is a Manager at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.

Isaac Privitera is a Senior Data Scientist at the Amazon Machine Learning Solutions Lab, where he develops bespoke machine learning and deep learning solutions to address customers’ business problems. He works primarily in the computer vision space, focusing on enabling AWS customers with distributed training and active learning.

Read More

Use custom vocabulary in Amazon Lex to enhance speech recognition

In our daily conversations, we come across new words or terms that we may not know. Perhaps these are related to a new domain that we’re just getting familiar with, and we pick these up as we understand more about the domain. For example, home loan terminology (“curtailment”), shortened words, (“refi”, “comps”), and acronyms (“HELOC”) are likely new to first-time home buyers. Similarly, brand names can be tricky when you first hear of them. For example, with the sentence “I’d like to get a Cyetus 7601”, unless you’re a coffee aficionado, you may not realize the speaker is talking about a coffee machine brand. Some of these words may not be even present in the dictionary. Once we understand their usage, meaning, and spelling, they become part of our vocabulary so we can have effective conversations.

Similarly, bots have to understand and add such words to their vocabulary. Starting today, you can give Amazon Lex additional information about how to process speech input by creating a custom vocabulary. You can create a list with domain-specific words or brand names for improved speech recognition and end-user experience.

In this post, we review the custom vocabulary capability and use it for home loan customer service transactions and queries.

Overview of the custom vocabulary capability

You define the custom vocabulary for a language in the bot. In this case, we add brand names and domain-specific words and phrases in English (GB). The bot uses these phrases for transcription of speech input when it’s trying to understand the user’s request (or intent classification). You can also augment the custom vocabulary with information you have to collect (or slot elicitation) by configuring a setting on the slot type. You can specify the degree of boosting for each word or phrase by configuring a weight of 1, 2, or 3. We recommend you try out different weights. Words that are usually difficult to recognize or pronounce should have a higher weight.

The following is an example conversation with our bot:

Virtual Agent: Welcome to ACME bank. How can I help you?

Caller: Do you offer CapitaRise loans?

Virtual Agent: We offer CapitaRise loans for personal purposes. Do you want to get an application started?

Caller: No, thanks.

Virtual Agent: Is there anything else I can help you with today?

Caller: How about your refi rates?

Virtual Agent: For refinancing, we offer 4.5% with no closing costs for a 30-year fixed. Other information you might be interested in?

Caller: Yes, how about SBA?

Virtual Agent: We offer a variety of small business admin loans. You can find more information on our website. Anything else I can do for you?

Caller: I would like to do a payment on my HELOC.

Virtual Agent: To proceed with a payment, what is the account ID?

Caller: 1234

Virtual Agent: Thanks. What’s the payment amount?

Caller: 120

Virtual Agent: Thank you. The payment should be processed within 1 to 2 business days. Anything else I can help you with?

Caller: No, that’s it.

Virtual Agent: Thank you. Have a great day.

Solution overview

Let’s review the overall architecture for the solution (see the following diagram):

  • We use an Amazon Lex bot integrated with an Amazon Connect contact flow to deliver the conversational experience
  • We define the custom vocabulary for the English (GB) language by adding words such as “CapitaRise,” “HELOC,” and “refi”, along with weights
  • After the caller is authenticated, the control is passed to the bot to perform transactions (for example, to process payment)

The custom vocabulary file is a tab-separated list of values that contain the phrase to recognize and a weight to give the boost. Phrases with a higher boost value are more likely to be used when they appear in the audio input.

phrase	weight
CapitaRise	3
HELOC	2
Refi	2
S. B. A.	1

Deploy the sample Amazon Lex bot

To create the sample bot and configure the custom vocabulary, perform the following steps. This creates an Amazon Lex bot called FinanceBot, with intents PersonalLoan, BusinessLoan, InterestRateRefinancing, InterestRateCredit, Payment, Welcome, and Goodbye, as well as two slot types (accountNumber and confirmationSlot).

  1. Download the Amazon Lex bot.
  2. On the Amazon Lex console, choose Actions, Import.
  3. Choose the FinanceBot.zip file that you downloaded, and choose Import.
  4. In the IAM Permissions section, for Runtime role, choose Create a new role with basic Amazon Lex permissions.
  5. On the Amazon Lex console, navigate to the bot FinanceBot.
  6. Download the .zip file with the phrases that you want to add to the custom vocabulary.
  7. On the bot detail page, in the Add languages section, choose View languages.
  8. From the list of languages, choose English (GB).
  9. In the Custom vocabulary section, choose Import.
  10. Browse to the file to import, enter a password if necessary, and then choose Import.
  11. Choose Build.
  12. Download the supporting AWS Lambda code.
  13. On the Lambda console, create a new function and select Author from scratch.
  14. For Function name, enter FinanceBotEnglish.
  15. For Runtime, choose Python 3.8.
  16. Choose Create function.
  17. In the Code source section, open lambda_function.py and delete the existing code.
  18. Download the code and open it in a text editor.
  19. Copy and paste the code into the empty lambda_function.py tab.
  20. Choose Deploy.
  21. On the Amazon Lex console, open FinanceBot.
  22. Choose Deployment and then Aliases, followed by TestBotAlias.
  23. On the Aliases page, in the Languages section, navigate to English (GB).
  24. For Source, select FinanceBotEnglish.
  25. For Lambda version or alias, enter $LATEST.
  26. On the Amazon Connect console, choose Contact flows.
  27. Download the contact flow to integrate with the Amazon Lex bot.
  28. In the Amazon Lex section, select your Amazon Lex bot and make it available for use in the Amazon Connect contact flows.
  29. Select the contact flow to load it into the application.
  30. Make sure the right bot is configured in the “Get Customer Input” block.
  31. Choose a queue in the “Set working queue” block.
  32. Add a phone number to the contact flow.
  33. Test the IVR flow by calling in to the phone number.

Test the solution

You can call in to the Amazon Connect phone number and interact with the bot.
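Before dialing in, you can also run a quick programmatic check of the intent routing with the AWS SDK for Python (Boto3); note that this exercises the text path only, so it doesn’t test the custom vocabulary’s effect on speech recognition. The bot ID and alias ID are placeholders for your deployed FinanceBot:

import uuid
import boto3

client = boto3.client("lexv2-runtime")

response = client.recognize_text(
    botId="<your-bot-id>",             # placeholder
    botAliasId="<your-bot-alias-id>",  # placeholder: the TestBotAlias ID
    localeId="en_GB",
    sessionId=str(uuid.uuid4()),
    text="How about your refi rates?",
)
print(response["interpretations"][0]["intent"]["name"])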

Conclusion

Custom vocabulary enables improved recognition of domain-specific words and brand names for speech modality. You can easily define the custom vocabulary for your Amazon Lex bot and augment it to the bot definition. With improved recognition, you can enable more effective conversations across a broader set of use cases. You can configure custom vocabulary using the Amazon Lex V2 console or via the API. The capability is available for English (US) and English (GB) in all AWS Regions where Amazon Lex operates. To learn more, refer to custom vocabulary documentation.


About the Authors

Kai Loreck is a professional services Amazon Connect consultant. He works on designing and implementing scalable customer experience solutions. In his spare time, he can be found playing sports, snowboarding, or hiking in the mountains.

Anubhav Mishra is a Product Manager with AWS. He spends his time understanding customers and designing product experiences to address their business challenges.

Mebz Qazi is a Senior Consultant working on global projects for AWS. He very much enjoys working on technological innovation in natural language and AI/ML.

Sravan Bodapati is an Applied Science Manager at AWS Lex. He focuses on building cutting edge Artificial Intelligence and Machine Learning solutions for AWS customers in ASR and NLP space. In his spare time, he enjoys hiking, learning economics, watching TV shows and spending time with his family.

Read More

Predict customer churn with no-code machine learning using Amazon SageMaker Canvas

Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. But losing customers (also called customer churn) is always a risk, and insights into why customers leave can be just as important for maintaining revenues and profits. Machine learning (ML) can help with insights, but up until now you needed ML experts to build models to predict churn, the lack of which could delay insight-driven actions by businesses to retain customers.

In this post, we show you how business analysts can build a customer churn ML model with Amazon SageMaker Canvas, no code required. Canvas provides business analysts with a visual point-and-click interface that allows you to build models and generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.

Overview of solution

For this post, we assume the role of a marketing analyst in the marketing department of a mobile phone operator. We have been tasked with identifying customers that are potentially at risk of churning. We have access to service usage and other customer behavior data, and want to know if this data can help explain why a customer would leave. If we can identify factors that explain churn, then we can take corrective actions to change predicted behavior, such as running targeted retention campaigns.

To do this, we use the data we have in a CSV file, which contains information about customer usage and churn. We use Canvas to perform the following steps:

  1. Import the churn dataset from Amazon Simple Storage Service (Amazon S3).
  2. Train and build the churn model.
  3. Analyze the model results.
  4. Test predictions against the model.

For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The attributes are as follows:

  • State – The US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
  • Account Length – The number of days that this account has been active
  • Area Code – The three-digit area code of the customer’s phone number
  • Phone – The remaining seven-digit phone number
  • Int’l Plan – Whether the customer has an international calling plan (yes/no)
  • VMail Plan – Whether the customer has a voice mail feature (yes/no)
  • VMail Message – The average number of voice mail messages per month
  • Day Mins – The total number of calling minutes used during the day
  • Day Calls – The total number of calls placed during the day
  • Day Charge – The billed cost of daytime calls
  • Eve Mins, Eve Calls, Eve Charge – The number of minutes, number of calls, and billed cost for evening calls
  • Night Mins, Night Calls, Night Charge – The number of minutes, number of calls, and billed cost for nighttime calls
  • Intl Mins, Intl Calls, Intl Charge – The number of minutes, number of calls, and billed cost for international calls
  • CustServ Calls – The number of calls placed to customer service
  • Churn? – Whether the customer left the service (true/false)

The last attribute, Churn?, is the attribute that we want the ML model to predict. The target attribute is binary, meaning our model predicts the output as one of two categories (True or False).

Prerequisites

A cloud admin with an AWS account with appropriate permissions is required to complete the following prerequisites:

Create a customer churn model

First, let’s download the churn dataset and review the file to make sure all the data is there. Then complete the following steps:

  1. Sign in to the AWS Management Console, using an account with the appropriate permissions to access Canvas.
  2. Log in to the Canvas console.

This is where we can manage our datasets and create models.

  1. Choose Import.

Canvas Import Button Select

  1. Choose Upload and select the churn.csv file.
  2. Choose Import data to upload it to Canvas.

Canvas select data from s3

The import process takes approximately 10 seconds (this can vary depending on dataset size). When it’s complete, we can see the dataset is in Ready status.

Canvas Ready Dataset

  1. To preview the first 100 rows of the dataset, hover your mouse over the eye icon.

Canvas View Dataset

A preview of the dataset appears. Here we can verify that our data is correct.

Canvas Verify Data

After we confirm that the imported dataset is ready, we create our model.

  1. Choose New model.

Canvas New Models

  1. Select the churn.csv dataset and choose Select dataset.

Canvas Select Dataset

Now we configure the build model process.

  1. For Target columns, choose the Churn? column.

For Model type, Canvas automatically recommends the model type, in this case 2 category prediction (what a data scientist would call binary classification). This is suitable for our use case because we have only two possible prediction values: True or False, so we go with the recommendation Canvas made.

Canvas Build Model

We now validate some assumptions. We want to get a quick view into whether our target column can be predicted by the other columns. We can get a fast view into the model’s estimated accuracy and column impact (the estimated importance of each column in predicting the target column).

  1. Select all 21 columns and choose Preview model.

This feature uses a subset of our dataset and only a single pass at modeling. For our use case, the preview model takes approximately 2 minutes to build.

Canvas Preview Model

As shown in the following screenshot, the Phone and State columns have much less impact on our prediction. We want to be careful when removing text input because it can contain important discrete, categorical features contributing to our prediction. Here, the phone number is just the equivalent of an account number—not of value in predicting other accounts’ likelihood of churn, and the customer’s state doesn’t impact our model much.

  1. We remove these columns because they have no major feature importance.

Canvas Feature Engineering

  2. After we remove the Phone and State columns, let’s run the preview again.

As shown in the following screenshot, the model accuracy increased by 0.1%. Our preview model has a 95.9% estimated accuracy, and the columns with the biggest impact are Night Calls, Eve Mins, and Night Charge. This gives us insight into which columns impact the performance of our model the most. We need to be careful when doing feature selection: if a single feature has an extremely large impact on a model’s outcome, it’s a primary indicator of target leakage, meaning the feature may not be available at the time of prediction. In this case, a few columns showed very similar impact, so we continue to build our model.

Canvas Feature Engineering After

Canvas offers two build options:

  • Standard build – Builds the best model from an optimized process powered by AutoML; speed is exchanged for greatest accuracy
  • Quick build – Builds a model in a fraction of the time compared to a standard build; potential accuracy is exchanged for speed
  1. For this post, we choose the Standard build option because we want the very best model and are willing to spend additional time waiting for the result.

Canvas Standard Build

The build process can take 2–4 hours. During this time, Canvas tests hundreds of candidate pipelines, selecting the best model to present to us. In the following screenshot, we can see the expected build time and progress.

Canvas Analyze Model

Evaluate model performance

When the model building process is complete, the model correctly predicted churn 97.9% of the time. This seems fine, but as analysts we want to dive deeper and see whether we can trust the model enough to make decisions based on it. On the Scoring tab, we can review a visual plot of our predictions mapped to their outcomes. This gives us deeper insight into our model.

Canvas separates the dataset into training and test sets. The training dataset is the data Canvas uses to build the model. The test set is used to see if the model performs well with new data. The Sankey diagram in the following screenshot shows how the model performed on the test set. To learn more, refer to Evaluating Your Model’s Performance in Amazon SageMaker Canvas.

Canvas Analyze Model Score

To get more detailed insights beyond what is displayed in the Sankey diagram, business analysts can use a confusion matrix analysis for their business solutions. For example, we want to better understand the likelihood of the model making false predictions. We can see this in the Sankey diagram, but want more insights, so we choose Advanced metrics. We’re presented with a confusion matrix, which displays the performance of the model in a visual format with the following values, specific to the positive class. Because we’re measuring whether a customer will in fact churn, our positive class is True in this example:

  • True Positive (TP) – The number of True results that were correctly predicted as True
  • True Negative (TN) – The number of False results that were correctly predicted as False
  • False Positive (FP) – The number of False results that were wrongly predicted as True
  • False Negative (FN) – The number of True results that were wrongly predicted as False

We can use this matrix chart to determine not only how accurate our model is, but also how often it is wrong and in which direction.

Canvas F1 Matrix
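If you want to relate these counts to the metrics Canvas reports, the following sketch computes accuracy, precision, recall, and F1 from hypothetical confusion matrix counts; the numbers are illustrative only:

# Hypothetical counts read off a confusion matrix; replace with your own values
tp, tn, fp, fn = 120, 830, 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # Of the customers predicted to churn, how many actually did
recall = tp / (tp + fn)     # Of the customers who churned, how many the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")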

The advanced metrics look good, so we can trust the model’s results. We see very few false positives and false negatives. These occur when the model thinks a customer in the dataset will churn and they actually don’t (false positive), or when the model thinks the customer won’t churn and they actually do (false negative). High numbers for either might make us reconsider whether we should use the model to make decisions.

Let’s go back to the Overview tab to review the impact of each column. This information can help the marketing team gain insights that lead to actions that reduce customer churn. For example, we can see that both low and high CustServ Calls increase the likelihood of churn. The marketing team can take actions to prevent customer churn based on these learnings. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up.

Our model looks pretty accurate. We can directly perform an interactive prediction on the Predict tab, either in batch or single (real-time) prediction. In this example, we made a few changes to certain column values and performed a real-time prediction. Canvas shows us the prediction result along with the confidence level.

Canvas Predict Inference

Let’s say we have an existing customer who has the following usage: Night Mins is 40 and Eve Mins is 40. We can run a prediction, and our model returns a confidence score of 93.2% that this customer will churn (True). We might now choose to provide promotional discounts to retain this customer.

Running one prediction is great for individual what-if analysis, but we also need to run predictions on many records at once. Canvas is able to run batch predictions, which allows you to run predictions at scale.

Conclusion

In this post, we showed how a business analyst can create a customer churn model with SageMaker Canvas using sample data. Canvas allows your business analysts to create accurate ML models and generate predictions using a no-code, visual, point-and-click interface. A marketing analyst can now use this information to run targeted retention campaigns and test new campaign strategies faster, leading to a reduction in customer churn.

Analysts can take this to the next level by sharing their models with data scientist colleagues. The data scientists can view the Canvas model in Amazon SageMaker Studio, where they can explore the choices Canvas AutoML made, validate model results, and even productionalize the model with a few clicks. This can accelerate ML-based value creation and help scale improved outcomes faster.

To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts.


About the Author

Henry Robalino is a Solutions Architect at AWS, based out of NJ. He is passionate about cloud and machine learning, and the role they can play in society. He achieves this by working with customers to help them achieve their business goals using the AWS Cloud. Outside of work, you can find Henry traveling or exploring the outdoors with his fur daughter Arly.

Chaoran Wang is a Solution Architect at AWS, based in Dallas, TX. He has been working at AWS since graduating from the University of Texas at Dallas in 2016 with a master’s in Computer Science. Chaoran helps customers build scalable, secure, and cost-effective applications and find solutions to solve their business challenges on the AWS Cloud. Outside work, Chaoran loves spending time with his family and two dogs, Biubiu and Coco.

Read More

Deploy and manage machine learning pipelines with Terraform using Amazon SageMaker

AWS customers are relying on Infrastructure as Code (IaC) to design, develop, and manage their cloud infrastructure. IaC ensures that customer infrastructure and services are consistent, scalable, and reproducible, while being able to follow best practices in the area of development operations (DevOps).

One possible approach to manage AWS infrastructure and services with IaC is Terraform, which allows developers to organize their infrastructure in reusable code modules. This aspect is increasingly gaining importance in the area of machine learning (ML). Developing and managing ML pipelines, including training and inference with Terraform as IaC, lets you easily scale for multiple ML use cases or Regions without having to develop the infrastructure from scratch. Furthermore, it provides consistency for the infrastructure (for example, instance type and size) for training and inference across different implementations of the ML pipeline. This lets you route requests and incoming traffic to different Amazon SageMaker endpoints.

In this post, we show you how to deploy and manage ML pipelines using Terraform and Amazon SageMaker.

Solution overview

This post provides code and walks you through the steps necessary to deploy AWS infrastructure for ML pipelines with Terraform for model training and inference using Amazon SageMaker. The ML pipeline is managed via AWS Step Functions to orchestrate the different steps implemented in the ML pipeline, as illustrated in the following figure.

Step Function Steps

Step Functions starts an AWS Lambda function, generating a unique job ID, which is then used when starting a SageMaker training job. Step Functions also creates a model, endpoint configuration, and endpoint used for inference. Additional resources include the following:

The ML-related code for training and inference with a Docker image relies mainly on existing work in the following GitHub repository.

The following diagram illustrates the solution architecture:

Architecture Diagram

We walk you through the following high-level steps:

  1. Deploy your AWS infrastructure with Terraform.
  2. Push your Docker image to Amazon ECR.
  3. Run the ML pipeline.
  4. Invoke your endpoint.

Repository structure

You can find the repository containing the code and data used for this post in the following GitHub repository.

The repository includes the following directories:

  • /terraform – Consists of the following subfolders:

    • ./infrastructure – Contains the main.tf file calling the ML pipeline module, in addition to variable declarations that we use to deploy the infrastructure
    • ./ml-pipeline-module – Contains the Terraform ML pipeline module, which we can reuse
  • /src – Consists of the following subfolders:

    • ./container – Contains example code for training and inference with the definitions for the Docker image
    • ./lambda_function – Contains the Python code for the Lambda function generating configurations, such as a unique job ID for the SageMaker training job
  • /data – Contains the following file:

    • ./iris.csv – Contains data for training the ML model

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploy your AWS infrastructure with Terraform

To deploy the ML pipeline, you need to adjust a few variables and names according to your needs. The code for this step is in the /terraform directory.

When initializing for the first time, open the file terraform/infrastructure/terraform.tfvars and adjust the variable project_name to the name of your project, in addition to the variable region if you want to deploy in another Region. You can also change additional variables such as instance types for training and inference.

Then use the following commands to deploy the infrastructure with Terraform:

export AWS_PROFILE=<your_aws_cli_profile_name>
cd terraform/infrastructure
terraform init
terraform plan
terraform apply

Check the output and make sure that the planned resources appear correctly, and confirm with yes in the apply stage if everything is correct. Then go to the Amazon ECR console (or check the output of Terraform in the terminal) and get the URL for your ECR repository that you created via Terraform.

The output should look similar to the following displayed output, including the ECR repository URL:

Apply complete! Resources: 19 added, 0 changed, 0 destroyed.

Outputs:

ecr_repository_url = <account_number>.dkr.ecr.eu-west-1.amazonaws.com/ml-pipeline-terraform-demo

Push your Docker image to Amazon ECR

For the ML pipeline and SageMaker to train and provision a SageMaker endpoint for inference, you need to provide a Docker image and store it in Amazon ECR. You can find an example in the directory src/container. If you have already applied the AWS infrastructure from the earlier step, you can push the Docker image as described. After your Docker image is developed, you can take the following actions and push it to Amazon ECR (adjust the Amazon ECR URL according to your needs):

cd src/container
export AWS_PROFILE=<your_aws_cli_profile_name>
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin <account_number>.dkr.ecr.eu-west-1.amazonaws.com
docker build -t ml-training .
docker tag ml-training:latest <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>:latest
docker push <account_number>.dkr.ecr.eu-west-1.amazonaws.com/<ecr_repository_name>

If you have already applied the AWS infrastructure with Terraform, you can push the changes of your code and Docker image directly to Amazon ECR without deploying via Terraform again.

Run the ML pipeline

To train and run the ML pipeline, go to the Step Functions console and start a new execution of the state machine. You can check the progress of each step in the visualization of the state machine. You can also check the SageMaker training job progress and the status of your SageMaker endpoint.

Start Step Function
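You can also start the pipeline programmatically with the AWS SDK for Python (Boto3); the state machine ARN is a placeholder for the one created by Terraform:

import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN: replace with the state machine created by Terraform
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:<region>:<account_id>:stateMachine:<state_machine_name>"
)
print(response["executionArn"])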

After you successfully run the state machine in Step Functions, you can see that the SageMaker endpoint has been created. On the SageMaker console, choose Inference in the navigation pane, then Endpoints. Make sure to wait for the status to change to InService.

SageMaker Endpoint Status

Invoke your endpoint

To invoke your endpoint (in this example, for the iris dataset), you can use the following Python script with the AWS SDK for Python (Boto3). You can do this from a SageMaker notebook, or embed the following code snippet in a Lambda function:

import boto3
from io import StringIO
import pandas as pd

client = boto3.client('sagemaker-runtime')

endpoint_name = 'Your endpoint name' # Your endpoint name.
content_type = "text/csv"   # The MIME type of the input data in the request body.

payload = pd.DataFrame([[1.5,0.2,4.4,2.6]])
csv_file = StringIO()
payload.to_csv(csv_file, sep=",", header=False, index=False)
payload_as_csv = csv_file.getvalue()

response = client.invoke_endpoint(
EndpointName=endpoint_name,
ContentType=content_type,
Body=payload_as_csv
)

label = response['Body'].read().decode('utf-8')
print(label)

Clean up

You can destroy the infrastructure created by Terraform with the command terraform destroy, but you need to delete the data and files in the S3 buckets first. Furthermore, the SageMaker endpoint (or multiple SageMaker endpoints if you run the pipeline multiple times) is created via Step Functions and not managed via Terraform. This means that the deployment happens when running the ML pipeline with Step Functions. Therefore, make sure you also delete the SageMaker endpoint or endpoints created via the Step Functions ML pipeline to avoid unnecessary costs. Complete the following steps:

  1. On the Amazon S3 console, delete the dataset in the S3 training bucket.
  2. Delete all the models you trained via the ML pipeline in the S3 models bucket, either via the Amazon S3 console or the AWS CLI.
  3. Destroy the infrastructure created via Terraform:
    cd terraform/infrastructure
    terraform destroy

  4. Delete the SageMaker endpoints, endpoint configuration, and models created via Step Functions, either on the SageMaker console or via the AWS CLI (a Python sketch follows this list).
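The following is a minimal sketch of step 4 with the AWS SDK for Python (Boto3); the resource names are placeholders for the ones created by your Step Functions execution:

import boto3

sm = boto3.client("sagemaker")

endpoint_name = "<your-endpoint-name>"           # placeholder
endpoint_config_name = "<your-endpoint-config>"  # placeholder
model_name = "<your-model-name>"                 # placeholder

sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=model_name)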

Conclusion

Congratulations! You’ve deployed an ML pipeline using SageMaker with Terraform. This example solution shows how you can easily deploy AWS infrastructure and services for ML pipelines in a reusable fashion. This allows you to scale for multiple use cases or Regions, and enables training and deploying ML models with one click in a consistent way. Furthermore, you can run the ML pipeline multiple times, for example, when new data is available or you want to change the algorithm code. You can also choose to route requests or traffic to different SageMaker endpoints.

I encourage you to explore adding security features and adopting security best practices according to your needs and potential company standards. Additionally, embedding this solution into your CI/CD pipelines will give you further capabilities in adopting and establishing DevOps best practices and standards according to your requirements.


About the Author

Oliver Zollikofer is a Data Scientist at Amazon Web Services. He enables global enterprise customers to build, train and deploy machine learning models, as well as managing the ML model lifecycle with MLOps. Further, he builds and architects related cloud solutions.

Read More