Best practices for creating Amazon Lex interaction models

Amazon Lex is an AWS service for building conversational interfaces into any application using voice and text, enabling businesses to add sophisticated, natural language chatbots across different channels. Amazon Lex uses machine learning (ML) to understand natural language (normal conversational text and speech). In this post, we go through a set of best practices for using ML to create a bot that will delight your customers by accurately understanding them. This allows your bot to have more natural conversations that don’t require the user to follow a set of strict instructions. Designing and building an intelligent conversational interface is very different from building a traditional application or website, and this post will help you develop some of the new skills required.

Let’s look at some of the terminology we use frequently in this post:

  • Utterance – The phrase the user says to your live bot.
  • Sample utterance – Some examples of what users might say. These are attached to intents and used to train the bot.
  • Intent – This represents what the user meant and should be clearly connected to a response or an action from the bot. For instance, an intent that responds to a user saying hello, or an intent that can respond and take action if a user wants to order a coffee. A bot has one or more intents that utterances can be mapped to.
  • Slot – A parameter that can capture specific types of information from the utterance (for example, the time of an appointment or the customer’s name). Slots are attached to intents.
  • Slot value – Either examples of what the slot should capture, or a specific list of values for a slot (for example, large, medium, and small as values for a slot for coffee sizes).

The following image shows how all these pieces fit together to make up your bot.

A diagram showing how an interaction with an Amazon Lex bot flows through automatic speech recognition, natural language understanding, fulfilment (including conversational user experience) and back to text to speech
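
To make these building blocks concrete, the following minimal sketch shows how an intent and its sample utterances might be defined programmatically with the Lex V2 model-building API through boto3. The bot ID, intent name, and utterances are placeholders for illustration, and you can configure the same thing in the Amazon Lex console.

# Minimal sketch: defining an intent and its sample utterances with boto3.
# The bot ID and utterances below are placeholders.
import boto3

lex = boto3.client("lexv2-models")

lex.create_intent(
    botId="YOUR_BOT_ID",       # placeholder
    botVersion="DRAFT",        # intents are authored against the DRAFT version
    localeId="en_US",
    intentName="OrderCoffee",
    description="Customer wants to order a coffee",
    sampleUtterances=[
        {"utterance": "I'd like to order a coffee"},
        {"utterance": "coffee please"},
        {"utterance": "Can I get a large latte"},
    ],
)
# Utterances that reference slots, such as "Can I get a {coffeeSize} latte",
# can be added once the corresponding slot is created on the intent.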

Building a well-designed bot requires several different considerations. These include requirements gathering and discovery, conversational design, testing through automation and with users, and monitoring and optimizing your bot. Within the conversational design aspect, there are two main elements: the interaction model and the conversational or voice user experience (CUX/VUX). CUX and VUX encompass the personality of the bot, the types of responses, the flow of the conversation, variations for modality, and how the bot handles unexpected inputs or failures. The interaction model is the piece that can take what the user said (utterance) and map it to what they meant (intent). In this post, we only look at how to design and optimize your interaction model.

Because Amazon Lex uses machine learning, that puts the creator of the bot in the role of machine teacher. When we build a bot, we need to give it all the knowledge it needs about the types of conversations it will support. We do this both by how we configure the bot (intents and slots) and the training data we give it (sample utterances and slot values). The underlying service then enriches it with knowledge about language generally, enabling it to understand phrases beyond the exact data we have given it.

The best practices listed in the following sections can support you in building a bot that will give your customers a great user experience and work well for your use case.

Creating intents

Each intent is a concept you teach your bot to understand. For instance, it could be an intent that represents someone ordering a coffee, or someone greeting your bot. You need to make sure that you make it really clear and easy for the bot to recognize that a particular utterance should be matched to that intent.

Imagine if someone gave you a set of index cards with phrases on them, each sorted into piles, but with no other context or details. They then started to give you additional index cards with phrases and asked you to add them to the right pile, simply based on the phrases on the cards in each pile. If each pile represented a clear concept with similar phrasing, this would be easy. But if there were no clear topic in each, you would struggle to work out how to match them to a pile. You may even start to use other clues, like “these are all short sentences” or “only these have punctuation.”

Your bot uses similar techniques, but remember that although ML is smart, it’s not as smart as a human, and doesn’t have all the external knowledge and context a human has. If a human with no context of what your bot does might struggle to understand what was meant, your bot likely will too. The best practices in this section can help you create intents that will be recognizable and more likely to be matched with the desired utterance.

1. Each intent should represent a single concept

Each intent should represent one concept or idea, and not just a topic. It’s okay to have multiple intents that map to the same action or response if separating them gives each a clearer, cohesive concept. Let’s look at some dos and don’ts:

  • Don’t create generic intents that group multiple concepts together.

For example, the following intent combines phrases about a damaged product and more general complaint phrases:

DamageComplaint
I've received a damaged product
i received a damaged product
I'm really frustrated
Your company is terrible at deliveries
My product is broken
I got a damaged package
I'm going to return this order
I'll never buy from you again

The following intent is another example, which combines updating personal details with updating the mobile application:

UpdateNeeded
I need to update my address
Can I update the address you have for me
How do I update my telephone number
I can't get the update for the mobile app to work
Help me update my iphone app
How do I get the latest version of the mobile app

  • Do split up intents when they have very different meanings. For example, we can split up the UpdateNeeded intent from the previous example into two intents:

UpdatePersonalDetails
I need to update my address
Can I update the address you have for me
How do I update my telephone number

UpdateMobileApp
I can't get the update for the mobile app to work
Help me update my iphone app
How do I get the latest version of the mobile app

  • Do split up intents when they have the same action or response needed, but use very different phrasing. For example, the following two intents may have the same end result, but the first is directly telling us they need to tow their car, whereas the second is only indirectly hinting that they may need their car towed.

RoadsideAssistanceRequested
I need to tow my car
Can I get a tow truck
Can you send someone out to get my car

RoadsideAssistanceNeeded
I've had an accident
I hit an animal
My car broke down

2. Reduce overlap between intents

Let’s think about that stack of index cards again. If there were cards with the same (or very similar) phrases, it would be hard to know which stack to add a new card with that phrase onto. It’s the same in this case. We want really clear-cut sets of sample utterances in each intent. The following are a few strategies:

  • Don’t create intents with very similar phrasing that have similar meanings. Because Amazon Lex will generalize outside of the sample utterances, phrases that aren’t clearly one specific intent could get mismatched, for instance when a customer says “I’d like to book an appointment” and there are two appointment intents, like the following:

BookDoctorsAppointment
I’d like to book a doctors appointment

BookBloodLabAppointment
I’d like to book a lab appointment

  • Do use slots to combine intents that are on the same topic and have similar phrasing. For example, by combining the two intents in the previous example, we can more accurately capture any requests for an appointment, and then use a slot to determine the correct type of appointment:

BookAppointment
I’d like to book a {appointmentType} appointment

  • Don’t create intents where one intent is a subset of another. For example, as your bot grows, it can be easy to start creating intents to capture more detailed information:

BookFlight
I'd like to book a flight
book me a round trip flight
i need to book flight one way

BookOneWayFlight
book me a one-way flight
I’d like to book a one way flight
i need to book flight one way please

  • Do use slots to capture different subsets of information within an intent. For example, instead of using different intents to capture the information on the type of flight, we can use a slot to capture this:

BookFlight
I'd like to book a flight
book me a {itineraryType} flight
i need to book flight {itineraryType}
I’d like to book a {itineraryType} flight

3. Have the right amount of data

In ML, training data is key. Hundreds or thousands of samples are often needed to get good results. You’ll be glad to hear that Amazon Lex doesn’t require a huge amount of data, and in fact you don’t want to have too many sample utterances in each intent, because they may start to diverge or add confusion. However, it is key that we provide enough sample utterances to create a clear pattern for the bot to learn from.

Consider the following:

  • Have at least 15 utterances per intent.
  • Add additional utterances incrementally (batches of 10–15) so you can test the performance in stages. A larger number of utterances is not necessarily better.
  • Review intents with a large number of utterances (over 100) to evaluate whether you can remove very similar utterances or should split the intent into multiple intents (a quick audit script is sketched after this list).
  • Keep the number of utterances similar across intents. This allows recognition for each intent to be balanced, and avoids accidentally biasing your bot to certain intents.
  • Regularly review your intents based on learnings from your production bot, and continue to add and adjust the utterances. Designing and developing a bot is an iterative process that never stops.
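
If you want to check how balanced your intents are, a quick audit like the following sketch can count the sample utterances per intent. It assumes a Lex V2 bot and boto3; the bot ID and locale are placeholders, and pagination (nextToken) is omitted for brevity.

# Rough audit sketch: count sample utterances per intent to spot unbalanced
# or oversized intents. Bot ID and locale are placeholders.
import boto3

lex = boto3.client("lexv2-models")
bot_id, bot_version, locale_id = "YOUR_BOT_ID", "DRAFT", "en_US"

summaries = lex.list_intents(
    botId=bot_id, botVersion=bot_version, localeId=locale_id
)["intentSummaries"]

for summary in summaries:
    detail = lex.describe_intent(
        intentId=summary["intentId"],
        botId=bot_id, botVersion=bot_version, localeId=locale_id,
    )
    count = len(detail.get("sampleUtterances", []))
    print(f'{detail["intentName"]}: {count} sample utterances')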

4. Have diversity in your data

Amazon Lex is a conversational AI—its primary purpose is to chat with humans. Humans tend to have a large amount of variety in how they phrase things. When designing a bot, we want to make sure we’re capturing that range in our intent configuration. It’s important to re-evaluate and update your configuration and sample data on a regular basis, especially if you’re expanding or changing your user base over time. Consider the following recommendations:

  • Do have a diverse range of utterances in each intent. The following are examples of the types of diversity you should consider:
    • Utterance lengths – The following is an example of varying lengths:

BookFlight
book flight
I need to book a flight
I want to book a flight for my upcoming trip

    • Vocabulary – We need to align this with how our customers talk. You can capture this through user testing or by using the conversational logs from your bot. For example:

OrderFlowers
I want to buy flowers
Can I order flowers
I need to get flowers

    • Phrasing – We need a mix of utterances that represent the different ways our customers might phrase things. The following example shows utterances using “book” as a verb, “booking” as a noun, “flight booking” as a subject, and formal and informal language:

BookFlight
I need to book a flight
can you help with a flight booking
Flight booking is what I am looking for
please book me a flight
I'm gonna need a flight

    • Punctuation – We should include a range of common usage. We should also include non-grammatical usage if this is something a customer would use (especially when typing). See the following example:

OrderFlowers
I want to order flowers.
i wanted to get flowers!
Get me some flowers... please!!

    • Slot usage – Provide sample utterances that show both using and not using slots. Use different mixes of slots across those that include them. Make sure the slots have examples with different places they could appear in the utterance. For example:

CancelAppointment
Cancel appointment
Cancel my appointment with Dr. {DoctorLastName}
Cancel appointment on {AppointmentDate} with Dr. {DoctorLastName}
Cancel my appointment on {AppointmentDate}
Can you tell Dr. {DoctorLastName} to cancel my appointment
Please cancel my doctors appointment

  • Don’t keep adding utterances that are just small variances in phrasing. Amazon Lex is able to handle generalizing these for you. For example, you wouldn’t require each of these three variations as the differences are minor:

DamagedProductComplaint
I've received a damaged product
I received a damaged product
Received damaged product

  • Don’t add diversity to some intents but not to others. We need to be consistent with the forms of diversity we add. Remember the index cards from the beginning—when an utterance isn’t clear, the bot may start to use other clues, like sentence length or punctuation, to try to make a match. There are times you may want to use this to your advantage (for example, if you genuinely want to direct all one-word phrases to a particular intent), but it’s important you avoid doing this by accident.

Creating slots

We touched on some good practices involving slots in the previous section, but let’s look at some more specific best practices for slots.

5. Use short noun or adjective phrases for slots

Slots represent something that can be captured definitively as a parameter, like the size of the coffee you want to order, or the airport you’re flying to. Consider the following:

  • Use nouns or short adjectives for your slot values. Don’t use slots for things like carrier phrases (“how do I” or “what could I”) because this will reduce the ability of Amazon Lex to generalize your utterances. Try to keep slots for values you need to capture to fulfill your intent.
  • Keep slots generally to one or two words.

6. Prefer slots over explicit values

You can use slots to generalize the phrases you’re using, but we need to stick to the recommendations we just reviewed as well. To make our slot values as easy to identify as possible, we never use values included in the slot directly in sample utterances. Keep in mind the following tips:

  • Don’t explicitly include values that could be slots in the sample utterances. For example:

OrderFlowers
I want to buy roses
I want to buy lilies
I would love to order some orchids
I would love to order some roses

  • Do use slots to reduce repetition (a sketch of defining this slot type with the Lex V2 API follows this list). For example:

OrderFlowers
I want to buy {flowers}
I would love to order some {flowers}

flowers
roses
lilies
orchids

  • Don’t mix slots and real values in the sample utterances. For example:

OrderFlowers
I want to buy {flowers}
I want to buy lilies
I would love to order some {flowers}

flowers
roses
lilies
orchids

  • Don’t have intents with only slots in the sample utterances if the slot types are AlphaNumeric, Number, Date, GRXML, are very broad custom slots, or include abbreviations. Instead, expand the sample utterances by adding conversational phrases that include the slot to the sample utterances.
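
As a minimal sketch of the OrderFlowers example above, the following shows how the literal flower names might be moved into a custom slot type using the Lex V2 model-building API; the bot ID and locale are placeholders, and the same slot type can be created in the console.

# Minimal sketch: a custom slot type holding the flower names, so the intent's
# sample utterances can reference {flowers} instead of repeating each value.
import boto3

lex = boto3.client("lexv2-models")

lex.create_slot_type(
    botId="YOUR_BOT_ID",
    botVersion="DRAFT",
    localeId="en_US",
    slotTypeName="flowers",
    slotTypeValues=[
        {"sampleValue": {"value": "roses"}},
        {"sampleValue": {"value": "lilies"}},
        {"sampleValue": {"value": "orchids"}},
    ],
    valueSelectionSetting={"resolutionStrategy": "OriginalValue"},
)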

7. Keep your slot values coherent

The bot has to decide whether to match a slot based only on what it can learn from the values we have entered. If there is a lot of similarity or overlap within slots in the same intent, this can cause challenges with the right slot being matched.

  • Don’t have slots with overlapping values in the same intent. Try to combine them instead. For example:

pets
cat
dog
goldfish

animals
horse
cat
dog

8. Consider how the words will be transcribed

Amazon Lex uses automated speech recognition (ASR) to transcribe speech. This means that all inputs to your Amazon Lex interaction model are processed as text, even when using a voice bot. We need to remember that a transcription may vary from how users might type the same thing. Consider the following:

  • Enter acronyms, or other words whose letters should be pronounced individually, as single letters separated by a period and a space. This will more closely match how it will be transcribed (a small formatting helper is sketched after this list). For example:

A. T. M.
A. W. S.
P. A.

  • Review the audio and transcriptions on a regular basis, so you can adjust your sample utterances or slot types. To do this, turn on conversation logs, and enable both text and audio logs, whenever possible.
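
As a small illustration of the acronym formatting above, the following plain Python helper (no Lex API involved) reformats an acronym into period-and-space separated letters:

def spell_out_acronym(acronym: str) -> str:
    """Return the acronym as period-and-space separated letters, e.g. 'A. T. M.'"""
    return " ".join(f"{letter.upper()}." for letter in acronym if letter.isalpha())

print(spell_out_acronym("ATM"))   # A. T. M.
print(spell_out_acronym("AWS"))   # A. W. S.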

9. Use the right options available for your slots

Many different types of slots and options are available, and using the best options for each of our slots can help the recognition of those slot values. We always want to take the time to understand the options before deciding how to design our slots (a sketch showing two of these options follows the list):

  • Use the restrict option to limit slots to a closed set of values. You can define synonyms for each value. This could be, for instance, the menu items in your restaurant.
  • Use the expand option when you want to be able to identify more than just the sample values you provide (for example, Name).
  • Turn obfuscation on for slots that are collecting sensitive data to prevent the data from being logged.
  • Use runtime hints to improve slot recognition when you can narrow down the potential options at runtime. Choosing one slot might narrow down the options for another; for example, a particular type of furniture may not have all color options.
  • Use spelling styles to capture uncommon words or words with variations in spellings such as names.
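
The following sketch shows two of these options with the Lex V2 model-building API: restricting a slot type to a closed set of values, and turning on obfuscation for a slot that collects sensitive data. The IDs, names, and prompts are placeholders, and the same settings can be configured in the console.

# Sketch of two slot options; IDs and names are placeholders.
import boto3

lex = boto3.client("lexv2-models")
bot_id, bot_version, locale_id = "YOUR_BOT_ID", "DRAFT", "en_US"

# Restrict: only the listed values (and their synonyms) resolve for this slot type.
menu_type = lex.create_slot_type(
    botId=bot_id, botVersion=bot_version, localeId=locale_id,
    slotTypeName="menuItem",
    slotTypeValues=[
        {"sampleValue": {"value": "latte"}, "synonyms": [{"value": "caffe latte"}]},
        {"sampleValue": {"value": "espresso"}},
    ],
    valueSelectionSetting={"resolutionStrategy": "TopResolution"},
)
# menu_type["slotTypeId"] can later be referenced when creating a slot of this type.

# Obfuscate: mask the captured value in conversation logs for a sensitive slot.
lex.create_slot(
    botId=bot_id, botVersion=bot_version, localeId=locale_id,
    intentId="YOUR_INTENT_ID",            # placeholder
    slotName="accountNumber",
    slotTypeId="AMAZON.AlphaNumeric",     # built-in slot type
    obfuscationSetting={"obfuscationSettingType": "DefaultObfuscation"},
    valueElicitationSetting={
        "slotConstraint": "Required",
        "promptSpecification": {
            "messageGroups": [
                {"message": {"plainTextMessage": {"value": "What is your account number?"}}}
            ],
            "maxRetries": 2,
        },
    },
)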

10. Use custom vocabulary for specialist domains

In most cases, a custom vocabulary is not required, but if your users will use specialist words that aren’t common in everyday language, adding one can help make sure that your transcriptions are accurate. Keep the following in mind:

  • Do use a custom vocabulary to add words that aren’t readily recognized by Amazon Lex in voice-based conversations. This improves the speech-to-text transcription and overall customer experience.
  • Don’t use short or common words like “on,” “it,” “to,” “yes,” or “no” in a custom vocabulary.
  • Do decide how much weight to give a word based on how often the word isn’t recognized in the transcription and how rare the word is in the input. Words that are difficult to pronounce require a higher weight. Use a representative test set to determine if a weight is appropriate. You can collect an audio test set by turning on audio logging in conversation logs.
  • Do use custom slot types for lists of catalog values or entities such as product names or mutual funds.

11. GRXML slots need a strict grammar

When migrating to Amazon Lex from a service that may already have grammars in place (such as traditional automatic speech recognition engines), it is possible to reuse GRXML grammars during the new bot design process. However, when creating a completely new Amazon Lex bot, we recommend first checking if other slot types might meet your needs before using GRXML. Consider the following:

  • Do use GRXML slots only for spoken input, and not text-based interactions.
  • Don’t add the carrier phrases for the GRXML slots in the GRXML file (grammar) itself.
  • Do put carrier phrases into the slot sample utterances, such as I live in {zipCode} or {zipCode} is my zip code.
  • Do author the grammar to only capture correct slot values. For example, to capture a five-digit US ZIP code, you should only accept values that are exactly five digits.

Summary

In this post, we walked through a set of best practices that should help you as you design and build your next bot. As you take away this information, it’s important to remember that best practices are always context dependent. These aren’t rules, but guidelines to help you build a high-performing chatbot. As you keep building and optimizing your own bots, you will find some of these are more important for your use case than others, and you might add your own additional best practices. As a bot creator, you have a lot of control over how you configure your Amazon Lex bot to get the best results for your use case, and these best practices should give you a great place to start.

We can summarize the best practices in this post as follows:

  • Keep each intent to a single clear concept with a coherent set of utterances
  • Use representative, balanced, and diverse sample utterance data
  • Use slots to make intents clearer and capture data
  • Keep each slot to a single topic with a clear set of values
  • Know and use the right type of slot for your use case

For more information on Amazon Lex, check out Getting started with Amazon Lex for documentation, tutorials, how-to videos, code samples, and SDKs.


About the Author

Gillian Armstrong is a Builder Solutions Architect. She is excited about how the Cloud is opening up opportunities for more people to use technology to solve problems, and especially excited about how cognitive technologies, like conversational AI, are allowing us to interact with computers in more human ways.

Power recommendations and search using an IMDb knowledge graph – Part 3

This three-part series demonstrates how to use graph neural networks (GNNs) and Amazon Neptune to generate movie recommendations using the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata, including over 1 billion user ratings; credits for more than 11 million cast and crew members; 9 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries. Many AWS media and entertainment customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention.

The following diagram illustrates the complete architecture implemented as part of this series.

In Part 1, we discussed the applications of GNNs and how to transform and prepare our IMDb data into a knowledge graph (KG). We downloaded the data from AWS Data Exchange and processed it in AWS Glue to generate KG files. The KG files were stored in Amazon Simple Storage Service (Amazon S3) and then loaded in Amazon Neptune.

In Part 2, we demonstrated how to use Amazon Neptune ML (in Amazon SageMaker) to train the KG and create KG embeddings.

In this post, we walk you through how to apply our trained KG embeddings in Amazon S3 to out-of-catalog search use cases using Amazon OpenSearch Service and AWS Lambda. You also deploy a local web app for an interactive search experience. All the resources used in this post can be created using a single AWS Cloud Development Kit (AWS CDK) command as described later in the post.

Background

Have you ever searched for a content title that wasn’t available on a video streaming platform? If so, instead of facing a blank search results page, you were likely shown a list of movies in the same genre or with the same cast or crew members. That’s an out-of-catalog search experience!

Out-of-catalog search (OOC) is when you enter a search query that has no direct match in a catalog. This event frequently occurs in video streaming platforms that constantly purchase a variety of content from multiple vendors and production companies for a limited time. The absence of relevancy or mapping from a streaming company’s catalog to large knowledge bases of movies and shows can result in a sub-par search experience for customers that query OOC content, thereby lowering the interaction time with the platform. This mapping can be done by manually mapping frequent OOC queries to catalog content or can be automated using machine learning (ML).

In this post, we illustrate how to handle OOC by utilizing the power of the IMDb dataset (the premier source of global entertainment metadata) and knowledge graphs.

OpenSearch Service is a fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (versions 1.5 to 7.10), and visualization capabilities powered by OpenSearch Dashboards and Kibana (versions 1.5 to 7.10). OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management, processing trillions of requests per month. OpenSearch Service offers kNN search, which can enhance search in use cases such as product recommendations, fraud detection, and image and video search, as well as semantic scenarios like document and query similarity. For more information about the natural language understanding-powered search functionalities of OpenSearch Service, refer to Building an NLU-powered search application with Amazon SageMaker and the Amazon OpenSearch Service KNN feature.

Solution overview

In this post, we present a solution to handle OOC situations through knowledge graph-based embedding search using the k-nearest neighbor (kNN) search capabilities of OpenSearch Service. The key AWS services used to implement this solution are OpenSearch Service, SageMaker, Lambda, and Amazon S3.

Check out Part 1 and Part 2 of this series to learn more about creating knowledge graphs and GNN embedding using Amazon Neptune ML.

Our OOC solution assumes that you have a combined KG obtained by merging a streaming company KG and IMDb KG. This can be done through simple text processing techniques that match titles along with the title type (movie, series, documentary), cast, and crew. Additionally, this joint knowledge graph has to be trained to generate knowledge graph embeddings through the pipelines mentioned in Part 1 and Part 2. The following diagram illustrates a simplified view of the combined KG.

To demonstrate the OOC search functionality with a simple example, we split the IMDb knowledge graph into customer-catalog and out-of-customer-catalog. We mark the titles that contain “Toy Story” as an out-of-customer catalog resource and the rest of the IMDb knowledge graph as customer catalog. In a scenario where the customer catalog is not enhanced or merged with external databases, a search for “toy story” would return any title that has the words “toy” or “story” in its metadata, with the OpenSearch text search. If the customer catalog was mapped to IMDb, it would be easier to glean that the query “toy story” doesn’t exist in the catalog and that the top matches in IMDb are “Toy Story,” “Toy Story 2,” “Toy Story 3,” “Toy Story 4,” and “Charlie: Toy Story” in decreasing order of relevance with text match. To get within-catalog results for each of these matches, we can generate five closest movies in customer catalog-based kNN embedding (of the joint KG) similarity through OpenSearch Service.

A typical OOC experience follows the flow illustrated in the following figure.

The following video shows the top five (number of hits) OOC results for the query “toy story” and relevant matches in the customer catalog (number of recommendations).

Here, the query is matched to the knowledge graph using text search in OpenSearch Service. We then map the embeddings of the text match to the customer catalog titles using the OpenSearch Service kNN index. Because the user query can’t be directly mapped to the knowledge graph entities, we use a two-step approach to first find title-based query similarities and then items similar to the title using knowledge graph embeddings. In the following sections, we walk through the process of setting up an OpenSearch Service cluster, creating and uploading knowledge graph indexes, and deploying the solution as a web application.
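
The following simplified sketch illustrates that two-step approach with the opensearch-py client. The index names match the ones created later in this post (ooc_text and ooc_knn), but the client setup (authentication omitted) and the field names (title, embedding) are assumptions and may differ from the repository code.

# Simplified two-step retrieval sketch: text match first, then kNN similarity.
from opensearchpy import OpenSearch

ops = OpenSearch(hosts=[{"host": "YOUR_DOMAIN_ENDPOINT", "port": 443}], use_ssl=True)

def out_of_catalog_search(query_text, num_hits=5, num_recs=5):
    # Step 1: fuzzy text match of the user query against IMDb titles.
    text_hits = ops.search(
        index="ooc_text",
        body={"size": num_hits,
              "query": {"match": {"title": {"query": query_text, "fuzziness": "AUTO"}}}},
    )["hits"]["hits"]

    # Step 2: for each text match, find the closest in-catalog titles by kNN
    # similarity over the knowledge graph embeddings. Assumes the embedding was
    # ingested alongside the metadata in the text index.
    results = []
    for hit in text_hits:
        knn_hits = ops.search(
            index="ooc_knn",
            body={"size": num_recs,
                  "query": {"knn": {"embedding": {"vector": hit["_source"]["embedding"],
                                                  "k": num_recs}}}},
        )["hits"]["hits"]
        results.append({"match": hit["_source"]["title"],
                        "recommendations": [r["_source"]["title"] for r in knn_hits]})
    return results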

Prerequisites

To implement this solution, you should have an AWS account, familiarity with OpenSearch Service, SageMaker, Lambda, and AWS CloudFormation, and have completed the steps in Part 1 and Part 2 of this series.

Launch solution resources

The following architecture diagram shows the out-of-catalog workflow.

You will use the AWS Cloud Development Kit (CDK) to provision the resources required for the OOC search applications. The code to launch these resources performs the following operations:

  1. Creates a VPC for the resources.
  2. Creates an OpenSearch Service domain for the search application.
  3. Creates a Lambda function to process and load movie metadata and embeddings to OpenSearch Service indexes (**-LoadDataIntoOpenSearchLambda-**).
  4. Creates a Lambda function that takes the user query from a web app as input and returns relevant titles from OpenSearch (**-ReadFromOpenSearchLambda-**).
  5. Creates an API Gateway that adds an additional layer of security between the web app user interface and Lambda.

To get started, complete the following steps:

  1. Run the code and notebooks from Part 1 and Part 2.
  2. Navigate to the part3-out-of-catalog folder in the code repository.
  3. Launch the AWS CDK from the terminal with the command bash launch_stack.sh.
  4. Provide the two S3 file paths created in Part 2 as input:
    1. The S3 path to the movie embeddings CSV file.
    2. The S3 path to the movie node file.
  5. Wait until the script provisions all the required resources and finishes running.
  6. Copy the API Gateway URL that the AWS CDK script prints out and save it. (We use this for the Streamlit app later.)

Create an OpenSearch Service domain

For illustration purposes, you create a search domain on one Availability Zone in an r6g.large.search instance within a secure VPC and subnet. Note that the best practice would be to set up on three Availability Zones with one primary and two replica instances.

Create an OpenSearch Service index and upload data

You use Lambda functions (created using the AWS CDK launch stack command) to create the OpenSearch Service indexes. To start the index creation, complete the following steps:

  1. On the Lambda console, open the LoadDataIntoOpenSearchLambda Lambda function.
  2. On the Test tab, choose Test to create and ingest data into the OpenSearch Service index.

The following code for this Lambda function can be found in part3-out-of-catalog/cdk/ooc/lambdas/LoadDataIntoOpenSearchLambda/lambda_handler.py:

embedding_file = os.environ.get("embeddings_file")
movie_node_file = os.environ.get("movie_node_file")
print("Merging files")
merged_df = merge_data(embedding_file, movie_node_file)
print("Embeddings and metadata files merged")

print("Initializing OpenSearch client")
ops = initialize_ops()
indices = ops.indices.get_alias().keys()
print("Current indices are :", indices)

# This will take 5 minutes
print("Creating knn index")
# Create the index using knn settings. Creating OOC text is not needed
create_index('ooc_knn',ops)
print("knn index created!")

print("Uploading the data for knn index")
response = ingest_data_into_ops(merged_df, ops, ops_index='ooc_knn', post_method=post_request_emb)
print(response)
print("Upload complete for knn index")

print("Uploading the data for fuzzy word search index")
response = ingest_data_into_ops(merged_df, ops, ops_index='ooc_text', post_method=post_request)
print("Upload complete for fuzzy word search index")
# Create the response and add some extra content to support CORS
response = {
    "statusCode": 200,
    "headers": {
        "Access-Control-Allow-Origin": '*'
    },
    "isBase64Encoded": False
}

The function performs the following tasks:

  1. Loads the IMDb KG movie node file that contains the movie metadata and its associated embeddings from the S3 file paths that were passed to the stack creation file launch_stack.sh.
  2. Merges the two input files to create a single dataframe for index creation.
  3. Initializes the OpenSearch Service client using the Boto3 Python library.
  4. Creates two indexes for text (ooc_text) and kNN embedding search (ooc_knn) and bulk uploads data from the combined dataframe through the ingest_data_into_ops function.

This data ingestion process takes 5–10 minutes and can be monitored through the Amazon CloudWatch logs on the Monitoring tab of the Lambda function.

You create two indexes to enable text-based search and kNN embedding-based search. The text search maps the free-form query the user enters to the titles of the movie. The kNN embedding search finds the k closest movies to the best text match from the KG latent space to return as outputs.
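
As a rough sketch of what the kNN index definition might look like (the actual create_index helper lives in the repository’s lambda_handler.py, and the embedding dimension depends on the Neptune ML training in Part 2), assuming the same opensearch-py client (ops) as in the earlier sketch:

# Hypothetical kNN index definition; field names and dimension are assumptions.
knn_index_body = {
    "settings": {"index": {"knn": True}},   # enable the kNN plugin for this index
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 64},
        }
    },
}
ops.indices.create(index="ooc_knn", body=knn_index_body)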

Deploy the solution as a local web application

Now that you have a working text search and kNN index on OpenSearch Service, you’re ready to build a ML-powered web app.

We use the streamlit Python package to create a front-end illustration for this application. The IMDb-Knowledge-Graph-Blog/part3-out-of-catalog/run_imdb_demo.py Python file in our GitHub repo has the required code to launch a local web app to explore this capability.

To run the code, complete the following steps:

  1. Install the streamlit and aws_requests_auth Python packages in your local virtual Python environment with the following commands in your terminal:
pip install streamlit

pip install aws-requests-auth
  2. Replace the placeholder for the API Gateway URL in the code with the one created by the AWS CDK, as follows:

api = '<ENTER URL OF THE API GATEWAY HERE>/opensearch-lambda?q={query_text}&numMovies={num_movies}&numRecs={num_recs}'

  3. Launch the web app with the command streamlit run run_imdb_demo.py from your terminal.

This script launches a Streamlit web app that can be accessed in your web browser. The URL of the web app can be retrieved from the script output, as shown in the following screenshot.

The app accepts new search strings, the number of hits, and the number of recommendations. The number of hits corresponds to how many matching OOC titles we should retrieve from the external (IMDb) catalog. The number of recommendations corresponds to how many nearest neighbors we should retrieve from the customer catalog based on kNN embedding search. See the following code:

search_text=st.sidebar.text_input("Please enter search text to find movies and recommendations")
num_movies= st.sidebar.slider('Number of search hits', min_value=0, max_value=5, value=1)
recs_per_movie= st.sidebar.slider('Number of recommendations per hit', min_value=0, max_value=10, value=5)
if st.sidebar.button('Find'):
    resp= get_movies()

This input (query, number of hits, and number of recommendations) is passed to the **-ReadFromOpenSearchLambda-** Lambda function created by the AWS CDK through an API Gateway request. This is done in the following function:

def get_movies():
    # Call the API Gateway endpoint, which invokes the ReadFromOpenSearchLambda function
    result = requests.get(api.format(query_text=search_text, num_movies=num_movies, num_recs=recs_per_movie)).json()
    return result

The output of the Lambda function (the results from OpenSearch Service) is passed back through API Gateway and displayed in the Streamlit app.
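
A minimal sketch of rendering that response in Streamlit might look like the following; the response keys (hits, recommendations, title) are hypothetical and depend on the Lambda’s output format:

# Hypothetical display of the API response; keys are assumptions.
if st.sidebar.button('Find'):
    resp = get_movies()
    for hit in resp.get("hits", []):
        st.subheader(hit.get("title", "Unknown title"))       # the OOC title matched in IMDb
        for rec in hit.get("recommendations", []):
            st.write(rec)                                      # nearest in-catalog titles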

Clean up

You can delete all the resources created by the AWS CDK through the command npx cdk destroy --app "python3 appy.py" --all in the same instance (inside the cdk folder) that was used to launch the stack (see the following screenshot).

Conclusion

In this post, we showed you how to create a solution for OOC search using text and kNN-based search using SageMaker and OpenSearch Service. You used custom knowledge graph model embeddings to find nearest neighbors in your catalog to that of IMDb titles. You can now, for example, search for “The Rings of Power,” a fantasy series developed by Amazon Prime Video, on other streaming platforms and reason how they could have optimized the search result.

For more information about the code sample in this post, see the GitHub repo. To learn more about collaborating with the Amazon ML Solutions Lab to build similar state-of-the-art ML applications, see Amazon Machine Learning Solutions Lab. For more information on licensing IMDb datasets, visit developer.imdb.com.


About the Authors

Divya Bhargavi is a Data Scientist and Media and Entertainment Vertical Lead at the Amazon ML Solutions Lab, where she solves high-value business problems for AWS customers using machine learning. She works on image/video understanding, knowledge graph recommendation systems, and predictive advertising use cases.

Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.

Matthew Rhodes is a Data Scientist I working in the Amazon ML Solutions Lab. He specializes in building Machine Learning pipelines that involve concepts such as Natural Language Processing and Computer Vision.

Karan Sindwani is a Data Scientist at Amazon ML Solutions Lab, where he builds and deploys deep learning models. He specializes in the area of computer vision. In his spare time, he enjoys hiking.

Soji Adeshina is an Applied Scientist at AWS where he develops graph neural network-based models for machine learning on graphs tasks with applications to fraud & abuse, knowledge graphs, recommender systems, and life sciences. In his spare time, he enjoys reading and cooking.

Vidya Sagar Ravipati is a Manager at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.

AWS positioned in the Leaders category in the 2022 IDC MarketScape for APEJ AI Life-Cycle Software Tools and Platforms Vendor Assessment

The recently published IDC MarketScape: Asia/Pacific (Excluding Japan) AI Life-Cycle Software Tools and Platforms 2022 Vendor Assessment positions AWS in the Leaders category. This was the first and only APEJ-specific analyst evaluation focused on AI life-cycle software from IDC. The vendors evaluated for this MarketScape offer various software tools needed to support end-to-end machine learning (ML) model development, including data preparation, model building and training, model operation, evaluation, deployment, and monitoring. The tools are typically used by data scientists and ML developers from experimentation to production deployment of AI and ML solutions.

AI life-cycle tools are essential to productize AI/ML solutions. They go quite a few steps beyond AI/ML experimentation: to achieve deployment anywhere, performance at scale, cost optimization, and increasingly important, support systematic model risk management—explainability, robustness, drift, privacy protection, and more. Businesses need these tools to unlock the value of enterprise data assets at greater scale and faster speed.

Vendor Requirements for the IDC MarketScape

To be considered for the MarketScape, the vendor had to provide software products for various aspects of the end-to-end ML process under independent product stock-keeping units (SKUs) or as part of a general AI software platform. The products had to be based on the company’s own IP, and the products should have generated software license revenue or consumption-based software revenue for at least 12 months in APEJ as of March 2022. The company had to be among the top 15 vendors by the reported revenues of 2020–2021 in the APEJ region, according to IDC’s AI Software Tracker. AWS met the criteria and was evaluated by IDC along with eight other vendors.

The result of IDC’s comprehensive evaluation was published October 2022 in the IDC MarketScape: Asia/Pacific (Excluding Japan) AI Life-Cycle Software Tools and Platforms 2022 Vendor Assessment. AWS is positioned in the Leaders category based on current capabilities. The AWS strategy is to make continuous investments in AI/ML services to help customers innovate with AI and ML.

AWS position

“AWS is placed in the Leaders category in this exercise, receiving higher ratings in various assessment categories—the breadth of tooling services provided, options to lower cost for performance, quality of customer service and support, and pace of product innovation, to name a few.”

– Jessie Danqing Cai, Associate Research Director, Big Data & Analytics Practice, IDC Asia/Pacific.

The visual below is part of the MarketScape and shows the AWS position evaluated by capabilities and strategies.

The IDC MarketScape vendor analysis model is designed to provide an overview of the competitive fitness of ICT suppliers in a given market. The research methodology utilizes a rigorous scoring methodology based on both qualitative and quantitative criteria that results in a single graphical illustration of each vendor’s position within a given market. The Capabilities score measures vendor product, go-to-market, and business execution in the short term. The Strategy score measures alignment of vendor strategies with customer requirements in a 3–5-year time frame. Vendor market share is represented by the size of the icons.

Amazon SageMaker evaluated as part of the MarketScape

As part of the evaluation, IDC dove deep into Amazon SageMaker capabilities. SageMaker is a fully managed service to build, train, and deploy ML models for any use case with fully managed infrastructure, tools, and workflows. Since the launch of SageMaker in 2017, over 250 capabilities and features have been released.

ML practitioners such as data scientists, data engineers, business analysts, and MLOps professionals use SageMaker to break down barriers across each step of the ML workflow through their choice of integrated development environments (IDEs) or no-code interfaces. Starting with data preparation, SageMaker makes it easy to access, label, and process large amounts of structured data (tabular data) and unstructured data (photo, video, geospatial, and audio) for ML. After data is prepared, SageMaker offers fully managed notebooks for model building and reduces training time from hours to minutes with optimized infrastructure. SageMaker makes it easy to deploy ML models to make predictions at the best price-performance for any use case through a broad selection of ML infrastructure and model deployment options. Finally, the MLOps tools in SageMaker help you scale model deployment, reduce inference costs, manage models more effectively in production, and reduce operational burden.

The MarketScape calls out three strengths for AWS:

  • Functionality and offering – SageMaker provides a broad and deep set of tools for data preparation, model training, and deployment, including AWS-built silicon: AWS Inferentia for inference workloads and AWS Trainium for training workloads. SageMaker supports model explainability and bias detection through Amazon SageMaker Clarify.
  • Service delivery – SageMaker is natively available on AWS, the second largest public cloud platform in the APEJ region (based on IDC Public Cloud Services Tracker, IaaS+PaaS, 2021 data), with regions in Japan, Australia, New Zealand, Singapore, India, Indonesia, South Korea, and Greater China. Local zones are available to serve customers in ASEAN countries: Thailand, the Philippines, and Vietnam.
  • Growth opportunities – AWS actively contributes to open-source projects such as Gluon and engages with regional developer and student communities through many events, online courses, and Amazon SageMaker Studio Lab, a no-cost SageMaker notebook environment.

SageMaker launches at re:Invent 2022

SageMaker innovation continued at AWS re:Invent 2022, with eight new capabilities. The launches included three new capabilities for ML model governance. As the number of models and users within an organization increases, it becomes harder to set least-privilege access controls and establish governance processes to document model information (for example, input datasets, training environment information, model-use description, and risk rating). After models are deployed, customers also need to monitor for bias and feature drift to ensure they perform as expected. A new role manager, model cards, and model dashboard simplify access control and enhance transparency to support ML model governance.

There were also three launches related to Amazon SageMaker Studio notebooks. SageMaker Studio notebooks gives practitioners a fully managed notebook experience, from data exploration to deployment. As teams grow in size and complexity, dozens of practitioners may need to collaboratively develop models using notebooks. AWS continues to offer the best notebook experience for users, with the launch of three new features that help you coordinate and automate notebook code.

To support model deployment, new capabilities in SageMaker help you run shadow tests to evaluate a new ML model before production release by testing its performance against the currently deployed model. Shadow testing can help you catch potential configuration errors and performance issues before they impact end-users.

Finally, SageMaker launched support for geospatial ML, allowing data scientists and ML engineers to easily build, train, and deploy ML models using geospatial data. You can access geospatial data sources, purpose-built processing operations, pre-trained ML models, and built-in visualization tools to run geospatial ML faster and at scale.

Today, tens of thousands of customers use Amazon SageMaker to train models with billions of parameters and make over 1 trillion predictions per month. To learn more about SageMaker, visit the webpage and explore how fully managed infrastructure, tools, and workflows can help you accelerate ML model development.


About the author

Kimberly Madia is a Principal Product Marketing Manager with AWS Machine Learning. Her goal is to make it easy for customers to build, train, and deploy machine learning models using Amazon SageMaker. For fun outside work, Kimberly likes to cook, read, and run on the San Francisco Bay Trail.

How Thomson Reuters delivers personalized content subscription plans at scale using Amazon Personalize

This post is co-written by Hesham Fahim from Thomson Reuters.

Thomson Reuters (TR) is one of the world’s most trusted information organizations for businesses and professionals. It provides companies with the intelligence, technology, and human expertise they need to find trusted answers, enabling them to make better decisions more quickly. TR’s customers span across the financial, risk, legal, tax, accounting, and media markets.

Thomson Reuters provides market-leading products in the Tax, Legal, and News space, which users can sign up for using a subscription licensing model. To enhance this experience, TR wanted to create a centralized recommendations platform that allowed their sales team to suggest the most relevant subscription packages to customers, raising awareness of products that could help those customers serve the market better through tailored product selections.

Prior to building this centralized platform, TR had a legacy rules-based engine to generate renewal recommendations. The rules in this engine were predefined and written in SQL, which, aside from posing a challenge to manage, also struggled to cope with the proliferation of data from TR’s various integrated data sources. TR customer data is changing at a faster rate than the business rules can evolve to reflect changing customer needs. The key requirement for TR’s new machine learning (ML)-based personalization engine was an accurate recommendation system that takes recent customer trends into account. The desired solution would be one with low operational overhead, the ability to accelerate delivering business goals, and a personalization engine that could be constantly trained with up-to-date data to deal with changing consumer habits and new products.

Personalizing the renewal recommendations based on what would be valuable products for TR’s customers was an important business challenge for the sales and marketing team. TR has a wealth of data that could be used for personalization that has been collected from customer interactions and stored within a centralized data warehouse. TR has been an early adopter of ML with Amazon SageMaker, and their maturity in the AI/ML domain meant that they had collated a significant dataset of relevant data within a data warehouse, which the team could train a personalization model with. TR has continued their AI/ML innovation and has recently developed a revamped recommendation platform using Amazon Personalize, which is a fully managed ML service that uses user interactions and items to generate recommendations for users. In this post, we explain how TR used Amazon Personalize to build a scalable, multi-tenanted recommender system that provides the best product subscription plans and associated pricing to their customers.

Solution architecture

The solution had to be designed considering TR’s core operations around understanding users through data; providing these users with personalized and relevant content from a large corpus of data was a mission-critical requirement. Having a well-designed recommendation system is key to getting quality recommendations that are customized to each user’s requirements.

The solution required collecting and preparing user behavior data, training an ML model using Amazon Personalize, generating personalized recommendations through the trained model, and driving marketing campaigns with the personalized recommendations.

TR wanted to take advantage of AWS managed services where possible to simplify operations and reduce undifferentiated heavy lifting. TR used AWS Glue DataBrew and AWS Batch jobs to perform the extract, transform, and load (ETL) jobs in the ML pipelines, and SageMaker along with Amazon Personalize to tailor the recommendations. From a training data volume and runtime perspective, the solution needed to be scalable to process millions of records within the time frame already committed to downstream consumers in TR’s business teams.

The following sections explain the components involved in the solution.

ML training pipeline

Interactions between the users and the content are collected in the form of clickstream data, which is generated as the customer clicks on the content. TR analyzes whether the content is part of the customer’s subscription plan or beyond it, so that they can provide additional details about the price and plan enrollment options. The user interactions data from various sources is persisted in their data warehouse.

The following diagram illustrates the ML training pipeline.
ML engine training pipeline
The pipeline starts with an AWS Batch job that extracts the data from the data warehouse and transforms the data to create interactions, users, and items datasets.

The following datasets are used to train the model:

  • Structured product data – Subscriptions, orders, product catalog, transactions, and customer details
  • Semi-structured behavior data – Users, usage, and interactions

This transformed data is stored in an Amazon Simple Storage Service (Amazon S3) bucket, which is imported into Amazon Personalize for ML training. Because TR wants to generate personalized recommendations for their users, they use the USER_PERSONALIZATION recipe to train ML models on their custom data, which is referred to as creating a solution version. After the solution version is created, it’s used for generating personalized recommendations for the users.
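
As a hedged sketch (not TR’s exact pipeline code), creating a solution with the user-personalization recipe and then a solution version through the Amazon Personalize API might look like the following; the names and ARNs are placeholders:

# Hypothetical training step with the Amazon Personalize API; ARNs are placeholders.
import boto3

personalize = boto3.client("personalize")

solution = personalize.create_solution(
    name="subscription-recommender",
    datasetGroupArn="arn:aws:personalize:REGION:ACCOUNT_ID:dataset-group/YOUR_DATASET_GROUP",
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)

# Train the model; the resulting solution version is what serves recommendations.
solution_version = personalize.create_solution_version(
    solutionArn=solution["solutionArn"]
)
print(solution_version["solutionVersionArn"])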

The entire workflow is orchestrated using AWS Step Functions. The alerts and notifications are captured and published to Microsoft Teams using Amazon Simple Notification Service (Amazon SNS) and Amazon EventBridge.

Generating personalized recommendations pipeline: Batch inference

Customer requirements and preferences change very often, and the latest interactions captured in clickstream data serves as a key data point to understand the changing preferences of the customer. To adapt to ever-changing customer preferences, TR generates personalized recommendations on a daily basis.

The following diagram illustrates the pipeline to generate personalized recommendations.
Pipeline to generate personalized recommendations in Batch
A DataBrew job extracts the data from the TR data warehouse for the users who are eligible for renewal recommendations based on their current subscription plan and recent activity. The DataBrew visual data preparation tool makes it easy for TR data analysts and data scientists to clean and normalize data to prepare it for analytics and ML. The ability to choose from over 250 pre-built transformations within the visual data preparation tool to automate data preparation tasks, all without the need to write any code, was an important feature. The DataBrew job generates an incremental dataset for interactions and input for the batch recommendations job and stores the output in an S3 bucket. The newly generated incremental dataset is imported into the interactions dataset. When the incremental dataset import job is successful, an Amazon Personalize batch recommendations job is triggered with the input data. Amazon Personalize generates the latest recommendations for the users provided in the input data and stores it in a recommendations S3 bucket.
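
A hedged sketch of triggering that batch recommendations job with the Amazon Personalize API follows; the job name, S3 paths, solution version ARN, and IAM role ARN are placeholders:

# Hypothetical batch inference trigger; all names, paths, and ARNs are placeholders.
import boto3

personalize = boto3.client("personalize")

personalize.create_batch_inference_job(
    jobName="daily-renewal-recommendations",
    solutionVersionArn="arn:aws:personalize:REGION:ACCOUNT_ID:solution/YOUR_SOLUTION/VERSION_ID",
    roleArn="arn:aws:iam::ACCOUNT_ID:role/PersonalizeBatchRole",
    jobInput={"s3DataSource": {"path": "s3://your-bucket/batch-input/users.json"}},
    jobOutput={"s3DataDestination": {"path": "s3://your-bucket/batch-output/"}},
    numResults=10,   # recommendations per user
)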

Price optimization is the last step before the newly formed recommendations are ready to use. TR runs a price optimization job on the recommendations generated and uses SageMaker to run custom models on the recommendations as part of this final step. An AWS Glue job curates the output generated from Amazon Personalize and transforms it into the input format required by the SageMaker custom model. TR is able to take advantage of the breadth of services that AWS provides, using both Amazon Personalize and SageMaker in the recommendation platform to tailor recommendations based on the type of customer firm and end-users.

The entire workflow is decoupled and orchestrated using Step Functions, which gives the flexibility of scaling the pipeline depending on the data processing requirements. The alerts and notifications are captured using Amazon SNS and EventBridge.

Driving email campaigns

The recommendations generated along with the pricing results are used to drive email campaigns to TR’s customers. An AWS Batch job is used to curate the recommendations for each customer and enrich them with the optimized pricing information. These recommendations are ingested into TR’s campaigning systems, which drive the following email campaigns:

  • Automated subscription renewal or upgrade campaigns with new products that might interest the customer
  • Mid-contract renewal campaigns with better offers and more relevant products and legal content materials

The information from this process is also replicated to the customer portal so customers reviewing their current subscription can see the new renewal recommendations. TR has seen a higher conversion rate from email campaigns, leading to increased sales orders, since implementing the new recommendation platform.

What’s next: Real-time recommendations pipeline

Customer requirements and shopping behaviors change in real time, and adapting recommendations to the real-time changes is key to serving the right content. After seeing great success with the batch recommendation system, TR is now planning to take this solution to the next level by implementing a real-time recommendations pipeline to generate recommendations using Amazon Personalize.

The following diagram illustrates the architecture to provide real-time recommendations.
Real-time recommendations pipeline
The real-time integration starts with collecting the live user engagement data and streaming it to Amazon Personalize. As the users interact with TR’s applications, they generate clickstream events, which are published into Amazon Kinesis Data Streams. Then the events are ingested into TR’s centralized streaming platform, which is built on top of Amazon Managed Streaming for Apache Kafka (Amazon MSK). Amazon MSK makes it easy to ingest and process streaming data in real time with fully managed Apache Kafka. In this architecture, Amazon MSK serves as a streaming platform and performs any data transformations required on the raw incoming clickstream events. Then an AWS Lambda function is triggered to filter the events to the schema compatible with the Amazon Personalize dataset and push those events to an Amazon Personalize event tracker using the PutEvents API. This allows Amazon Personalize to learn from the user’s most recent behavior and include relevant items in recommendations.
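
As a hedged sketch of the Lambda step that forwards a filtered clickstream event to the event tracker (the tracking ID, user and session IDs, item ID, and event type are placeholders, not TR’s actual schema):

# Hypothetical event forwarding to an Amazon Personalize event tracker.
from datetime import datetime
import boto3

personalize_events = boto3.client("personalize-events")

personalize_events.put_events(
    trackingId="YOUR_EVENT_TRACKER_TRACKING_ID",   # placeholder
    userId="user-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "click",                  # placeholder event type
            "itemId": "subscription-plan-789",     # placeholder item
            "sentAt": datetime.now(),
        }
    ],
)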

TR’s web applications invoke an API deployed in Amazon API Gateway to get recommendations, which triggers a Lambda function that calls the Amazon Personalize GetRecommendations API. Amazon Personalize returns the latest set of personalized recommendations curated to the user’s behavior, which are passed back to the web applications via Lambda and API Gateway.
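
Behind the API Gateway endpoint, the Lambda function's core call looks roughly like the following (placeholder campaign ARN and user ID), returning the ranked item list that the web applications render.

library(reticulate)
boto3 <- import('boto3')
personalize_runtime <- boto3$client('personalize-runtime')

# Fetch up to 10 real-time recommendations for a user
response <- personalize_runtime$get_recommendations(
  campaignArn = 'arn:aws:personalize:us-east-1:123456789012:campaign/renewal-recommendations',
  userId      = 'user-123',
  numResults  = 10L)

response$itemList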

With this real-time architecture, TR can serve its customers personalized recommendations curated to their most recent behavior and better meet their needs.

Conclusion

In this post, we showed you how TR used Amazon Personalize and other AWS services to implement a recommendation engine. Amazon Personalize enabled TR to accelerate the development and deployment of high-performance models to provide recommendations to their customers. TR is able to onboard a new suite of products within weeks now, compared to months earlier. With Amazon Personalize and SageMaker, TR is able to elevate the customer experience with better content subscription plans and prices for their customers.

If you enjoyed reading this blog and would like to learn more about Amazon Personalize and how it can help your organization build recommendation systems, please see the developer guide.


About the Authors

Hesham Fahim is a Lead Machine Learning Engineer and Personalization Engine Architect at Thomson Reuters. He has worked with organizations in academia and industry, ranging from large enterprises to mid-sized startups. With a focus on scalable deep learning architectures, he has experience in mobile robotics, biomedical image analysis, and recommender systems. Away from computers he enjoys astrophotography, reading, and long-distance biking.

Srinivasa Shaik is a Solutions Architect at AWS based in Boston. He helps Enterprise customers to accelerate their journey to the cloud. He is passionate about containers and machine learning technologies. In his spare time, he enjoys spending time with his family, cooking, and traveling.

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Simone Zucchet is a Senior Solutions Architect at AWS. With over 6 years of experience as a Cloud Architect, Simone enjoys working on innovative projects that help transform the way organizations approach business problems. He helps support large enterprise customers at AWS and is part of the Machine Learning TFC. Outside of his professional life, he enjoys working on cars and photography.

Read More

Connecting Amazon Redshift and RStudio on Amazon SageMaker

Connecting Amazon Redshift and RStudio on Amazon SageMaker

Last year, we announced the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

Many of the RStudio on SageMaker users are also users of Amazon Redshift, a fully managed, petabyte-scale, massively parallel data warehouse for data storage and analytical workloads. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Users can also interact with data with ODBC, JDBC, or the Amazon Redshift Data API.

The use of RStudio on SageMaker and Amazon Redshift can be helpful for efficiently performing analysis on large data sets in the cloud. However, working with data in the cloud can present challenges, such as the need to remove organizational data silos, maintain security and compliance, and reduce complexity by standardizing tooling. AWS offers tools such as RStudio on SageMaker and Amazon Redshift to help tackle these challenges.

In this blog post, we will show you how to use both of these services together to efficiently perform analysis on massive data sets in the cloud while addressing the challenges mentioned above. This blog focuses on RStudio on Amazon SageMaker, with business analysts, data engineers, data scientists, and all developers who use the R language and Amazon Redshift as the target audience.

If you’d like to use the traditional SageMaker Studio experience with Amazon Redshift, refer to Using the Amazon Redshift Data API to interact from an Amazon SageMaker Jupyter notebook.

Solution overview

In the blog today, we will be executing the following steps:

  1. Cloning the sample repository with the required packages.
  2. Connecting to Amazon Redshift with a secure ODBC connection (ODBC is the preferred protocol for RStudio).
  3. Running queries and SageMaker API actions on data within Amazon Redshift Serverless through RStudio on SageMaker.

This process is depicted in the following solutions architecture:

Solution walkthrough

Prerequisites

Prior to getting started, ensure you have all requirements for setting up RStudio on Amazon SageMaker and Amazon Redshift Serverless.

We will be using a CloudFormation stack to generate the required infrastructure.

Note: If you already have an RStudio domain and an Amazon Redshift cluster, you can skip this step.

Launching this stack creates the following resources:

  • 3 Private subnets
  • 1 Public subnet
  • 1 NAT gateway
  • Internet gateway
  • Amazon Redshift Serverless cluster
  • SageMaker domain with RStudio
  • SageMaker RStudio user profile
  • IAM service role for SageMaker RStudio domain execution
  • IAM service role for SageMaker RStudio user profile execution

This template is designed to work in a Region (for example, us-east-1 or us-west-2) with three Availability Zones, RStudio on SageMaker, and Amazon Redshift Serverless. Ensure your Region has access to those resources, or modify the templates accordingly.

Press the Launch Stack button to create the stack.

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, provide a name for your stack and leave the remaining options as default, then choose Next.
  3. On the Configure stack options page, leave the options as default and press Next.
  4. On the Review page, select the following check boxes, then choose Submit:
  • I acknowledge that AWS CloudFormation might create IAM resources with custom names
  • I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND

The template will generate five stacks.

Once the stack status is CREATE_COMPLETE, navigate to the Amazon Redshift Serverless console. This is a new capability that makes it super easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying. There is no need to set up and manage clusters.

Note: The pattern demonstrated in this blog, integrating Amazon Redshift and RStudio on Amazon SageMaker, is the same regardless of the Amazon Redshift deployment pattern (serverless or traditional cluster).

Loading data in Amazon Redshift Serverless

The CloudFormation script created a database called sagemaker. Let’s populate this database with tables for the RStudio user to query. Create a SQL editor tab and make sure the sagemaker database is selected. We will use the synthetic credit card transaction data to create tables in our database. This data is part of the SageMaker sample tabular datasets: s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions.

We are going to execute the following query in the query editor. This will generate three tables: cards, transactions, and users.

CREATE SCHEMA IF NOT EXISTS synthetic;
DROP TABLE IF EXISTS synthetic.transactions;

CREATE TABLE synthetic.transactions(
    user_id INT,
    card_id INT,
    year INT,
    month INT,
    day INT,
    time_stamp TIME,
    amount VARCHAR(100),
    use_chip VARCHAR(100),
    merchant_name VARCHAR(100),
    merchant_city VARCHAR(100),
    merchant_state VARCHAR(100),
    merchant_zip_code VARCHAR(100),
    merchant_category_code INT,
    is_error VARCHAR(100),
    is_fraud VARCHAR(100)
);

COPY synthetic.transactions
FROM 's3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/credit_card_transactions-ibm_v2.csv'
IAM_ROLE default
REGION 'us-east-1' 
IGNOREHEADER 1 
CSV;

DROP TABLE IF EXISTS synthetic.cards;

CREATE TABLE synthetic.cards(
    user_id INT,
    card_id INT,
    card_brand VARCHAR(100),
    card_type VARCHAR(100),
    card_number VARCHAR(100),
    expire_date VARCHAR(100),
    cvv INT,
    has_chip VARCHAR(100),
    number_cards_issued INT,
    credit_limit VARCHAR(100),
    account_open_date VARCHAR(100),
    year_pin_last_changed VARCHAR(100),
    is_card_on_dark_web VARCHAR(100)
);

COPY synthetic.cards
FROM 's3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/sd254_cards.csv'
IAM_ROLE default
REGION 'us-east-1' 
IGNOREHEADER 1 
CSV;

DROP TABLE IF EXISTS synthetic.users;

CREATE TABLE synthetic.users(
    name VARCHAR(100),
    current_age INT,
    retirement_age INT,
    birth_year INT,
    birth_month INT,
    gender VARCHAR(100),
    address VARCHAR(100),
    apartment VARCHAR(100),
    city VARCHAR(100),
    state VARCHAR(100),
    zip_code INT,
    latitude VARCHAR(100),
    longitude VARCHAR(100),
    per_capita_income_zip_code VARCHAR(100),
    yearly_income VARCHAR(100),
    total_debt VARCHAR(100),
    fico_score INT,
    number_credit_cards INT
);

COPY synthetic.users
FROM 's3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/sd254_users.csv'
IAM_ROLE default
REGION 'us-east-1' 
IGNOREHEADER 1 
CSV;

You can validate that the query ran successfully by seeing three tables within the left-hand pane of the query editor.

Once all of the tables are populated, navigate to SageMaker RStudio and start a new session with the RSession base image on an ml.m5.xlarge instance.

Once the session is launched, we will run this code to create a connection to our Amazon Redshift Serverless database.

library(DBI)
library(reticulate)

# Use boto3 (via reticulate) to discover the Redshift Serverless workgroup and namespace
boto3 <- import('boto3')
client <- boto3$client('redshift-serverless')
workgroup <- unlist(client$list_workgroups())
namespace <- unlist(client$get_namespace(namespaceName=workgroup$workgroups.namespaceName))

# Request temporary database credentials for the workgroup (valid for one hour)
creds <- client$get_credentials(dbName=namespace$namespace.dbName,
                                durationSeconds=3600L,
                                workgroupName=workgroup$workgroups.workgroupName)

# Open an ODBC connection to the Redshift Serverless endpoint using those credentials
con <- dbConnect(odbc::odbc(),
                 Driver='redshift',
                 Server=workgroup$workgroups.endpoint.address,
                 Port='5439',
                 Database=namespace$namespace.dbName,
                 UID=creds$dbUser,
                 PWD=creds$dbPassword)

In order to view the tables in the synthetic schema, you will need to grant access in Amazon Redshift via the query editor.

GRANT ALL ON SCHEMA synthetic to "IAMR:SageMakerUserExecutionRole";
GRANT ALL ON ALL TABLES IN SCHEMA synthetic to "IAMR:SageMakerUserExecutionRole";
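
The role name shown above comes from the CloudFormation stack in this walkthrough; if your execution role is named differently, one way to confirm the database user to grant is to query it from the RStudio session over the connection created earlier:

dbGetQuery(con, "select current_user")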

The RStudio Connections pane should show the sagemaker database with schema synthetic and tables cards, transactions, users.

You can click the table icon next to the tables to view 1,000 records.

Note: We have created an R Markdown file with all the code blocks in this post; it can be found in the project GitHub repo.

Now let’s use the DBI package function dbListTables() to view existing tables.

dbListTables(con)

Use dbGetQuery() to pass a SQL query to the database.

dbGetQuery(con, "select * from synthetic.users limit 100")
dbGetQuery(con, "select * from synthetic.cards limit 100")
dbGetQuery(con, "select * from synthetic.transactions limit 100")

We can also use the dbplyr and dplyr packages to execute queries in the database. Let’s count() how many transactions are in the transactions table. But first, we need to install these packages.

install.packages(c("dplyr", "dbplyr", "crayon"))

Use the tbl() function while specifying the schema.

library(dplyr)
library(dbplyr)

users_tbl <- tbl(con, in_schema("synthetic", "users"))
cards_tbl <- tbl(con, in_schema("synthetic", "cards"))
transactions_tbl <- tbl(con, in_schema("synthetic", "transactions"))

Let’s run a count of the number of rows for each table.

count(users_tbl)
count(cards_tbl)
count(transactions_tbl)

So we have 2,000 users; 6,146 cards; and 24,386,900 transactions. We can also view the tables in the console.

transactions_tbl

We can also view what dplyr verbs are doing under the hood.

show_query(transactions_tbl)

Let’s visually explore the number of transactions by year.

transactions_by_year <- transactions_tbl %>%
  count(year) %>%
  arrange(year) %>%
  collect()

transactions_by_year

install.packages(c('ggplot2', 'vctrs'))
library(ggplot2)
ggplot(transactions_by_year) +
  geom_col(aes(year, as.integer(n))) +
  ylab('transactions') 

We can also summarize data in the database as follows:

transactions_tbl %>%
  group_by(is_fraud) %>%
  count()
transactions_tbl %>%
  group_by(merchant_category_code, is_fraud) %>%
  count() %>% 
  arrange(merchant_category_code)

Suppose we want to view fraud using card information. We just need to join the tables and then group them by the attribute.

cards_tbl %>%
  left_join(transactions_tbl, by = c("user_id", "card_id")) %>%
  group_by(card_brand, card_type, is_fraud) %>%
  count() %>% 
  arrange(card_brand)

Now let’s prepare a dataset that could be used for machine learning. Let’s filter the transaction data to just include Discover credit cards while only keeping a subset of columns.

discover_tbl <- cards_tbl %>%
  filter(card_brand == 'Discover', card_type == 'Credit') %>%
  left_join(transactions_tbl, by = c("user_id", "card_id")) %>%
  select(user_id, is_fraud, merchant_category_code, use_chip, year, month, day, time_stamp, amount)

And now let’s do some cleaning using the following transformations:

  • Convert is_fraud to a binary attribute
  • Remove the Transaction string from use_chip and rename it to type
  • Combine year, month, and day into a date object
  • Remove $ from amount and convert it to a numeric data type

discover_tbl <- discover_tbl %>%
  mutate(is_fraud = ifelse(is_fraud == 'Yes', 1, 0),
         type = str_remove(use_chip, 'Transaction'),
         type = str_trim(type),
         type = tolower(type),
         date = paste(year, month, day, sep = '-'),
         date = as.Date(date),
         amount = str_remove(amount, '[$]'),
         amount = as.numeric(amount)) %>%
  select(-use_chip, -year, -month, -day)

Now that we have filtered and cleaned our dataset, we are ready to collect this dataset into local RAM.

discover <- collect(discover_tbl)
summary(discover)

Now we have a working dataset to start creating features and fitting models. We will not cover those steps in depth in this blog, but if you want to learn more about building models in RStudio on SageMaker, refer to Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists.
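
As a small taste of that next step (and only a sketch, using the cleaned discover data frame from above), a baseline fraud classifier could be fit locally with a couple of the prepared features:

# Quick local baseline: logistic regression on two of the cleaned features
baseline <- glm(is_fraud ~ amount + type, data = discover, family = binomial())
summary(baseline)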

Cleanup

To clean up resources and avoid incurring recurring costs, delete the root CloudFormation stack. Also delete all EFS mounts created and any S3 buckets and objects created.

Conclusion

Data analysis and modeling can be challenging when working with large datasets in the cloud. Amazon Redshift is a popular data warehouse that can help users perform these tasks. RStudio, one of the most widely used integrated development environments (IDEs) for data analysis, is often used with the R language. In this blog post, we showed how to use Amazon Redshift and RStudio on SageMaker together to efficiently perform analysis on massive datasets. By using RStudio on SageMaker, users can take advantage of the fully managed infrastructure, access control, networking, and security capabilities of SageMaker, while also simplifying integration with Amazon Redshift. If you would like to learn more about using these two tools together, check out our other blog posts and resources. You can also try using RStudio on SageMaker and Amazon Redshift for yourself and see how they can help you with your data analysis and modeling tasks.

Please add your feedback to this blog, or create a pull request on GitHub.


About the Authors

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.

Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).

Aditi Rajnish is a second-year software engineering student at the University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect scone.

Saiteja Pudi is a Solutions Architect at AWS, based in Dallas, TX. He has been with AWS for more than 3 years, helping customers realize the true potential of AWS by being their trusted advisor. He comes from an application development background and is interested in data science and machine learning.

Read More