Amazon Personalize announces recommenders optimized for Retail and Media & Entertainment

Today, we’re excited to announce the launch of personalized recommenders in Amazon Personalize that are optimized for retail and media and entertainment, making it even easier to personalize your websites, apps, and marketing campaigns. With this launch, we have drawn on Amazon’s rich experience creating personalized user experiences with machine learning (ML) to build recommenders for common personalization use cases. These use case-optimized recommenders deliver personalized experiences that consider the metrics that matter most to your business, the preferences of your individual users, and where in the user journey each user is being served a personalized experience. You can quickly integrate recommenders into any application via easy-to-use APIs.

This post walks you through the process of creating a recommender and getting recommendations for your users.

New personalized recommenders

To realize the true potential of personalization, businesses need to tailor their content to the user journey. For instance, an ecommerce website can recommend products to an existing customer based on their past browsing history (for example, a “Recommended for you” carousel) to drive greater engagement by providing item recommendations that are relevant to that user’s individual interests. On a product detail page, you can upsell products through a “Customers who viewed X also viewed” widget that uses the context of the product your customer is already engaging with. Finally, on the checkout page, a retailer may want to cross-sell products with “Frequently bought together” recommendations to increase average order value.

Similarly, a video-on-demand business can place a widget on their home page that shows the most popular recommendations to highlight the most viewed content across the world in the past week or month. You may want to build a “Because you watched this” widget after videos are watched to provide similar content with a greater chance of driving an increase in the time spent on your platform.

Each touchpoint requires intelligent personalization that understands the user, their current context, and their real-time interests or in-session preferences when delivering recommendations. Businesses today understand the need for and benefits of personalization, but building recommendation systems from the ground up requires significant investments of time and resources, in addition to extensive ML expertise.

With the launch of recommenders, you simply select the use cases you need from a library of recommenders within Amazon Personalize. “Most Viewed,” “Best Sellers,” “Frequently Bought Together,” “Customers who Viewed X also Viewed,” and “Recommended for you” are available for retail, and “Most Popular,” “Because you Watched X,” “More Like X,” and “Top Picks” are available for media and entertainment, with more to come. You select the recommenders for your use cases and Amazon Personalize does the heavy lifting of using ML to generate recommendations that you access through an easy-to-use API.

Recommenders learn from your users’ historical activity as well as their real-time interactions with items in your catalog to adjust to changing user preferences and deliver immediate value to your end users and business. Recommenders fully manage the lifecycle of maintaining and hosting personalized recommendation solutions. This accelerates the time needed to bring a solution to market and ensures that the recommendation solutions you deliver to production stay relevant for your users.

Amazon Personalize enables developers to build personalized user experiences with the same ML technology used by Amazon with no ML expertise required. We make it easy for developers to build applications capable of delivering a wide array of personalization experiences. You can start getting recommendations with Amazon Personalize quickly using a few simple API calls or some clicks on the AWS Management Console. You only pay for what you use, with no minimum fees or upfront commitments. All data is encrypted to be private and secure, and is only used to create your recommendations and segments.

Create a recommender

This section walks through the process of creating a recommender. The first step is to create a domain dataset group, which you can populate by importing historical data from Amazon Simple Storage Service (Amazon S3) or by using data gathered from real-time events.

Each dataset group can contain up to three datasets: Users, Items, and Interactions, with the Interactions dataset being mandatory to create a recommender. Datasets must adhere to the domain-specific schema in order to be used to create the domain-related recommenders.

In this post, we use the Amazon Prime Pantry dataset, which consists of purchase-related data for grocery items, to set up a retail recommender. We have uploaded the interactions dataset under the dataset group Prime-Pantry. You can monitor the status of the data upload through the dashboard for the Prime-Pantry dataset group on the Amazon Personalize console. After the data is imported successfully, choose Create recommenders.
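If you prefer to script this step instead of using the console, the following is a minimal sketch of importing the interactions data with the AWS SDK for Python (Boto3). The bucket path, dataset ARN, and IAM role ARN are placeholders you would replace with your own values:

import boto3

personalize = boto3.client('personalize')

# Start an import job that loads the interactions CSV from Amazon S3.
import_job = personalize.create_dataset_import_job(
    jobName='prime-pantry-interactions-import',
    datasetArn=interactions_dataset_arn,   # ARN of the Interactions dataset
    dataSource={'dataLocation': 's3://your-bucket/prime-pantry/interactions.csv'},
    roleArn=personalize_role_arn           # IAM role that can read the bucket
)

# Poll until the job reports ACTIVE before moving on to create recommenders.
status = personalize.describe_dataset_import_job(
    datasetImportJobArn=import_job['datasetImportJobArn']
)['datasetImportJob']['status']
print(status)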

As of this writing, Amazon Personalize offers five recipes for retail customers and four for media and entertainment customers.

The retail recipes are as follows:

  • Customers who viewed X also viewed – Recommendations for items that customers also viewed when they viewed a given item
  • Frequently bought together – Recommendations for items that customers buy together based on a specific item
  • Popular Items by Purchases – Popular items based on the items purchased by your users
  • Popular Items by Views – Popular items based on items viewed by your users
  • Recommended for you – Personalized recommendations for a given user ensuring that any items previously purchased are filtered out

The recipes for media and entertainment are as follows:

  • Most Popular – Most popular videos
  • Because you watched X – Videos similar to a given video watched by a user
  • More like X – Videos similar to a given video
  • Top picks for you – Personalized content recommendations for a specified user

The following screenshot shows how you can select recommenders based on your business needs and define the names of the recommenders. You use each recommender’s ARN to get recommendations when using the REST APIs. In this example, we create two recommenders. The first recommender is for the use case “Items frequently bought together” and is called PP-ItemsFrequentlyBoughtTogether. We also create a recommender for the use case “Popular Items by Purchases” called PP-PopularItemsByPurchases.

You can toggle Use default recommender configurations and Amazon Personalize automatically chooses the best configuration for the models underlying the recommenders. Then choose Create recommenders to start the model building process.
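If you want to create the same recommenders programmatically, the following is a minimal sketch using Boto3. The dataset group ARN is a placeholder, and the recipe ARN shown is an assumption for the retail “Frequently bought together” use case; check the Amazon Personalize documentation for the exact ARN of each use case:

import boto3

personalize = boto3.client('personalize')

# Create the "Frequently bought together" recommender for the dataset group.
response = personalize.create_recommender(
    name='PP-ItemsFrequentlyBoughtTogether',
    datasetGroupArn=dataset_group_arn,
    recipeArn='arn:aws:personalize:::recipe/aws-ecomm-frequently-bought-together'
)
print(response['recommenderArn'])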

The time taken to create a recommender depends on the data and use cases selected. During this time, Amazon Personalize selects the optimal algorithm for each of the selected use cases, processes the underlying data, and trains a custom private model for your users. You can access all your recommenders and their current status on the Recommenders page.

When the recommender’s status changes to Active, you can choose it to review relevant details about the recommender and test it. Testing helps check the recommendations before you integrate the recommender into your website or application.

The following image shows the test output for a particular item ID for the recommender PP-ItemsFrequentlyBoughtTogether.

At this step, you can also apply any filters on the recommendations; for example, to remove items purchased in the past.
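As a rough illustration of the filter workflow (a sketch under assumptions, not code from this walkthrough), you can create a filter expression and pass its ARN when requesting recommendations. The filter name, the expression, and the dataset_group_arn, recommender_arn, and user_id variables are placeholders, and the event type value must match the EVENT_TYPE values in your Interactions dataset:

import boto3

personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

# Create a filter that excludes items the user has already purchased.
filter_response = personalize.create_filter(
    name='exclude-purchased-items',
    datasetGroupArn=dataset_group_arn,
    filterExpression='EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("Purchase")'
)

# Pass the filter ARN when requesting recommendations from a user-based
# recommender such as "Recommended for you".
recommendations = personalize_runtime.get_recommendations(
    recommenderArn=recommender_arn,
    userId=str(user_id),
    filterArn=filter_response['filterArn']
)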

Amazon Personalize also provides a recommender ARN in the details section, which you can use to produce recommendations through the Amazon Personalize REST APIs. The following code is an example of calling the API from Python for PP-ItemsFrequentlyBoughtTogether:

import boto3
personalize_runtime = boto3.client('personalize-runtime')

get_recommendations_response = personalize_runtime.get_recommendations(
    recommenderArn='arn:aws:personalize:us-west-2:261294318658:recommender/PP-ItemsFrequentlyBoughtTogether',
    itemId=str(item_id)
)

This API call produces the same results as if testing the recommender via the console.

Your recommender is now ready to feed into your website or app and personalize the journey of each of your customers.

Conclusion

Amazon Personalize packages our rich experience creating unique personalized user experiences with ML at Amazon and offers our expertise as a fully managed service to developers looking to personalize their websites and apps. With the launch of use case optimized recommenders, we’re going one step further to tailor our learnings to the unique marketing needs of each industry and each individual business. Recommenders allow you to easily and swiftly access recommendations that are optimized for your specific use case. By understanding the unique context of your customers and their touchpoints, Amazon Personalize allows you to harness the raw power of ML to derive more value for your business and your users.

To learn more about Amazon Personalize, visit the product page.


About the Authors

Anchit Gupta is a Senior Product Manager for Amazon Personalize. She focuses on delivering products that make it easier to build machine learning solutions. In her spare time, she enjoys cooking, playing board/card games, and reading.

Hao Ding is an Applied Scientist at AWS AI Labs and is working on developing next generation recommender system for Amazon Personalize. His research interests include Recommender System, Deep Learning, and Graph Mining.

Pranav Agarwal is a Sr. Software Development Engineer with Amazon Personalize and works on architecting software systems and building AI-powered recommender systems at scale. Outside of work, he enjoys reading, running and has started picking up ice-skating.

Nghia Hoang is a Senior Machine Learning Scientist at AWS AI Labs working on developing personalized learning methods with applications to recommender systems. His research interests include Probabilistic Inference, Deep Generative Learning, Personalized Federated Learning and Meta Learning.

Read More

Artificial intelligence that understands object relationships

When humans look at a scene, they see objects and the relationships between them. On top of your desk, there might be a laptop that is sitting to the left of a phone, which is in front of a computer monitor.

Many deep learning models struggle to see the world this way because they don’t understand the entangled relationships between individual objects. Without knowledge of these relationships, a robot designed to help someone in a kitchen would have difficulty following a command like “pick up the spatula that is to the left of the stove and place it on top of the cutting board.”

In an effort to solve this problem, MIT researchers have developed a model that understands the underlying relationships between objects in a scene. Their model represents individual relationships one at a time, then combines these representations to describe the overall scene. This enables the model to generate more accurate images from text descriptions, even when the scene includes several objects that are arranged in different relationships with one another.

This work could be applied in situations where industrial robots must perform intricate, multistep manipulation tasks, like stacking items in a warehouse or assembling appliances. It also moves the field one step closer to enabling machines that can learn from and interact with their environments more like humans do.

“When I look at a table, I can’t say that there is an object at XYZ location. Our minds don’t work like that. In our minds, when we understand a scene, we really understand it based on the relationships between the objects. We think that by building a system that can understand the relationships between objects, we could use that system to more effectively manipulate and change our environments,” says Yilun Du, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.

Du wrote the paper with co-lead authors Shuang Li, a CSAIL PhD student, and Nan Liu, a graduate student at the University of Illinois at Urbana-Champaign; as well as Joshua B. Tenenbaum, a professor of computational cognitive science in the Department of Brain and Cognitive Sciences and a member of CSAIL; and senior author Antonio Torralba, the Delta Electronics Professor of Electrical Engineering and Computer Science and a member of CSAIL. The research will be presented at the Conference on Neural Information Processing Systems in December.

One relationship at a time

The framework the researchers developed can generate an image of a scene based on a text description of objects and their relationships, like “A wood table to the left of a blue stool. A red couch to the right of a blue stool.”

Their system would break these sentences down into two smaller pieces that describe each individual relationship (“a wood table to the left of a blue stool” and “a red couch to the right of a blue stool”), and then model each part separately. Those pieces are then combined through an optimization process that generates an image of the scene.

The researchers used a machine-learning technique called energy-based models to represent the individual object relationships in a scene description. This technique enables them to use one energy-based model to encode each relational description, and then compose them together in a way that infers all objects and relationships.
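To make the idea of composing energies concrete, here is a deliberately tiny numerical sketch of our own (not the researchers' code, which operates on images): each textual relation contributes one energy term over the scene variables, and a scene, reduced here to 1-D object positions, is found by minimizing the sum of those terms.

import numpy as np

def left_of(a, b, margin=1.0):
    # Low energy when object a sits at least `margin` to the left of object b.
    return max(0.0, a - b + margin) ** 2

def total_energy(scene):
    table, stool, couch = scene
    # "A wood table to the left of a blue stool" + "a red couch to the right of a blue stool"
    return left_of(table, stool) + left_of(stool, couch)

def grad(scene, eps=1e-4):
    # Numerical gradient of the composed energy.
    g = np.zeros_like(scene)
    for i in range(len(scene)):
        step = np.zeros_like(scene)
        step[i] = eps
        g[i] = (total_energy(scene + step) - total_energy(scene - step)) / (2 * eps)
    return g

scene = np.zeros(3)            # start with all three objects at the same spot
for _ in range(500):           # plain gradient descent on the summed energy
    scene = scene - 0.1 * grad(scene)

print(scene)  # the table ends up left of the stool, and the stool left of the couch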

By breaking the sentences down into shorter pieces for each relationship, the system can recombine them in a variety of ways, so it is better able to adapt to scene descriptions it hasn’t seen before, Li explains.

“Other systems would take all the relations holistically and generate the image one-shot from the description. However, such approaches fail when we have out-of-distribution descriptions, such as descriptions with more relations, since these models can’t really adapt one shot to generate images containing more relationships. However, as we are composing these separate, smaller models together, we can model a larger number of relationships and adapt to novel combinations,” Du says.

The system also works in reverse — given an image, it can find text descriptions that match the relationships between objects in the scene. In addition, their model can be used to edit an image by rearranging the objects in the scene so they match a new description.

Understanding complex scenes

The researchers compared their model to other deep learning methods that were given text descriptions and tasked with generating images that displayed the corresponding objects and their relationships. In each instance, their model outperformed the baselines.

They also asked humans to evaluate whether the generated images matched the original scene description. In the most complex examples, where descriptions contained three relationships, 91 percent of participants concluded that the new model performed better.

“One interesting thing we found is that for our model, we can increase our sentence from having one relation description to having two, or three, or even four descriptions, and our approach continues to be able to generate images that are correctly described by those descriptions, while other methods fail,” Du says.

The researchers also showed the model images of scenes it hadn’t seen before, as well as several different text descriptions of each image, and it was able to successfully identify the description that best matched the object relationships in the image.

And when the researchers gave the system two relational scene descriptions that described the same image but in different ways, the model was able to understand that the descriptions were equivalent.

The researchers were impressed by the robustness of their model, especially when working with descriptions it hadn’t encountered before.

“This is very promising because that is closer to how humans work. Humans may only see several examples, but we can extract useful information from just those few examples and combine them together to create infinite combinations. And our model has such a property that allows it to learn from fewer data but generalize to more complex scenes or image generations,” Li says.

While these early results are encouraging, the researchers would like to see how their model performs on real-world images that are more complex, with noisy backgrounds and objects that are blocking one another.

They are also interested in eventually incorporating their model into robotics systems, enabling a robot to infer object relationships from videos and then apply this knowledge to manipulate objects in the world.

“Developing visual representations that can deal with the compositional nature of the world around us is one of the key open problems in computer vision. This paper makes significant progress on this problem by proposing an energy-based model that explicitly models multiple relations among the objects depicted in the image. The results are really impressive,” says Josef Sivic, a distinguished researcher at the Czech Institute of Informatics, Robotics, and Cybernetics at Czech Technical University, who was not involved with this research.

This research is supported, in part, by Raytheon BBN Technologies Corp., Mitsubishi Electric Research Laboratory, the National Science Foundation, the Office of Naval Research, and the IBM Thomas J. Watson Research Center.

Read More

A Very Thankful GFN Thursday: New Games, GeForce NOW Gift Cards and More

Happy Thanksgiving, members. It’s a very special GFN Thursday.

As the official kickoff to what’s sure to be a busy holiday season for our members around the globe, this week’s GFN Thursday brings a few reminders of the joys of PC gaming in the cloud.

Plus, kick back for the holiday with four new games coming to the GeForce NOW library this week.

Game Away the Holiday

With the power of the cloud, any laptop can be a gaming laptop — even a Mac or Chromebook.

The holidays are often spent celebrating with extended family — which is great, until Aunt Petunia starts trying to teach you cross-stitch or Grandpa Harold begins another one of his fishing trip stories. If you need a break from the relatives, get your gaming in, powered by the cloud.

With GeForce NOW, nearly any device can become a GeForce gaming rig. Grab Uncle Buck’s Chromebook and get a few rounds of Apex Legends in, or check in with Star-Lord and the crew from your mobile device in Marvel’s Guardians of the Galaxy. You can even squad up on some MacBooks with your cousins for a few Destiny 2 raids at the kids’ table, where we know the real fun is.

How about escaping for a bit to a tropical jungle? For a limited time, get a copy of Crysis Remastered free with the purchase of a six-month Priority membership or the new GeForce NOW RTX 3080 membership. Terms and conditions apply.

GeForce NOW members can experience the first game in the Crysis series — or 1,000+ more games — across nearly all of their devices, turning even a Mac or a mobile device into the ultimate gaming rig. It’s the perfect way to keep the gaming going after pumpkin pie is served.

The Gift of Gaming

The easiest upgrade in PC gaming makes a perfect gift for gamers.

GeForce NOW Priority Membership digital gift cards are now available in 2-month, 6-month or 12-month options. Give the gift of powerful PC gaming to a special someone who uses a low-powered device, a budding gamer using a Mac, or a squadmate who’s gaming on the go.

Gift cards can be redeemed on an existing GeForce NOW account or added to a new one. Existing Founders and Priority members will have the number of months added to their accounts.

Eat, Play and Be Merry

Ghostrunner on GeForce NOW
Make your way up from the bottom to the top, confront the tyrannical Keymaster and take your revenge in Ghostrunner, streaming on GeForce NOW.

Between bites of stuffing and mashed potatoes, members can look for the following games joining the GeForce NOW library:

We make every effort to launch games on GeForce NOW as close to their release as possible, but, in some instances, games may not be available immediately.

We initially planned to add Farming Simulator 22 to GeForce NOW in November, but discovered an issue during our onboarding process. We hope to add the game in the coming weeks.

Whether you’re celebrating Thanksgiving or just looking forward to a gaming-filled weekend, tell us what you’re thankful for on Twitter or in the comments below.

The post A Very Thankful GFN Thursday: New Games, GeForce NOW Gift Cards and More appeared first on The Official NVIDIA Blog.

Read More

An Introduction to Keras Preprocessing Layers

Posted by Matthew Watson, Keras Developer

Determining the right feature representation for your data can be one of the trickiest parts of building a model. Imagine you are working with categorical input features such as names of colors. You could one-hot encode the feature so each color gets a 1 in a specific index ('red' = [0, 0, 1, 0, 0]), or you could embed the feature so each color maps to a unique trainable vector ('red' = [0.1, 0.2, 0.5, -0.2]). Larger category spaces might do better with an embedding, and smaller spaces with a one-hot encoding, but the answer is not clear cut. It will require experimentation on your specific dataset.

Ideally, we would like updates to our feature representation and updates to our model architecture to happen in a tight iterative loop, applying new transformations to our data while changing our model architecture. In practice, feature preprocessing and model building are usually handled by entirely different libraries, frameworks, or languages. This can slow the process of experimentation.

On the Keras team, we recently released Keras Preprocessing Layers, a set of Keras layers aimed at making preprocessing data fit more naturally into model development workflows. In this post we are going to use the layers to build a simple sentiment classification model with the imdb movie review dataset. The goal will be to show how preprocessing can be flexibly developed and applied. To start, we can import tensorflow and download the training data.

import tensorflow as tf
import tensorflow_datasets as tfds

train_ds = tfds.load('imdb_reviews', split='train', as_supervised=True).batch(32)

Keras preprocessing layers can handle a wide range of input, including structured data, images, and text. In this case, we will be working with raw text, so we will use the TextVectorization layer.

By default, the TextVectorization layer will process text in three phases:

  • First, it removes punctuation and lowercases the input.
  • Next, it splits the text into lists of individual string words.
  • Finally, it maps strings to numeric outputs using a vocabulary of known words.

A simple approach we can try here is a multi-hot encoding, where we only consider the presence or absence of terms in the review. For example, say a layer vocabulary is ['movie', 'good', 'bad'], and a review read 'This movie was bad.'. We would encode this as [1, 0, 1], where movie (the first vocab term) and bad (the last vocab term) are present.

text_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='multi_hot', max_tokens=2500)
features = train_ds.map(lambda x, y: x)
text_vectorizer.adapt(features)

Above, we create a TextVectorization layer with multi-hot output, and do two things to set the layer’s state. First, we map over our training dataset and discard the integer label indicating a positive or negative review. This gives us a dataset containing only the review text. Next, we adapt() the layer over this dataset, which causes the layer to learn a vocabulary of the most frequent terms in all documents, capped at a max of 2500.

Adapt is a utility function on all stateful preprocessing layers, which allows layers to set their internal state from input data. Calling adapt is always optional. For TextVectorization, we could instead supply a precomputed vocabulary on layer construction, and skip the adapt step.
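For example, here is a minimal sketch of constructing the layer with a fixed vocabulary instead of calling adapt(); the three-word vocabulary is purely illustrative:

# Equivalent setup with a precomputed vocabulary, skipping adapt() entirely.
fixed_vocab_vectorizer = tf.keras.layers.TextVectorization(
    output_mode='multi_hot',
    vocabulary=['movie', 'good', 'bad'])

print(fixed_vocab_vectorizer(tf.constant(['This movie was bad.'])))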

We can now train a simple linear model on top of this multi-hot encoding. We will define two functions: preprocess, which converts raw input data to the representation we want for our model, and forward_pass, which applies the trainable layers.

def preprocess(x):
    return text_vectorizer(x)

def forward_pass(x):
    return tf.keras.layers.Dense(1)(x)  # Linear model

inputs = tf.keras.Input(shape=(1,), dtype='string')
outputs = forward_pass(preprocess(inputs))
model = tf.keras.Model(inputs, outputs)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(train_ds, epochs=5)

That’s it for an end-to-end training example, and already enough for 85% accuracy. You can find complete code for this example at the bottom of this post.

Let’s experiment with a new feature. Our multi-hot encoding does not contain any notion of review length, so we can try adding a feature for normalized string length. Preprocessing layers can be mixed with TensorFlow ops and custom layers as desired. Here we can combine the tf.strings.length function with the Normalization layer, which will scale the input to have 0 mean and 1 variance. We have only updated code up to the preprocess function below, but we will show the rest of training for clarity.

# This layer will scale our review length feature to mean 0 variance 1.
normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(features.map(lambda x: tf.strings.length(x)))

def preprocess(x):
    multi_hot_terms = text_vectorizer(x)
    normalized_length = normalizer(tf.strings.length(x))
    # Combine the multi-hot encoding with review length.
    return tf.keras.layers.concatenate((multi_hot_terms, normalized_length))

def forward_pass(x):
    return tf.keras.layers.Dense(1)(x)  # Linear model.

inputs = tf.keras.Input(shape=(1,), dtype='string')
outputs = forward_pass(preprocess(inputs))
model = tf.keras.Model(inputs, outputs)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(train_ds, epochs=5)

Above, we create the normalization layer and adapt it to our input. Within the preprocess function, we simply concatenate our multi-hot encoding and length features together. We learn a linear model over the union of the two feature representations.

The last change we can make is to speed up training. We have one major opportunity to improve our training throughput. Right now, every training step, we spend some time on the CPU performing string operations (which cannot run on an accelerator), followed by calculating a loss function and gradients on a GPU.

With all computation in a single model, we will first preprocess each batch on the CPU and then update parameter weights on the GPU. This leaves gaps in our GPU usage.

This gap in accelerator usage is totally unnecessary! Preprocessing is distinct from the actual forward pass of our model. The preprocessing doesn’t use any of the parameters being trained. It’s a static transformation that we could precompute.

To speed things up, we would like to prefetch our preprocessed batches, so that each time we are training on one batch we are preprocessing the next. This is easy to do with the tf.data library, which was built for uses like this. The only major change we need to make is to split our monolithic keras.Model into two: one for preprocessing and one for training. This is easy with Keras’ functional API.

inputs = tf.keras.Input(shape=(1,), dtype="string")
preprocessed_inputs = preprocess(inputs)
outputs = forward_pass(preprocessed_inputs)

# The first model will only apply preprocessing.
preprocessing_model = tf.keras.Model(inputs, preprocessed_inputs)
# The second model will only apply the forward pass.
training_model = tf.keras.Model(preprocessed_inputs, outputs)
training_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# Apply preprocessing asynchronously with tf.data.
# It is important to call prefetch and remember the AUTOTUNE options.
preprocessed_ds = train_ds.map(
    lambda x, y: (preprocessing_model(x), y),
    num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

# Now the GPU can focus on the training part of the model.
training_model.fit(preprocessed_ds, epochs=5)

In the above example, we pass a single keras.Input through our preprocess and forward_pass functions, but define two separate models over the transformed inputs. This slices our single graph of operations into two. Another valid option would be to only make a training model, and call the preprocess function directly when we map over our dataset. In this case, the keras.Input would need to reflect the type and shape of the preprocessed features rather than the raw strings.
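A rough sketch of that alternative might look like the following; we assume the 2,500-term vocabulary plus the single length feature, giving 2,501 preprocessed values per example, and expand the raw strings to the (batch, 1) shape the preprocess function was written against:

# Alternative: run preprocess() inside the tf.data map and train on the result.
preprocessed_ds = train_ds.map(
    lambda x, y: (preprocess(tf.expand_dims(x, -1)), y),
    num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

preprocessed_inputs = tf.keras.Input(shape=(2501,), dtype='float32')
outputs = forward_pass(preprocessed_inputs)
training_model = tf.keras.Model(preprocessed_inputs, outputs)
training_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
training_model.fit(preprocessed_ds, epochs=5)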

Using tf.data to prefetch batches cuts our train step time by over 30%! Our compute time now looks more like the following:

With tf.data, we are now precomputing each preprocessed batch before the GPU needs it. This significantly speeds up training.

We could even go a step further than this, and use tf.data to cache our preprocessed dataset in memory or on disk. We would simply add a .cache() call directly before the call to prefetch. In this way, we could entirely skip computing our preprocessing batches after the first epoch of training.
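For example, the map call above only needs one extra method call; cache() with no argument keeps the preprocessed batches in memory, and cache('some/path') would spill them to disk instead:

preprocessed_ds = train_ds.map(
    lambda x, y: (preprocessing_model(x), y),
    num_parallel_calls=tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)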

After training, we can rejoin our split model into a single model during inference. This allows us to save a model that can directly handle raw input data.

inputs = preprocessing_model.input
outputs = training_model(preprocessing_model(inputs))
inference_model = tf.keras.Model(inputs, outputs)
inference_model.predict(
    tf.constant(["Terrible, no good, trash.", "I loved this movie!"]))

Keras preprocessing layers aim to provide a flexible and expressive way to build data preprocessing pipelines. Prebuilt layers can be mixed and matched with custom layers and other TensorFlow operations. Preprocessing can be split from training and applied efficiently with tf.data, and joined later for inference. We hope they allow for more natural and efficient iterations on feature representation in your models.

To play around with the code from this post in a Colab, you can follow this link. To see a wide range of tasks you can do with preprocessing layers, see the Quick Recipes section of our preprocessing guide. You can also check out our complete tutorials for basic text classification, image data augmentation, and structured data classification.

Read More

Build MLOps workflows with Amazon SageMaker projects, GitLab, and GitLab pipelines

Machine learning operations (MLOps) is key to effectively transitioning from an experimentation phase to production. The practice provides you the ability to create a repeatable mechanism to build, train, deploy, and manage machine learning models. To quickly adopt MLOps, you often require capabilities that use your existing toolsets and expertise. Projects in Amazon SageMaker give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD (continuous integration, continuous delivery) systems for MLOps engineers. With SageMaker projects, MLOps engineers or organization administrators can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker projects are provisioned using AWS Service Catalog products. Your organization can use project templates to provision projects for each of your users.

In this post, you use a custom SageMaker project template to incorporate CI/CD practices with GitLab and GitLab pipelines. You automate building a model using Amazon SageMaker Pipelines for data preparation, model training, and model evaluation. SageMaker projects builds on Pipelines by implementing the model deployment steps and using SageMaker Model Registry, along with your existing CI/CD tooling, to automatically provision a CI/CD pipeline. In our use case, after the trained model is approved in the model registry, the model deployment pipeline is triggered via a GitLab pipeline.

Prerequisites

For this walkthrough, you should have the following prerequisites:

This post provides a detailed explanation of the SageMaker projects, GitLab, and GitLab pipelines integration. We review the code and discuss the components of the solution. To deploy the solution, reference the GitHub repo, which provides step-by-step instructions for implementing a MLOps workflow using a SageMaker project template with GitLab and GitLab pipelines.

Solution overview

The following diagram shows the architecture we build using a custom SageMaker project template.

Let’s review the components of this architecture to understand the end-to-end setup:

  • GitLab – Acts as our code repository and enables CI/CD using GitLab pipelines. The custom SageMaker project template creates two repositories (model build and model deploy) in your GitLab account.
    • The first repository (model build) provides code to create a multi-step model building pipeline. This includes steps for data processing, model training, model evaluation, and conditional model registration based on accuracy. It trains a regression model using the XGBoost algorithm on the well-known UCI Machine Learning Abalone dataset.
    • The second repository (model deploy) contains the code and configuration files for model deployment, as well as the test scripts required to pass the quality benchmark. These are code stubs that must be defined for your use case.
    • Each repository also has a GitLab CI pipeline. The model build pipeline automatically triggers and runs the pipeline from end to end whenever a new commit is made to the model build repository. The model deploy pipeline is triggered whenever a new model version is added to the model registry, and the model version status is marked as Approved.
  • SageMaker Pipelines – Contains the directed acyclic graph (DAG) that includes data preparation, model training, and model evaluation.
  • Amazon S3 – An Amazon Simple Storage Service (Amazon S3) bucket stores the output model artifacts that are generated from the pipeline.
  • AWS Lambda – Two AWS Lambda functions are created, which we review in more detail later in this post:
    • One function seeds the code into your two GitLab repositories.
    • One function triggers the model deployment pipeline after the new model is registered in the model registry.
  • SageMaker Model Registry – Tracks the model versions and respective artifacts, including the lineage and metadata. A model package group is created that contains the group of related model versions. The model registry also manages the approval status of the model version for downstream deployment.
  • Amazon EventBridge – Monitors all changes to the model registry. It also contains a rule that triggers the Lambda function for the model deploy pipeline when the model package version state changes from PendingManualApproval to Approved in the model registry.
  • AWS CloudFormation – Deploys the model and creates the SageMaker endpoints when the model deploy pipeline is triggered by the approval of the trained model.
  • SageMaker hosting – Creates two HTTPS real-time endpoints to perform inference. The hosting option is configurable, for example, for batch transform or asynchronous inference. The staging endpoint is created when the model deploy pipeline is triggered by the approval of the trained model. This endpoint is used to evaluate the deployed model by confirming it’s generating predictions that meet our target accuracy requirements. When the model is ready to be deployed in production, a production endpoint is provisioned by manually starting the job in the GitLab model deploy pipeline.

Use the new MLOps project template with GitLab and GitLab pipelines

In this section, we review the parameters required for the MLOps project template (see the following screenshot). This template allows you to utilize GitLab pipelines as your orchestrator.

The template has the following parameters:

  • GitLab Server URL – The URL of the GitLab server in https:// format. GitLab accounts under your organization may use a customized server URL (domain). The server URL is required to authorize access to the python-gitlab API. You use the personal access token you created to grant the Lambda functions permission to push the seed code into your GitLab repositories. We discuss the Lambda function code in more detail in the next section.
  • Base URL for your GitLab Repositories – The URL for your GitLab account under which the model build and deploy repositories are created, in the format https://<gitlab server>/<username> or https://<gitlab server>/<group>/<project>. You must create a personal access token under your GitLab user account in order to authenticate with the GitLab API.
  • Model Build Repository Name – The name of the repository that holds the model build and training seed code (for example, mlops-gitlab-project-seedcode-model-build).
  • Model Deploy Repository Name – The name of the repository that holds the model deploy seed code (for example, mlops-gitlab-project-seedcode-model-deploy).
  • GitLab Group ID – GitLab groups are important for managing access and permissions for projects. Enter the ID of the group that repositories are created for. In this example, we enter None, because we’re using the root group.
  • GitLab Secret Name (Secrets Manager) – The secret in AWS Secrets Manager contains the value of the GitLab personal access token that is used by the Lambda function to populate the seed code in the repositories. Enter the name of the secret you created in Secrets Manager.

Lambda functions code overview

As discussed earlier, we create two Lambda functions. The first function seeds the code into your GitLab repositories. The second function triggers your model deployment. Let’s review these functions in more detail.

Seedcodecheckin Lambda function

This function helps create the GitLab projects and repositories and pushes the code files into these repositories. These files are needed to set up the ML CI/CD pipelines.

The Secrets Manager secret is created to allow the function to retrieve the stored GitLab personal access token. This token allows the function to communicate with GitLab to create repositories and push the seed code. It also allows the environment variables to be passed in through the project.yml file. See the following code:

def get_secret():
    '''Retrieve the GitLab personal access token from Secrets Manager.'''
    secret_name = os.environ['SecretName']
    region_name = os.environ['Region']
    
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

The Secrets Manager secret was created when you ran the init.sh file earlier as part of the code repo prerequisites.
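For reference, a sketch of what creating that secret can look like with Boto3 is shown below; the secret name and token value are placeholders, and the actual init.sh script may create the secret differently (for example, through the AWS CLI):

import boto3

secrets = boto3.client('secretsmanager')
secrets.create_secret(
    Name='gitlab-personal-access-token',
    SecretString='<your GitLab personal access token>'
)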

The deployment package for the function contains several libraries, including python-gitlab and cfn-response. Because our function’s source code is packaged as a .zip file and interacts with AWS CloudFormation, we use cfn-response. We use the AWS SDK for Python (Boto3) to download the seed code files from Amazon S3 and the python-gitlab API to push them to our GitLab repositories. See the following code:

    # Configure SDKs for GitLab and S3
    gl = gitlab.Gitlab(gitlab_server_uri, private_token=gitlab_private_token)
    s3 = boto3.client('s3')
 
    model_build_filename = f'/tmp/{str(uuid.uuid4())}-model-build-seed-code.zip'
    model_deploy_filename = f'/tmp/{str(uuid.uuid4())}-model-deploy-seed-code.zip'
    model_build_directory = f'/tmp/{str(uuid.uuid4())}-model-build'
    model_deploy_directory = f'/tmp/{str(uuid.uuid4())}-model-deploy'

    # Get Model Build Seed Code from S3 for Gitlab Repo
    with open(model_build_filename, 'wb') as f:
        s3.download_fileobj(sm_seed_code_bucket, model_build_sm_seed_code_object_name, f)

    # Get Model Deploy Seed Code from S3 for Gitlab Repo
    with open(model_deploy_filename, 'wb') as f:
        s3.download_fileobj(sm_seed_code_bucket, model_deploy_sm_seed_code_object_name, f)

Two projects (repositories) are created in GitLab, and the seed code files are pushed into the repositories (model build and model deploy) using the python-gitlab API:

# Create the GitLab Project
    try:
        if group_id is None:
            build_project = gl.projects.create({'name': gitlab_project_name_build})
        else:
            build_project = gl.projects.create({'name': gitlab_project_name_build, 'namespace_id': int(group_id)})
    ....
    try:
        if group_id is None:
            deploy_project = gl.projects.create({'name': gitlab_project_name_deploy})
        else:
            deploy_project = gl.projects.create({'name': gitlab_project_name_deploy, 'namespace_id': int(group_id)})
    ....
    
    # Commit to the above created Repo all the files that were in the seed code Zip
    try:
        build_project.commits.create(build_data)
    except Exception as e:
        logging.error("Code could not be pushed to the model build repo.")
        logging.error(e)
        cfnresponse.send(event, context, cfnresponse.FAILED, response_data)
        return { 
            'message' : "GitLab seedcode checkin failed."
        }

    try:
        deploy_project.commits.create(deploy_data)
    except Exception as e:
        logging.error("Code could not be pushed to the model deploy repo.")
        logging.error(e)
        cfnresponse.send(event, context, cfnresponse.FAILED, response_data)
        return { 
            'message' : "GitLab seedcode checkin failed."
        }

The following screenshot shows the successful run of the Lambda function pushing the required seed code files into both projects in your GitLab account.

gitlab-trigger Lambda function

This Lambda function is triggered by EventBridge. The project.yml CloudFormation template contains an EventBridge rule that triggers the function when the model package state changes in the SageMaker model registry. See the following code:

ModelDeploySageMakerEventRule:
    Type: AWS::Events::Rule
    Properties:
      # Max length allowed: 64
      Name: !Sub sagemaker-${SageMakerProjectName}-${SageMakerProjectId}-event-rule # max: 10+33+15+5=63 chars
      Description: "Rule to trigger a deployment when SageMaker Model registry is updated with a new model package. For example, a new model package is registered with Registry"
      EventPattern:
        source:
          - "aws.sagemaker"
        detail-type:
          - "SageMaker Model Package State Change"
        detail:
          ModelPackageGroupName:
            - !Sub ${SageMakerProjectName}-${SageMakerProjectId}
      State: "ENABLED"
      Targets:
        -
          Arn: !GetAtt GitLabPipelineTriggerLambda.Arn
          Id: !Sub sagemaker-${SageMakerProjectName}-trigger

The following screenshot contains a subset of the function code that triggers the GitLab pipeline in the .gitlab-ci.yml file. It deploys the SageMaker model endpoints using the CloudFormation template endpoint-config-template.yml in your model deploy repository.
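The screenshot isn't reproduced here, but the following is a rough sketch of what such a trigger function can look like. It assumes GitLab's pipeline trigger REST endpoint and environment variables for the server URL, project ID, trigger token, and branch, all of which are placeholders; the repository's actual function may differ:

import json
import os
import urllib.parse
import urllib.request

def lambda_handler(event, context):
    # Only react when the model package was approved in the model registry.
    if event['detail'].get('ModelApprovalStatus') != 'Approved':
        return {'message': 'Model package not approved, nothing to deploy.'}

    gitlab_server = os.environ['GitlabServer']        # e.g. https://gitlab.com
    project_id = os.environ['DeployProjectId']        # numeric ID of the model deploy project
    data = urllib.parse.urlencode({
        'token': os.environ['PipelineTriggerToken'],  # GitLab pipeline trigger token
        'ref': 'main',                                # branch whose .gitlab-ci.yml should run
    }).encode()

    url = f'{gitlab_server}/api/v4/projects/{project_id}/trigger/pipeline'
    with urllib.request.urlopen(urllib.request.Request(url, data=data)) as resp:
        return json.loads(resp.read())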

To better understand the solution, review the entire code for the functions as needed.

GitLab and GitLab pipelines overview

As described earlier, GitLab plays a key role as the source code repo and enabling CI/CD pipelines in this solution. Let’s look into our GitLab account to understand the components.

After the project is successfully created, using our custom template in SageMaker projects per the steps in the code repo, navigate to your GitLab account to see two new repositories. Each repository has a GitLab CI pipeline associated with it that runs as soon as the project is created.

The first run of each pipeline fails because GitLab doesn’t have the AWS credentials. For each repository, navigate to Settings, CI/CD, Variables. Create two new variables, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, with the AWS credentials that GitLab should use to deploy resources.

Model build pipeline in GitLab

Let’s review the GitLab pipelines, starting with the model build pipeline. We define the pipelines in GitLab by creating the .gitlab-ci.yml file, where we define the various stages and related jobs. As shown in the following screenshot, this pipeline has only one stage (training) and the related script shows how a SageMaker pipeline file is triggered. (You can learn more about the SageMaker pipeline by exploring the pipeline.py file on GitHub.)

When this GitLab pipeline is triggered, it starts the Abalone SageMaker pipeline to build your model.

When the model build is complete, you can locate this model in the model registry in SageMaker Studio.

Use this template for your custom use case

The model build repository contains code for preprocessing, training, and evaluating the model for the UCI Abalone dataset. You need to modify the files to address your custom use case.

  1. Navigate to the pipelines folder in your model build repository.

  2. Upload your dataset to an S3 bucket. Replace the bucket URL in this section of your pipeline.py file.

  3. Navigate to .gitlab-ci.yml and modify this section with the folder and file of your use case.

Model deployment pipeline in GitLab

When the SageMaker pipeline that trains the model is complete, a model is added to the SageMaker model registry. If that model is approved, the GitLab pipeline in the model deploy repository starts and the model deployment process begins.

To approve the model in the model registry, complete the following steps:

  1. Choose the Components and registries icon.
  2. Choose Model registry, and choose (right-click) the model version.
  3. Choose Update model version status.
  4. Change the status from Pending to Approved.

This triggers the deploy pipeline.
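You can also approve the model package programmatically; the model package ARN below is a placeholder for the version you want to promote:

import boto3

sm_client = boto3.client('sagemaker')
sm_client.update_model_package(
    ModelPackageArn='arn:aws:sagemaker:<region>:<account>:model-package/<group-name>/<version>',
    ModelApprovalStatus='Approved'
)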

Now, let’s review the .gitlab-ci.yml file in the model deploy repository. As shown in the following screenshot, this model deploy pipeline has four stages: build, staging deploy, test staging, and production deploy. This pipeline uses AWS CloudFormation to deploy the model and create the SageMaker endpoints.

A manual step in the GitLab pipeline exists for model promotion from staging to production that creates an endpoint with the suffix -prod. If you choose manual, this job runs and upon completion deploys the SageMaker endpoint.

To verify that the endpoints were created, navigate to the Endpoints page on the SageMaker console. You should see two endpoints: <model_name>-staging and <model_name>-prod.

GitLab implementation patterns

In this section, we discuss two GitLab scenarios: hosting GitLab in an Amazon Virtual Private Cloud (Amazon VPC), and using GitLab with two-factor authentication enabled.

Hosting GitLab in an Amazon VPC

You may choose to deploy GitLab in an Amazon VPC to use a private network and provide access to AWS resources. In this scenario, the Lambda functions also must be deployed in a VPC to access the GitLab API. We accomplish this by updating the project.yml file and the AWS Identity and Access Management (IAM) role AmazonSageMakerServiceCatalogProductsUseRole.

The IAM user that you used to create the VPC requires the following user permissions for Lambda to verify network resources:

  • ec2:DescribeSecurityGroups
  • ec2:DescribeSubnets
  • ec2:DescribeVpcs

The Lambda functions’ execution role requires the following permissions to create and manage network interfaces:

  • ec2:CreateNetworkInterface
  • ec2:DescribeNetworkInterfaces
  • ec2:DeleteNetworkInterface

To grant these permissions, complete the following steps:

  1. On the IAM console, search for AmazonSageMakerServiceCatalogProductsUseRole.
  2. Choose Attach policies.
  3. Search for the AWSLambdaVPCAccessExecutionRole managed policy.
  4. Choose Attach policy.

Next, we update project.yml to configure the functions to deploy in a VPC by providing the VPC security groups and subnets.

    1. Add the subnet IDs and security group IDs to the Parameters section, for example:

      SubnetId1:
        Type: AWS::EC2::Subnet::Id
        Description: Subnet Id for Lambda function

      SubnetId2:
        Type: AWS::EC2::Subnet::Id
        Description: Subnet Id for Lambda function

      SecurityGroupId:
        Type: AWS::EC2::SecurityGroup::Id
        Description: Security Group Id for Lambda function to Execute

    2. Add the VpcConfig information under Properties for the GitSeedCodeCheckinLambda and GitLabPipelineTriggerLambda functions, for example:

      GitSeedCodeCheckinLambda:
        Type: 'AWS::Lambda::Function'
        Properties:
          Description: To trigger the codebuild project for the seedcode checkin
          .....
          VpcConfig:
            SecurityGroupIds:
              - !Ref SecurityGroupId
            SubnetIds:
              - !Ref SubnetId1
              - !Ref SubnetId2

Two-factor authentication enabled

If you enabled two-factor authentication on your GitLab account, you need to use your personal access token to clone the repositories in SageMaker Studio. The token requires the read_repository and write_repository flags. To clone the model build and model deploy repositories, enter the following commands:

git clone https://oauth2:PERSONAL_ACCESS_TOKEN@gitlab.com/username/gitlab-project-seedcode-model-build-<project-id>
git clone https://oauth2:PERSONAL_ACCESS_TOKEN@gitlab.com/username/gitlab-project-seedcode-model-deploy-<project-id>

Because you previously created a secret for your personal access token, no changes are required to the code when two-factor authentication is enabled.

Summary

In this post, we walked through using a custom SageMaker MLOps project template to automatically build and configure a CI/CD pipeline. This pipeline incorporated your existing CI/CD tooling with SageMaker features for data preparation, model training, model evaluation, and model deployment. In our use case, we focused on using GitLab and GitLab pipelines with SageMaker projects and pipelines. For more detailed implementation information, review the GitHub repo. Try it out and let us know if you have any questions in the comments section!


About the Authors

Kirit Thadaka is an ML Solutions Architect working in the Amazon SageMaker Service SA team. Prior to joining AWS, Kirit spent time working in early stage AI startups followed by some time in consulting in various roles in AI research, MLOps, and technical leadership.

Lauren Mullennex is a Solutions Architect based in Denver, CO. She works with customers to help them architect solutions on AWS. In her spare time, she enjoys hiking and cooking Hawaiian cuisine.

Indrajit Ghosalkar is a Sr. Solutions Architect at Amazon Web Services based in Singapore. He loves helping customers achieve their business outcomes through cloud adoption and realize their data analytics and ML goals through adoption of DataOps / MLOps practices and solutions. In his spare time, he enjoys playing with his son, traveling and meeting new people.

Read More

Simplified MLOps with Deep Java Library

This is a guest post by Lucas Baker, Andrea Duque, and Viet Yen Nguyen of Hypefactors.  

At Hypefactors, we build tech for media intelligence and reputation management. The solution is a software as a service (SaaS) product that does large-scale media monitoring of social media, news sites, TV, radio, and reviews across the world. The tracked data is streamed continuously and enriched in real time. This yields insights that can reveal early business opportunities (for example, GameStop hype), track the success of product launches, and preempt disasters.

To this end, over a hundred million network requests are made daily from data pipelines for web crawling, social media firehoses, and other REST-based media data integrations. This yields millions of new articles and posts each day. This data can be segmented into three classes (as illustrated with the following examples):

  • Owned – Articles or posts written by a company and published on their own website or social media feed.
  • Paid – Information written by a company and published on third-party websites or social media. This is known colloquially as advertisement.
  • Earned – Information written by a third party and published on that party’s website or social media.
Example images: owned media, earned media, and paid media.

Differentiating between earned articles and owned or paid ones is of existential importance. Earned information is more independent and therefore interpreted as more trustworthy—no matter if it’s positive or negative for the company. Advertisement, on the other hand, is written by the company and portrays the best interests of the company. Therefore, to accurately track reputation, we must filter out advertisements.

This post goes deeper into our deep learning natural language processing (NLP) based advertisement predictor, how we integrated the predictor into one of our pipelines using Deep Java Library (DJL), and how that change made our architecture simpler and MLOps easier. DJL is an open source Java framework for deep learning created by AWS.

Printed newspapers and magazines: Challenges

We receive thousands of different magazines and newspapers directly from publishing houses in the form of digital files. One of the data teams within Hypefactors has developed a data pipeline, which we call the Print-ETL. The Print-ETL processes the raw data and ingests it into a database. The ingested data is made searchable in a user-friendly way by the Hypefactors web platform.

Processing and realigning data from different data providers is generally challenging, and handling different publishing houses as data providers is no exception. The challenges are technical, organizational, and a combination thereof. That is partly because media houses rely on legacy data delivery mechanisms and data formats.

Organizational challenges include disagreement between different media houses on how media data should be delivered, and the lack of a common schema. A common strategy media houses use is to provide print data via an SFTP server. This can be consumed by periodically connecting and fetching the data. Most of the time we retrieve only the digital PDF files of the editions, but they can also arrive in other formats, such as XML or ZIP. On top of that, files often come with no relevant metadata about the publication. Such metadata is useful, for example, to identify the title of the newspaper or the magazine.

The technical challenges are various. However, when it comes to PDFs, one of the biggest challenges is that a PDF may or may not be vectorized. A vectorized PDF, as opposed to a bitmapped one, is one that contains all the raw data that appears on the page. When a PDF is vectorized, it’s easy to retrieve its text. But when it’s not, all we have are bitmapped images. To make articles searchable for users, the content of a bitmapped PDF needs to be transformed to a text format using optical character recognition (OCR) solutions.

Another big challenge is that PDFs can have any number of pages. Typically, there is no information telling us which pages constitute an article. There can be several articles sharing one PDF page, or several PDF pages containing a single article. Advertisements also appear anywhere—they can cover the whole page, several pages, or just a small section close to an article.

To mitigate these difficulties, we developed elaborate development and operations procedures. These are assisted by automated procedures, such as automated unit and end-to-end testing, as well as automated testing, staging, and production rollouts. Operations therefore play an essential role to keep the overall solution running.

Print-ETL architecture

The data pipeline processes events, in which each event contains a file retrieved from a media house. These events are processed in a distributed and concurrent manner by subscribing to a message topic. We use Monix, a Scala library for asynchronous computation, to process the events with high performance. Ideally, we process data as soon as it arrives, but we don’t have control over when data is released. Therefore, we have periodic peak loads of these events. At other times, there are no events at all. The whole system is deployed in the cloud to make use of its elasticity. Cloud instances are auto scaled proportionally to the number of events received, so naturally the more data we receive, the more resources we use to process that data.

The Print-ETL uses deep learning and other AI techniques to solve most print media challenges and extract the relevant information out of the raw print data. There are several AI and machine learning (ML) models in place. These include computer vision models (for page segmentation) and NLP models (for ad prediction, headline detection, and next sentence prediction).

Deploying deep learning models incurs complexity of its own. Correspondingly, new practices for managing the ML lifecycle in production reliably and efficiently have come into the spotlight: the emerging field of MLOps. In our use case, we use Deep Java Library (DJL) to integrate ML models into our data pipelines written in Scala. We found that this strategy simplifies model deployment and maintenance alike. In this post, we focus on the model we use to filter paid advertisements: the ad predictor.

The following diagram illustrates the Print-ETL architecture.

First version ad predictor: Serverless inference

We approached the advertisement classification challenge as a supervised binary text classification problem. We fine-tuned a BERT (Bidirectional Encoder Representations from Transformers) pre-trained multilingual base model with a binary classification layer on top of the transformer output. For training, we used a custom-built dataset of advertisement data that we collected. The input of the model is a sequence of tokens, and the output is a classification score between 0 and 1: the probability that the input is an ad. This score is calculated by applying a sigmoid function to the linear layer's prediction outputs (logits).
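Concretely, the final scoring step is just a sigmoid over the single logit produced by the classification head, along these lines:

  // The binary head emits one logit; the sigmoid maps it to an ad probability in [0, 1].
  def sigmoid(logit: Double): Double = 1.0 / (1.0 + math.exp(-logit))

  val adScore = sigmoid(2.3)   // ≈ 0.91, i.e. very likely an advertisement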

On our first iteration, we deployed a standalone ad predictor endpoint on an external service. This made operations harder: predictions had higher latency because of network calls and boot-up times, which caused timeouts, and the predictor was occasionally unavailable due to instance interruptions. We also had to auto scale both the data pipeline and the prediction service, which was non-trivial given the unpredictable load of events. However, this strategy also had a few benefits. The service was packaged separately as an API and developed in Python, a language more familiar to data scientists than Scala. Also, because the predictor wasn't integrated into the Print-ETL system, maintaining it didn't require familiarity with that system.

The following diagram illustrates our BERT model for text classification.

The following is an example of our ads data.

Second version with DJL

Our solution to these challenges centered on combining the benefits of two frameworks: Open Neural Network Exchange (ONNX) and Deep Java Library.

With ONNX and DJL, we deployed a new multilingual ad predictor model directly in our pipeline, replacing our first solution, the serverless ad predictor. The new model was fine-tuned on a new, larger dataset containing over 450,000 sentences in Danish, English, and Portuguese, reflecting a sample of the production data we currently process.

When deploying the model, DJL enabled us to adopt an API-free strategy. This improved our data processing in myriad ways. For instance, it helped us achieve our latency requirements and use ML inferences in real time. Also, by replacing our standalone ad predictor, we no longer needed to mock an external service API in our tests, which simplified our test suite and in turn made it more stable. Following this successful deployment, DJL allowed us to integrate other ML models that improved data processing even further.

Let’s go into the details of ONNX and DJL.

ONNX

ONNX is an open-source ecosystem of AI and ML tools designed to provide extensive interoperability between different deep learning frameworks, and it can handle models from different languages and environments. Its tools and common file format enable us to train a model using one framework, dynamically quantize it using tools from another, and deploy it using yet another. That interoperability, along with help from DJL, allowed us to easily integrate our model with the JVM, and consequently with our Scala pipeline.

More specifically, we used a tool called ONNX Runtime. We converted our original PyTorch model to the standard ONNX file format and then applied dynamic quantization techniques using ONNX Runtime. This shrank our original model size by about a factor of four with little to no loss in model performance, and it also gave our model a speed boost on CPU-based inferences. In particular, prior rollouts had shown us that 8-bit quantization delivers simple yet cost-effective performance when running on CPUs with AVX-512 instructions, so we were confident that this strategy would give us the results we were looking for.

Deep Java Library

DJL presented the other half of our solution. DJL is an open-source library that defines a Java-based deep learning framework. DJL abstracts away complexities involved with deep learning deployments, making training and inference a breeze. It’s engine agnostic, and is therefore compatible with a wide variety of deep learning engines. Those engines include PyTorch, TensorFlow, and MXNet, among others. Most importantly for us, DJL supports the ONNX Runtime engine.
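To give a feel for the integration, the following sketch loads an ONNX model through DJL's ONNX Runtime engine and runs a single inference. The model path, the input layout (token IDs plus attention mask), and the hard-coded token IDs are placeholder assumptions; in the real pipeline, inputs come from the BERT tokenizer.

  import java.nio.file.Paths
  import ai.djl.ndarray.{NDList, NDManager}
  import ai.djl.repository.zoo.Criteria
  import ai.djl.translate.NoopTranslator

  object AdPredictorSketch extends App {
    // Load the quantized ONNX model with the ONNX Runtime engine.
    val criteria = Criteria.builder()
      .setTypes(classOf[NDList], classOf[NDList])
      .optModelPath(Paths.get("models/ad_predictor.onnx"))   // placeholder path
      .optEngine("OnnxRuntime")
      .optTranslator(new NoopTranslator())
      .build()

    val model = criteria.loadModel()
    val predictor = model.newPredictor()
    val manager = NDManager.newBaseManager()

    try {
      // Placeholder token IDs; production inputs come from the multilingual BERT tokenizer.
      val inputIds = manager.create(Array(101L, 2023L, 2003L, 2019L, 4748L, 102L)).reshape(1L, 6L)
      val attentionMask = manager.create(Array(1L, 1L, 1L, 1L, 1L, 1L)).reshape(1L, 6L)

      val output = predictor.predict(new NDList(inputIds, attentionMask))
      val logit = output.singletonOrThrow().toFloatArray()(0)
      val adProbability = 1.0 / (1.0 + math.exp(-logit))
      println(f"ad probability: $adProbability%.3f")
    } finally {
      manager.close()
      predictor.close()
      model.close()
    }
  }

One detail worth noting: a DJL Predictor instance isn't thread-safe, so concurrent pipelines typically create one predictor per worker while sharing the loaded model.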

Our DJL-based deployment brought several advantages over our original ad predictor deployment. First and foremost, from an engineering perspective, it was simpler. The direct native integration of ad prediction with our Scala data pipeline streamlined our architecture considerably. It allowed us to avoid the computational overhead of serializing and deserializing data, as well as the latency of making network calls to an external service.

Additionally, this meant that there was no longer any need for complicated autoscaling of an external service—the pipeline’s existing autoscaling infrastructure was sufficient to meet all our data processing requirements. Moreover, DJL’s predictor architecture worked well with Monix’s concurrent data processing, allowing us to make multiple inferences simultaneously across different threads.

Those simplifications allowed us to eliminate our standalone ad predictor service entirely, removing all operational costs associated with running and maintaining it.

Another consequence of those simplifications was the further streamlining of our test suite. For example, we no longer needed to mock our ad predictor. We could instead directly ensure the correctness and performance of our model on every commit using our continuous integration (CI). Upon every new commit pushed to the Print-ETL, our CI would run our suite of tests, which included tests for the DJL-based ad predictor. This maintains our confidence that our deep learning model works properly whenever we change our code base.

The following screenshot is a snippet of our ad detection CI in action.

Our testing strategy is now twofold. First, we use tests to validate the ad predictor model's output; namely, the model should detect the same ads with the same or higher accuracy as previous iterations. Second, we stress the model's robustness by passing particularly long, short, strange, or fragmented text samples. End-to-end performance tests that exercise the ad predictor add a second layer of accountability, making sure that current and future deployments function as intended. If the ad predictor isn't performing as expected, our tests immediately reflect that. The following code is an example of some sample test cases:

  /** Some sample test cases */
  it should "detect ads in danish, english, and portuguese" in {
    // Danish ad copy ("A little better than other good cheeses")
    val daAdSentence = "Lidt bedre end andre gode oste"
    val daAdLikelihood = AdDetector.predict(daAdSentence)
    daAdLikelihood.success.value should be > 0.9d

    // English ad copy
    val enAdSentence = "Save 10% when you buy in the next ten minutes!"
    val enAdLikelihood = AdDetector.predict(enAdSentence)
    enAdLikelihood.success.value should be > 0.9d

    // Portuguese ad copy ("Protect your health by having YOGHURT")
    val ptAdSentence = "Defenda a sua saúde, tomando YOGHURT"
    val ptAdLikelihood = AdDetector.predict(ptAdSentence)
    ptAdLikelihood.success.value should be > 0.9d
  }

This, in turn, simplified our operations strategy as well. It's now easier to spot, track, and reproduce inference errors if and when they occur. Such an error immediately tells us which input the model failed on and the exact error message given by ONNX Runtime, along with the information needed to reproduce the error. Also, because our ad predictor is now integrated with our data pipeline, we only need to consult one log stream when analyzing error messages. After the associated bug is reproduced and fixed, we can add a new test case to ensure the same bug doesn't occur again.
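For example, a regression test for a previously failing input could look like the following. The sample text is hypothetical; real regression cases are taken from the actual failing inputs.

  it should "not fail on fragmented text that previously caused an inference error" in {
    // Hypothetical input reconstructed from a past bug report.
    val fragmented = "SALE -- 50% of%%f ev  ery th ing"
    val likelihood = AdDetector.predict(fragmented)
    // The prediction must succeed; the exact score matters less than not failing.
    likelihood.success.value should (be >= 0.0d and be <= 1.0d)
  }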

Conclusion and next steps

We have been happy with our DJL-based deployment. Our success with DJL has encouraged us to use the same strategy to deploy other deep learning models for other purposes, such as headline detection and next sentence prediction. In each of those cases, we saw similar results as with our ad predictor: deployment was simple, fast, and economical.

In the future, one avenue we are excited to explore with DJL is GPU-based inference. Our current DJL deployments are exclusively CPU based, partly for cost-effectiveness and partly for simplicity compared with a GPU-based alternative. Given our experiences with DJL, however, we believe that DJL could drastically streamline any GPU-based deployment we pursue. To learn more and get started with DJL, visit the website. You can also visit the GitHub repo, demo repository, examples, Slack channel, and Twitter for more documentation and examples of DJL!

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Lukas Baker works in the intersection of data engineering and applied machine learning. At Hypefactors, he occasionally builds a data pipeline and designs and trains a model in between.

Andrea Duque is an all-round engineer and scientist with a history of connecting the dots with MLOps. At Hypefactors, she designs and rolls out ML-heavy data pipelines end to end.

Viet Yen Nguyen is the CTO of Hypefactors and leads the teams on data science, web app development, and data engineering. Prior to Hypefactors, he developed technology for designing mission-critical systems, including for the European Space Agency.

Read More

How Careem is detecting identity fraud using graph-based deep learning and Amazon Neptune

This post was co-written with Kevin O’Brien, Senior Data Scientist in Careem’s Integrity Team.

Dubai-based Careem became the Middle East’s first unicorn when it was acquired by Uber for $3.1 billion in 2019. A pioneer of the region’s ride-hailing economy, Careem is now expanding its services to include mass transportation, delivery, and payments as an everyday super app.

But its size and popularity—it has around 50 million customer accounts—have also made it a prime target for fraudsters constantly looking for new loopholes to exploit and different ways to hijack genuine accounts.

In this post, we share how Careem detects identity fraud using graph-based deep learning and Amazon Neptune.

The challenge

Due to Careem’s massive popularity, fraudsters are constantly looking for new loopholes to exploit, ways to create accounts with faked identities (first-party fraud), and ways to hijack genuine accounts, also known as account takeover (third-party fraud). Careem’s data science and analytics backed Integrity team needed more advanced ways to detect and stop losses from fraud that could damage both revenue and brand reputation. This solution would ideally cover both first- and third-party fraud.

Traditionally, tackling these different kinds of fraudulent activities was a never-ending game of cat and mouse. Careem’s Integrity team would often create rules or machine learning (ML) models for each specific type of fraud, but this was sometimes problematic on two levels:

  • It only allowed them to identify and block an account after the fraud had been committed and detected, which means the money had already been lost
  • Fraudsters were quickly able to find a new loophole to exploit once an existing fraud pattern had been detected

As a result, instead of continuously creating overly specific tools to detect very specific fraud patterns, they wanted to build an intelligent system that could act as a blanket detection mechanism over all users, wherever they were performing actions on the platform.

The new approach

Careem needed to be proactive rather than reactive; they required a smarter and faster way to detect fraudulent activities and stop them before the act was committed.

After much experimentation, Careem decided to focus on the identity of users, and came up with a powerful way to outsmart any efforts of identity fraud. They opted to use a graph structure as a way of mapping different aspects and data points of each user’s identity together, and more importantly, characteristics shared across the identities of different users. This would allow them to detect potentially fraudulent patterns in real time across user and account activity.

Architecture overview

Before we dive deep into how Careem used Neptune to build an identity graph for fraud detection, let’s look at the current architecture underpinning the solution. Careem chose AWS, with its automated real-time analysis and monitoring capabilities, because of the integrated AWS cloud setup they already had.

Data ingestion

Data ingestion comprises two stages: a one-time extract, transform, and load (ETL) for all historical data, and a live streaming service of real-time data.

  • Historical data – Careem uses Apache Hive running on Amazon Simple Storage Service (Amazon S3) to extract data and push it to Amazon EMR with PySpark. Amazon EMR pushes this historical data to Neptune.
  • Real-time data – Careem uses their existing event processor to feed the data from all actions performed by users through Amazon Simple Queue Service (Amazon SQS). These events are consumed by a Python interface running on AWS Elastic Beanstalk, which takes these events and writes them to Neptune in real time.

Data querying

The data ingested from these sources is then queried, again using the Python interface running on Elastic Beanstalk. A simple set of logical rules processes the data returned for a query on a particular user, and a decision is made on whether the action performed was likely to be done by a fraudster. Based on the value of the user’s historical transactions, the suspected account is either blocked automatically, for low-value customers, or sent for manual review, for high-value customers.
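Conceptually, the decision logic is close to the following sketch. The signals, thresholds, and value cutoff are illustrative placeholders rather than Careem's actual rules.

  // Illustrative decision rule over the signals returned by the graph query for one user.
  final case class IdentitySignals(
    accountsSharingDevice: Int,   // other accounts seen on the same device
    accountsSharingCard: Int,     // other accounts using the same payment instrument
    lifetimeValue: Double         // value of the user's historical transactions
  )

  sealed trait Decision
  case object Allow extends Decision
  case object BlockAutomatically extends Decision
  case object SendForManualReview extends Decision

  def decide(s: IdentitySignals): Decision = {
    val suspicious = s.accountsSharingDevice > 5 || s.accountsSharingCard > 3
    if (!suspicious) Allow
    else if (s.lifetimeValue < 100.0) BlockAutomatically   // low-value customer: block right away
    else SendForManualReview                               // high-value customer: a human reviews
  }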

Data consumption

The Integrity team at Careem developed a data consumption API that is used by the other teams at Careem to query users in the graph to retrieve data about their identities.

Implementing the graph data model on Neptune

The basic building blocks of any directed graph are vertices (or nodes) and edges. A vertex is an object that represents an entity in your data. For example, a customer can be a node, and the features and information about that customer are called node properties. An edge represents a connection between different nodes. For example, we may have an edge labeled has_device that connects a customer node to a device node. A large collection of different nodes and edges is called a graph, as illustrated in the following diagram.
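In Gremlin terms, creating a customer vertex, a device vertex, and a has_device edge between them looks roughly like the following sketch, written here with the Apache TinkerPop Java driver from Scala. The endpoint, property names, and authentication setup are placeholders; Careem's production writes go through the Python interface described earlier.

  import org.apache.tinkerpop.gremlin.driver.Cluster
  import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection
  import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal

  object GraphWriteSketch extends App {
    // Placeholder Neptune endpoint; real clusters also need IAM auth and SSL configuration.
    val cluster = Cluster.build("my-neptune-endpoint.cluster-xxxx.us-east-1.neptune.amazonaws.com")
      .port(8182)
      .enableSsl(true)
      .create()

    val g = traversal().withRemote(DriverRemoteConnection.using(cluster))

    // Create a customer node, a device node, and link them with a has_device edge.
    g.addV("customer").property("customerId", "c-123").property("is_fraud", false).as("c")
      .addV("device").property("deviceId", "d-456").as("d")
      .addE("has_device").from("c").to("d")
      .iterate()

    cluster.close()
  }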

One type of graph architecture is called an identity graph. Identity graphs provide a single unified view of different identities by linking multiple node identifiers such as device IDs, IP addresses, emails, or credit cards to a known person or anonymous profile using privacy-compliant methods. Typically, identity graphs are part of a larger identity resolution architecture. Identity resolution is the process of matching a human identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes. We can then use this identity graph to find patterns in our data that could indicate fraud activities. We can evaluate identities in the context of other identities or transactions and determine if constellations of data in the graph represent fraudulent activity.

The task we are solving in this case is called node classification: a supervised ML approach in which we predict a categorical property of a node. In this case, we decided to build a graph model to predict the is_fraud property of customer nodes using Amazon Neptune ML. Neptune ML is a feature of Neptune that makes it easy to build and train ML models on large graphs using graph neural networks (GNNs). It uses Amazon SageMaker and the Deep Graph Library (DGL) to scale the training and tuning of the graph model.

Data labeling strategy and maturity

In addition to building the graph from different data sources, we needed a robust data labeling and data maturity strategy for the supervised learning task. Data maturity is the process of making sure that the fraud labels have had sufficient time to mature. In other words, enough time has passed to ensure legitimate and fraud records have been correctly and accurately identified. The maturity period can vary depending on the business. For example, for chargeback fraud, it can take somewhere between 30 days and 2 months to accurately identify fraudulent events.

Careem’s customer nodes in the graph were labeled as fraudulent if they had historically been blocked for fraud either manually or by another one of Careem’s automated, rule-based fraud detection systems. These labels are added to the graph either in the historical ETL, for users who are already blocked, or in live streaming, which blocks users in real time. They ensured the maturity of these labels by only using fraud labels for blocked users who hadn’t contacted customer care to request a review of their block within a set period of time after being blocked.

One issue that arose was that many fraudulent accounts had gone undetected, and the volume of these mislabeled customer nodes was substantial enough to affect the training performance of the model. To combat this, a strict set of heuristics, based on domain knowledge of the platform, was applied to the customers in the graph, which allowed a large number of labels in the training dataset to be corrected with high confidence using a script. This reduced label noise and allowed the model to learn more accurately.
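As a simplified illustration of the idea (the actual heuristics are based on Careem's domain knowledge and are more involved), a correction pass over the training labels might look like this:

  // Hypothetical heuristic: a customer labeled as legitimate that shares a device with
  // several confirmed fraudsters is likely a missed fraud case, so relabel it for training.
  final case class CustomerNode(id: String, isFraudLabel: Boolean, confirmedFraudNeighbors: Int)

  def correctLabels(customers: Seq[CustomerNode], neighborThreshold: Int = 3): Seq[CustomerNode] =
    customers.map { c =>
      if (!c.isFraudLabel && c.confirmedFraudNeighbors >= neighborThreshold)
        c.copy(isFraudLabel = true)
      else c
    }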

Collaboration with AWS on Neptune ML

Throughout this project, Careem’s Integrity team worked closely with the AWS ML Specialist and Neptune ML teams to develop this project with maximum efficiency and effectiveness. This included first-hand, on-call support and troubleshooting, as well as working together to build, scale, and optimize our graph.

In addition, Careem has a large volume of properties on the edges in their graph, which were previously not being used in the model’s training and predictions. Careem provided input on the development of a modified version of the RGCN architecture in Neptune ML that learns representations from edge properties as well, rather than from node properties alone as the traditional RGCN model does. Throughout this process, the Neptune ML team also worked on critical features that enabled Careem to train and optimize the graph at scale, including multi-GPU training, custom performance metrics, training instance size estimation, scalable and parallel processing, and custom hyperparameter tuning. All of these features are now available in the latest Neptune ML release, which became generally available in July 2021.

Looking to the future

Careem is currently working with the AWS team to build and train a deep learning model to more accurately detect fraud on their user identity graph. Testing results for the initial phase are promising so far, with a precision of around 85% and a recall of over 50%. In other words, the model correctly identifies over 50% of all users that have ever historically been blocked for fraud on the platform, and about 85% of the users it flags are indeed fraudulent (precision measures the share of flagged users that are truly fraudulent; recall measures the share of all fraudulent users that are flagged). All of this without knowing anything about the user’s transaction history, bookings, food and grocery orders, and other details; it relies only on data about their identity.

Work is now being done to deploy this trained model to production, allowing it to detect fraud in cases such as when a fraudster sets up a new account or compromises the account of an existing genuine user. This will all be done as users perform actions in real time.

In the future, Careem also plans to add Captains (what Careem’s drivers are known as) to the graph to similarly detect fraudulent Captains, or even fraudulent activity produced by collusion between users and Captains. To learn more about Amazon Neptune ML, visit the website.


About the Authors

Kevin O’Brien is a Senior Data Scientist at Careem. He is a member of the Integrity team, whose goal is to detect and prevent fraud on the platform, through data science and analytics. Kevin leads the Identity Risk squad of the Integrity team.

Waleed (Will) Badr is a Principal AI/ML Specialist Solutions Architect who works as part of the global Amazon Machine Learning team. Will has extensive experience in fraud detection and prevention systems and is passionate about using technology in innovative ways to positively impact the community.

Kamran Habib is a Senior Solutions Architect who works with our Digital Native Business (DNB) customers in the Middle East and North Africa (MENA) region. Kamran’s technical expertise focuses on containers, networking, and security, and he is passionate about solving customers’ business problems with innovative technical solutions. In his spare time, he enjoys travel, listening to podcasts, and cricket.

Read More

Announcing the winners of the 2021 People’s Expectations and Experiences with Digital Privacy RFP

In August, Meta, formerly known as Facebook, launched the 2021 People’s Expectations and Experiences with Digital Privacy request for proposals (RFP). Today, we’re announcing the winners of this award.
In 2020, we launched a similar research award opportunity in this space. We have continued this support for academics from across the social sciences and technical disciplines, empowering them to broaden and deepen our collective knowledge of global privacy expectations and experiences.

“Both this year’s and last year’s RFPs are about expanding research on two vital topics for the advancement of privacy science: privacy measurement and inclusive privacy,” said Meta Head of Privacy Research Liz Keneski in a Q&A about the RFP.

This year, areas of interest included the following:

  • Improving understanding of users’ privacy attitudes, concerns, preferences, needs, behaviors, and outcomes
  • Novel assessments of digital transparency and control that are meaningful for diverse populations, contexts, and data types

The RFP attracted 89 proposals from 74 universities and institutions around the world. Thank you to everyone who took the time to submit a proposal, and congratulations to the winners.

Research award winners

Principal investigators are listed first unless otherwise noted.

Addressing biases in measurement of self-reported privacy constructs
Heng Xu, Nan Zhang (American University, Washington, D.C.)

Designing digital privacy education interventions for older adults
Kaileigh Byrne, Bart Knijnenburg (Clemson University)

Exploring privacy concerns and policy negotiation strategies in smart homes
Chuan Yue (Colorado School of Mines)

Global south citizens’ privacy perceptions and management of targeted ads
Yang Wang (University of Illinois Urbana-Champaign)

Privacy regulations and consumer search for products and information
Pinar Yildirim, Yu Zhao (University of Pennsylvania), Pradeep Chintagunta (University of Chicago)

Understanding the digital privacy of the rural women in Bangladesh
Syed Ishtiaque Ahmed (University of Toronto), Mahdi Nasrullah Al-Ameen (Utah State University), Sharifa Sultana (Cornell University)

Understanding the roles of discrete emotions in privacy management
Hyunjin Kang (Nanyang Technological University), Jeeyun Oh (University of Texas at Austin)

Finalists

Designing “just-in-time” privacy nudges and education with and for teens
Pamela Wisniewski (University of Central Florida)

Developing tailored interventions to boost privacy control and resilience
Sophie Boerman, Joanna Strycharz (University of Amsterdam)

Digital payment system (DPS) privacy and user satisfaction in three countries
Raghav Rao, Oluwafemi Akanfe, Rohit Valecha (University of Texas at San Antonio)

Examining privacy concerns for interventions in online news systems
Matthew Louis Mauriello (University of Delaware)

Inclusive privacy design for vulnerable populations beyond W.E.I.R.D
Maryam Mustafa, Mobin Javed (Lahore University of Management Sciences)

Interactive image control: Addressing privacy concerns in public spaces
Maria D. Molina, Ruth Shillair (Michigan State University)

Privacy-preserving approaches to combating misinformation globally
Aditya Vashistha (Cornell University)

The socioeconomic & cultural influence: Adolescents’ perception of privacy
Kuskridho Ambardi, Adityo Hidayat, Anisa Pratita Kirana Mantovani, Dewa Ayu Diah Angendari, Faiz Rahman, Paska Bayu Darmawan (Universitas Gadjah Mada/Center for Digital Society)


Read More