November 2020 – Page 8

Applying MinDiff to Improve Model Fairness

Posted by Summer Misherghi and Thomas Greenspan, Software Engineers, Google Research

Last December, we open-sourced Fairness Indicators, a platform that enables sliced evaluation of machine learning model performance. This type of responsible evaluation is a crucial first step toward avoiding bias as it allows us to determine how our models are working for a wide variety of users. When we do identify that our model underperforms on certain slices of our data, we need a strategy to mitigate this to avoid creating or reinforcing unfair bias, in line with Google’s AI Principles.

Today, we’re announcing MinDiff, a technique for addressing unfair bias in machine learning models. Given two slices of data, MinDiff works by penalizing your model for differences in the distributions of scores between the two sets. As the model trains, it will try to minimize the penalty by bringing the distributions closer together. MinDiff is the first in what will ultimately be a larger Model Remediation Library of techniques, each suitable for different use cases. To learn about the research and theory behind MinDiff, please see our post on the Google AI Blog.

MinDiff Walkthrough

You can follow along and run the code yourself in this MinDiff notebook. In this walkthrough, we’ll emphasize important points in the notebook, while providing context on fairness evaluation and remediation.

In this example, we are training a text classifier to identify written content that could be considered “toxic.” For this task, our baseline model will be a simple Keras sequential model pre-trained on the Civil Comments dataset. Since this text classifier could be used to automatically moderate forums on the internet (for example, to flag potentially toxic comments), we want to ensure that it works well for everyone. You can read more about how fairness problems can arise in automated content moderation in this blog post.

To attempt to mitigate potential fairness concerns, we will:

Evaluate our baseline model’s performance on text containing references to sensitive groups.
Improve performance on any underperforming groups by training with MinDiff.
Evaluate the new model’s performance on our chosen metric.

Our purpose is to demonstrate usage of the MinDiff technique for you with a minimal workflow, not to lay out a complete approach to fairness in machine learning. Our evaluation will only focus on one sensitive category and a single metric. We also don’t address potential shortcomings in the dataset, nor tune our configurations.

In a production setting, you would want to approach each of these with more rigor. For example:

Consider the application space and the potential societal impact of your model; what are the implications of different types of model errors?
Consider additional categories for which underperformance might have fairness implications. Do you have sufficient examples for groups in each category?
Consider any privacy implications to storing the sensitive categories.
Consider any metric for which poor performance could translate into harmful outcomes.
Conduct thorough evaluation for all relevant metrics on multiple sensitive categories.
Experiment with the configuration of MinDiff by tuning hyperparameters to get optimal performance.

For the purpose of this blog post, we’ll skip building and training our baseline model, and jump right to evaluating its performance. We’ve used some utility functions to compute our metrics and we’re ready to visualize evaluation results (See “Render Evaluation Results” in the notebook):

 widget_view.render_fairness_indicator(eval_result)

Let’s look at the evaluation results. Try selecting the metric false positive rate (FPR) with threshold 0.450. We can see that the model does not perform as well for some religious groups as for others, displaying a much higher FPR. Note the wide confidence intervals on some groups because they have too few examples. This makes it difficult to say with certainty that there is a significant difference in performance for these slices. We may want to collect more examples to address this issue. We can, however, attempt to apply MinDiff for the two groups that we are confident are underperforming.

We’ve chosen to focus on FPR, because a higher FPR means that comments referencing these identity groups are more likely to be incorrectly flagged as toxic. This could lead to inequitable outcomes for users engaging in dialogue about religion, but note that disparities in other metrics can lead to other types of harm.

Now, we’ll try to improve the FPR for religious groups for which our model underperforms. We’ll attempt to do so using MinDiff, a remediation technique that seeks to balance error rates across slices of your data by penalizing disparities in performance during training. When we apply MinDiff, model performance may degrade slightly on other slices. As such, our goals with MinDiff will be to improve performance for underperforming groups, while sustaining strong performance for other groups and overall.

To use MinDiff, we create two additional data splits:

A split for non-toxic examples referencing minority groups: In our case, this will include comments with references to our underperforming identity terms. We don’t include some of the groups because there are too few examples, leading to higher uncertainty with wide confidence interval ranges.
A split for non-toxic examples referencing the majority group.

It’s important to have sufficient examples belonging to the underperforming classes. Based on your model architecture, data distribution, and MinDiff configuration, the amount of data needed can vary significantly. In past applications, we have seen MinDiff work well with at least 5,000 examples in each data split.

In our case, the groups in the minority splits have example quantities of 9,688 and 3,906. Note the class imbalances in the dataset; in practice, this could be cause for concern, but we won’t seek to address them in this notebook since our intention is just to demonstrate MinDiff.

We select only negative examples for these groups, so that MinDiff can optimize on getting these examples right. It may seem counterintuitive to carve out sets of ground truth negative examples if we’re primarily concerned with disparities in false positive rate, but remember that a false positive prediction is a ground truth negative example that’s incorrectly classified as positive, which is the issue we’re trying to address.

To prepare our data splits, we create masks for the sensitive & non-sensitive groups:

minority_mask = data_train.religion.apply(
  lambda x: any(religion in x for religion in ('jewish', 'muslim')))
majority_mask = data_train.religion.apply(
  lambda x: x == "['christian']")

Next, we select negative examples, so MinDiff will be able to reduce FPR for sensitive groups:

true_negative_mask = data_train['toxicity'] == 0
 
data_train_main = copy.copy(data_train)
data_train_sensitive = (
  data_train[minority_mask & true_negative_mask])
data_train_nonsensitive = (
  data_train[majority_mask & true_negative_mask])

To start training with MinDiff, we need to convert our data to TensorFlow Datasets (not shown here — see “Create MinDiff Datasets” in the notebook for details). Don’t forget to batch your data for training. In our case, we set the batch sizes to the same value as the original dataset but this is not a requirement and in practice should be tuned.

dataset_train_sensitive = dataset_train_sensitive.batch(BATCH_SIZE)
dataset_train_nonsensitive = (
  dataset_train_nonsensitive.batch(BATCH_SIZE))

Once we have prepared our three datasets, we merge them into one MinDiff dataset using a util function provided in the library.

min_diff_dataset = md.keras.utils.pack_min_diff_data(
  dataset_train_main, 
  dataset_train_sensitive,
  dataset_train_nonsensitive)

To train with MinDiff, simply take the original model and wrap it in a MinDiffModel with a corresponding `loss` and `loss_weight`. We are using 1.5 as the default `loss_weight`, but this is a parameter that needs to be tuned for your use case, since it depends on your model and product requirements. You should experiment with changing the value to see how it impacts the model, noting that increasing it pushes the performance of the minority and majority groups closer together but may come with more pronounced tradeoffs.

As specified above, we create the original model, and wrap it in a MinDiffModel. We pass in one of the MinDiff losses and use a moderately high weight of 1.5.

original_model = ...  # Same structure as used for baseline model. 
 
min_diff_loss = md.losses.MMDLoss() 
min_diff_weight = 1.5
min_diff_model = md.keras.MinDiffModel(
  original_model, min_diff_loss, min_diff_weight)

After wrapping the original model, we compile the model as usual. This means using the same loss as for the baseline model:

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss = tf.keras.losses.BinaryCrossentropy()
min_diff_model.compile(
  optimizer=optimizer, loss=loss, metrics=['accuracy'])

We fit the model to train on the MinDiff dataset, and save the original model to evaluate (see API documentation for details on why we don’t save the MinDiff model).

min_diff_model.fit(min_diff_dataset, epochs=20)

min_diff_model.save_original_model(
    min_diff_model_location, save_format='tf')

Finally, we evaluate the new results.

min_diff_eval_subdir = 'eval_results_min_diff'
min_diff_eval_result = util.get_eval_results(
  min_diff_model_location, base_dir, min_diff_eval_subdir,
  validate_tfrecord_file, slice_selection='religion')

To ensure we evaluate a new model correctly, we need to select a threshold the same way that we would the baseline model. In a production setting, this would mean ensuring that evaluation metrics meet launch standards. In our case, we will pick the threshold that results in a similar overall FPR to the baseline model. This threshold may be different from the one you selected for the baseline model. Try selecting false positive rate with threshold 0.400. (Note that the subgroups with very low quantity examples have very wide confidence range intervals and don’t have predictable results.)

 widget_view.render_fairness_indicator(min_diff_eval_result)

Note: The scale of the y-axis has changed from .04 in the graph for the baseline model to .02 for our MinDiff model

Reviewing these results, you may notice that the FPRs for our target groups have improved. The gap between our lowest performing group and the majority group has improved from .024 to .006. Given the improvements we’ve observed and the continued strong performance for the majority group, we’ve satisfied both of our goals. Depending on the product, further improvements may be necessary, but this approach has gotten our model one step closer to performing equitably for all users.

To get a better sense of scale, we superimposed the MinDiff model on top of the base model.

You can get started with MinDiff by visiting the MinDiff page on tensorflow.org. More information about the research behind MinDiff is available in our post on the Google AI Blog. You can also learn more about evaluating for fairness in this guide.

Acknowledgements

The MinDiff framework was developed in collaboration with Thomas Greenspan, Summer Misherghi, Sean O’Keefe‎, Christina Greer, Catherina Xu‎, Manasi Joshi, Dan Nanas, Nick Blumm, Jilin Chen, Zhe Zhao, James Chen, Maciej Kula, Lichan Hong, Mahesh Sathiamoorthy. This research effort on ML Fairness in classification was jointly led by (in alphabetical order) Alex Beutel, Ed H. Chi, Flavien Prost, Hai Qian, Jilin Chen, Shuo Chen, and Tulsee Doshi. Further, this work was pursued in collaboration with Christine Luu, Jonathan Bischof, Pierre Kreitmann, and Qiuwen Chen.

Training on Test Inputs with Amortized Conditional Normalized Maximum Likelihood

Current machine learning methods provide unprecedented accuracy across a range
of domains, from computer vision to natural language processing. However, in
many important high-stakes applications, such as medical diagnosis or
autonomous driving, rare mistakes can be extremely costly, and thus effective
deployment of learned models requires not only high accuracy, but also a way to
measure the certainty in a model’s predictions. Reliable uncertainty
quantification is especially important when faced with out-of-distribution
inputs, as model accuracy tends to degrade heavily on inputs that differ
significantly from those seen during training. In this blog post, we will
discuss how we can get reliable uncertainty estimation with a strategy that
does not simply rely on a learned model to extrapolate to out-of-distribution
inputs, but instead asks: “given my training data, which labels would make
sense for this input?”.

Stanford AI Lab Papers and Talks at CoRL 2020

The Conference on Robot Learning (CoRL) 2020 is being hosted virtually from November 16th – November 18th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Learning 3D Dynamic Scene Representations for Robot Manipulation

Authors: Zhenjia Xu, Zhanpeng He, Jiajun Wu, Shuran Song

Contact: jiajunwu@cs.stanford.edu

Links: Paper | Video | Website

Keywords: scene representations, 3d perception, robot manipulation

Learning Latent Representations to Influence Multi-Agent Interaction

Authors: Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh

Contact: anniexie@stanford.edu

Links: Paper | Blog Post | Website

Keywords: multi-agent systems, human-robot interaction, reinforcement learning

Learning Object-conditioned Exploration using Distributed Soft Actor Critic

Authors: Ayzaan Wahid (Google), Austin Stone (Google), Brian Ichter (Google Brain), Kevin Chen (Stanford), Alexander Toshev (Google)

Contact: ayzaan@google.com

Links: Paper

Keywords: object navigation, visual navigation

MATS: An Interpretable Trajectory Forecasting Representation for Planning and Control

Authors: Boris Ivanovic, Amine Elhafsi, Guy Rosman, Adrien Gaidon, Marco Pavone

Contact: borisi@stanford.edu

Links: Paper | Video

Keywords: trajectory forecasting, learning dynamical systems, motion planning, autonomous vehicles

Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous

Authors: Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust

Contact: rewang@stanford.edu

Links: Paper | Video | Website

Keywords: multiagent systems; model-based reinforcement learning

Reinforcement Learning with Videos: Combining Offline Observations with Interaction

Authors: Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn

Contact: karls@seas.upenn.edu

Links: Paper | Website

Keywords: reinforcement learning, learning from observation

Sampling-based Reachability Analysis: A Random Set Theory Approach with Adversarial Sampling

Authors: Thomas Lew, Marco Pavone

Contact: thomas.lew@stanford.edu

Links: Paper

Keywords: reachability analysis, robust planning and control, neural networks

Keynote

Walking the Boundary of Learning and Interaction (Dorsa Sadigh)

Overview: There have been significant advances in the field of robot learning in the past decade. However, many challenges still remain when considering how robot learning can advance interactive agents such as robots that collaborate with humans. This includes autonomous vehicles that interact with human-driven vehicles or pedestrians, service robots collaborating with their users at homes over short or long periods of time, or assistive robots helping patients with disabilities. This introduces an opportunity for developing new robot learning algorithms that can help advance interactive autonomy.

In this talk, I will discuss a formalism for human-robot interaction built upon ideas from representation learning. Specifically, I will first discuss the notion of latent strategies— low dimensional representations sufficient for capturing non-stationary interactions. I will then talk about the challenges of learning such representations when interacting with humans, and how we can develop data-efficient techniques that enable actively learning computational models of human behavior from demonstrations, preferences, or physical corrections. Finally, I will introduce an intuitive controlling paradigm that enables seamless collaboration based on learned representations, and further discuss how that can be used for further influencing humans.

Live Event: November 17th, 7:00AM – 7:45AM PST

We look forward to seeing you at CoRL!

As voice agents proliferate, EMNLP broadens its scope

Amazon Scholar Julia Hirschberg on why speech understanding and natural-language understanding are intertwined.Read More

Stanford AI Lab Papers and Talks at EMNLP 2020

The Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020 is being hosted virtually from November 16th – November 20th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

Main Conference

Pre-Training Transformers as Energy-Based Cloze Models

Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Contact: kevclark@cs.stanford.edu

Links: Paper

Keywords: representation learning, self-supervised learning, energy-based models

ALICE: Active Learning with Contrastive Natural Language Explanations

Authors: Weixin Liang, James Zou, Zhou Yu

Contact: wxliang@stanford.edu

Links: Paper

Keywords: natural language explanation, class-based active learning, contrastive explanation

CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT

Authors: Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, Matthew P. Lungren

Contact: akshaysm@stanford.edu

Links: Paper

Keywords: bert, natural language processing, radiology, medical imaging, deep learning

AutoQA: From Databases To QA Semantic Parsers With Only Synthetic Training Data

Authors: Silei Xu, Sina J. Semnani, Giovanni Campagna, Monica S. Lam

Contact: silei@cs.stanford.edu

Links: Paper

Keywords: question answering, semantic parsing, language models, synthetic training data, data augmentation

Data and Representation for Turkish Natural Language Inference

Authors: Emrah Budur, Rıza Özçelik, Tunga Güngör, Christopher Potts

Contact: emrah.budur@boun.edu.tr

Links: Paper | Website

Keywords: sentence-level semantics, natural language inference, neural machine translation, morphologically rich language

Intrinsic Evaluation of Summarization Datasets

Authors: Rishi Bommasani, Claire Cardie

Contact: nlprishi@stanford.edu

Links: Paper | Video | Website | Virtual Conference Room

Keywords: summarization, datasets, evaluation

Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models

Authors: Isabel Papadimitriou, Dan Jurafsky

Contact: isabelvp@stanford.edu

Links: Paper

Keywords: transfer learning, analysis, music, hierarchical structure

Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation

Authors: Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam

Contact: mehrad@cs.stanford.edu

Links: Paper | Website

Keywords: machine translation, semantic parsing, localization

SLM: Learning a Discourse Language Representation with Sentence Unshuffling

Authors: Haejun Lee, Drew A. Hudson, Kangwook Lee, Christopher D. Manning

Contact: dorarad@stanford.edu

Links: Paper

Keywords: transformer, bert, language, understanding, nlp, squad, glue, sentences, discourse

Utility is in the Eye of the User: A Critique of NLP Leaderboards

Authors: Kawin Ethayarajh, Dan Jurafsky

Contact: kawin@stanford.edu

Links: Paper | Website

Keywords: nlp, leaderboard, utility, benchmark, fairness, efficiency

With Little Power Comes Great Responsibility

Authors: Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky

Contact: dcard@stanford.edu

Links: Paper | Website

Keywords: statistical power, experimental methodology, leaderboards, machine translation, human evaluation

Findings of EMNLP

DeSMOG: Detecting Stance in Media On Global Warming

Authors: Yiwei Luo, Dallas Card, Dan Jurafsky

Contact: yiweil@stanford.edu

Links: Paper | Website

Keywords: computational social science; framing; argumentation; stance; bias; climate change

Investigating Transferability in Pretrained Language Models

Authors: Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman

Contact: atamkin@stanford.edu

Links: Paper | Website | Virtual Conference Room

Keywords: finetuning, transfer learning, language models, bert, probing

Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations

Authors: Peng Qi, Yuhao Zhang, Christopher D. Manning

Contact: pengqi@cs.stanford.edu

Links: Paper | Blog Post

Keywords: conversational agents, question generation, natural language generation

Do Language Embeddings Capture Scales?

Authors: Xikun Zhang*, Deepak Ramachandran*, Ian Tenney, Yanai Elazar, Dan Roth

Contact: xikunz2@cs.stanford.edu

Links: Paper | Virtual Conference Room

Keywords: probing, analysis, bertology, scales, common sense knowledge

On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Authors: Stephen Mussmann, Robin Jia, Percy Liang

Contact: robinjia@cs.stanford.edu

Links: Paper | Website

Keywords: active learning, robustness, label imbalance

Pragmatic Issue-Sensitive Image Captioning

Authors: Allen Nie, Reuben Cohn-Gordon, Christopher Potts

Contact: anie@stanford.edu

Links: Paper | Video

Keywords: controllable caption generation, question under discussion, discourse, pragmatics

Workshops and Co-Located Conferences

BLEU Neighbors: A Reference-less Approach to Automatic Evaluation

Authors: Kawin Ethayarajh, Dorsa Sadigh

Contact: kawin@stanford.edu

Links: Paper | Website

Keywords: nlp, bleu, evaluation, nearest neighbors, dialogue

Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning

Authors: Rachel Gardner, Maya Varma, Clare Zhu, Ranjay Krishna

Contact: rachel0@cs.stanford.edu

Links: Paper

Keywords: noisy text, bert, plausibility, multi-task learning

Explaining the ‘Trump Gap’ in Social Distancing Using COVID Discourse

Authors: Austin van Loon, Sheridan Stewart, Brandon Waldon, Shrinidhi K. Lakshmikanth, Ishan Shah, Sharath Chandra Guntuku, Garrick Sherman, James Zou, Johannes Eichstaedt

Contact: avanloon@stanford.edu

Links: Paper

Keywords: computational social science, social distancing, word2vec, vector semantics, twitter, bert

Learning Adaptive Language Interfaces through Decomposition

Authors: Siddharth Karamcheti, Dorsa Sadigh, Percy Liang

Contact: skaramcheti@cs.stanford.edu

Links: Paper | Virtual Conference Room

Keywords: semantic parsing, interaction, decomposition

Modeling Subjective Assessments of Guilt in Newspaper Crime Narratives

Authors: Elisa Kreiss*, Zijian Wang*, Christopher Potts

Contact: ekreiss@stanford.edu

Links: Paper | Website

Keywords: psycholinguistics, pragmatics, token-level supervision, model attribution, news, guilt, hedges, corpus, subjectivity

Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

Authors: Atticus Geiger, Kyle Richardson, Chris Potts

Contact: atticusg@stanford.edu

Links: Paper | Website

Keywords: entailment intervention causality systematic generalization

Structured Self-Attention Weights Encode Semantics in Sentiment Analysis

Authors: Zhengxuan Wu, Thanh-Son Nguyen, Desmond C. Ong

Contact: wuzhengx@stanford.edu

Links: Paper

Keywords: attention, explainability, sentiment analysis

We look forward to seeing you at EMNLP 2020!

Learning to Influence Multi-Agent Interaction

Interaction with others is an important part of everyday life. No matter
the situation – whether it be playing a game of chess, carrying a
box together, or navigating lanes of traffic – we’re able to
seamlessly compete against, collaborate with, and acclimate to other
people.

Likewise, as robots become increasingly prevalent and capable, their
interaction with humans and other robots is inevitable. However, despite
the many advances in robot learning, most current algorithms are
designed for robots that act in isolation. These methods miss out on the
fact that other agents are also learning and changing – and so the
behavior the robot learns for the current interaction may not work
during the next one! Instead, can robots learn to seamlessly interact
with humans and other robots by taking their changing strategies into
account? In our new work (paper,
website), we
begin to investigate this question.

A standard reinforcement learning agent (left) based on Soft
Actor-Critic (**SAC**) assumes that
the opponent (right) follows a fixed strategy, and only blocks on its
left side.

Interactions with humans are difficult for robots because humans and
other intelligent agents don’t have fixed behavior – their
strategies and habits change over time. In other words, they update
their actions in response to the robot and thus continually change the
robot’s learning environment. Consider the robot on the left (the agent)
learning to play air hockey against the non-stationary robot on the
right. Rather than hitting the same shot every time, the other robot
modifies its policy between interactions to exploit the agent’s
weaknesses. If the agent ignores how the other robot changes, then it
will fail to adapt accordingly and learn a poor policy.

The best defense for the agent is to block where it thinks the opponent
will next target. The robot therefore needs to anticipate how the
behavior of the other agent will change, and model how its own actions
affect the other’s behavior. People can deal with these scenarios on a
daily basis (e.g., driving, walking), and they do so without explicitly
modeling every low-level aspect of each other’s policy.

Humans tend to be bounded-rational (i.e., their rationality is limited
by knowledge and computational capacity), and so likely keep track of
much less complex entities during interaction. Inspired by how humans
solve these problems, we recognize that robots also do not need to
explicitly model every low-level action another agent will make.
Instead, we can capture the hidden, underlying intent – what we call
latent strategy (in the sense that it underlies the actions of the
agent) – of other agents through learned low-dimensional
representations. These representations are learned by optimizing neural
networks based on experience interacting with these other agents.

Learning and Influencing Latent Intent

We propose a framework for learning latent representations of another
agent’s policy: Learning and Influencing Latent Intent (LILI). The
agent of our framework identifies the relationship between its behavior
and the other agent’s future strategy, and then leverages these latent
dynamics to influence the other agent, purposely guiding them towards
policies suitable for co-adaptation. At a high level, the robot learns
two things: a way to predict latent strategy, and a policy for
responding to that strategy. The robot learns these during interaction
by “thinking back” to prior experiences, and figuring out what
strategies and policies it should have used.

Modeling Agent Strategies

The first step, shown in the left side of the diagram above, is to learn
to represent the behavior of other agents. Many prior works assume
access to the underlying intentions or actions of other agents, which
can be a restrictive assumption. We instead recognize that a
low-dimensional representation of their behavior, i.e., their latent
strategy, can be inferred from the dynamics and rewards experienced by
the agent during the current interaction. Therefore, given a sequence of
interactions, we can train an
encoder-decoder
model; the encoder embeds interaction and predicts the next
latent strategy , and the decoder takes this prediction
and reconstructs the transitions and rewards observed during interaction
.

Influencing by Optimizing for Long-Term Rewards

Given a prediction of what strategy the other agent will follow next,
the agent can learn how to react to it, as illustrated on the right
side of the diagram above. Specifically, we train an agent policy
with reinforcement learning (RL) to
make decisions conditioned on the latent strategy predicted
by the encoder.

However, beyond simply reacting to the predicted latent strategy, an
intelligent agent should proactively influence this strategy to
maximize rewards over repeated interactions. Returning to our hockey
example, consider an opponent with three different strategies: it fires
to the left, down the middle, or to the right. Moreover, left-side shots
are easier for the agent to block and so gives a higher reward when
successfully blocked. The agent should influence its opponent to adopt
the left strategy more frequently in order to earn higher long-term
rewards.

For learning this influential behavior, we train the agent policy
to maximize rewards across multiple interactions:

With this objective, the agent learns to generate interactions that
influence the other agent, and hence the system, toward outcomes that
are more desirable for the agent or for the team as a whole.

Experiments

2D Navigation

We first consider a simple point mass navigation task. Similar to
pursuit-evasion games, the agent needs to reach the other agent (i.e.,
the target) in a 2D plane. This target moves one step clockwise or
counterclockwise around a circle depending on where the agent ended the
previous interaction. Because the agent starts off-center, some target
locations can be reached more efficiently than others. Importantly, the
agent never observes the location of the target.

Below, we visualize 25 consecutive interactions from policies learned by
Soft Actor-Critic (SAC) (a standard RL algorithm), LILI (no influence),
and LILI. LILI (no influence) corresponds to our approach without the
influencing objective; i.e., the agent optimizes rewards accumulated in
a single interaction. The gray circle represents the target, while the
teal line marks the trajectory taken by the agent and the teal circle
marks the agent’s position at the final timestep of the interaction.

The SAC policy, at convergence, moves to the center of the circle in
every interaction. Without knowledge of or any mechanism to infer where
the other agent is, the center of the circle gives the highest stable
rewards. In contrast, LILI (no influence) successfully models the other
agent’s behavior dynamics and correctly navigates to the other agent,
but isn’t trained to influence the other agent. Our full approach LILI
does learn to influence: it traps the other agent at the top of the
circle, where the other agent is closest to the agent’s starting
position and yields the highest rewards.

Robotic Air Hockey

Next, we evaluate our approach on the air hockey task, played between
two robotic agents. The agent first learns alongside a robot opponent,
then plays against a human opponent. The opponent is a rule-based agent
which always aims away from where the agent last blocked. When blocking,
the robot does not know where the opponent is aiming, and only observes
the vertical position of the puck. We additionally give the robot a
bonus reward if it blocks a shot on the left of the board, which
incentivizes the agent to influence the opponent into aiming left.

In contrast to the SAC agent, the LILI agent learns to anticipate
the opponent’s future strategies and successfully block the different
incoming shots.

Because the agent receives a bonus reward for blocking left, it should
lead the opponent into firing left more often. LILI (no influence) fails
to guide the opponent into taking advantage of this bonus: the
distribution over the opponent’s strategies is uniform. In contrast,
LILI leads the opponent to strike left 41% of the time, demonstrating
the agent’s ability to influence the opponent. Specifically, the agent
manipulates the opponent into alternating between the left and middle
strategies.

Finally, we test the policy learned by LILI (no influence) against a
human player following the same strategy pattern as the robot opponent.
Importantly, the human has imperfect aim and so introduces new noise to
the environment. We originally intended to test our approach LILI with
human opponents, but we found that – although LILI worked well when
playing against another robot – the learned policy was too brittle
and did not generalize to playing alongside human opponents. However,
the policy learned with LILI (no influence) was able to block 73% of
shots from the human.

Final Thoughts

We proposed a framework for multi-agent interaction that represents the
behavior of other agents with learned high-level strategies, and
incorporates these strategies into an RL algorithm. Robots with our
approach were able to anticipate how their behavior would affect another
agent’s latent strategy, and actively influenced that agent for more
seamless co-adaptation.

Our work represents a step towards building robots that act alongside
humans and other agents. To this end, we’re excited about these next
steps:

The agents we examined in our experiments had a small number of simple strategies determining their behavior. We’d like to study the scalability of our approach to more complex agent strategies that we’re likely to see in humans and intelligent agents.
Instead of training alongside artificial agents, we hope to study the human-in-the-loop setting in order to adapt to the dynamic needs and preferences of real people.

This post is based on the following paper:

Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh.
Learning Latent Representations for Multi-Agent Interaction.
Project webpage

Finally, thanks to Dylan Losey, Chelsea Finn, Dorsa Sadigh, Andrey Kurenkov, and Michelle Lee for valuable feedback on this post.

Predicting qualification ranking based on practice session performance for Formula 1 Grand Prix

If you’re a Formula 1 (F1) fan, have you ever wondered why F1 teams have very different performances between qualifying and practice sessions? Why do they have multiple practice sessions in the first place? Can practice session results actually tell something about the upcoming qualifying race? In this post, we answer these questions and more. We show you how we can predict qualifying results based on practice session performances by harnessing the power of data and machine learning (ML). These predictions are being integrated into the new “Qualifying Pace” insight for each F1 Grand Prix (GP). This work is part of the continuous collaboration between F1 and the Amazon ML Solutions Lab to generate new F1 Insights powered by AWS.

Each F1 GP consists of several stages. The event starts with three practice sessions (P1, P2, and P3), followed by a qualifying (Q) session, and then the final race. Teams approach practice and qualifying sessions differently because these sessions serve different purposes. The practice sessions are the teams’ opportunities to test out strategies and tire compounds to gather critical data in preparation for the final race. They observe the car’s performance with different strategies and tire compounds, and use this to determine their overall race strategy.

In contrast, qualifying sessions determine the starting position of each driver on race day. Teams focus solely on obtaining the fastest lap time. Because of this shift in tactics, Friday and Saturday practice session results often fail to accurately predict the qualifying order.

In this post, we introduce deterministic and probabilistic methods to model the time difference between the fastest lap time in practice sessions and the qualifying session (∆t = t_q-t_p). The goal is to more accurately predict the upcoming qualifying standings based on the practice sessions.

Error sources of ∆t

The delta of the fastest lap time between practice and qualifying sessions (∆t) comes primarily from variations in fuel level and tire grip.

A higher fuel level adds weight to the car and reduces the speed of the car. For practice sessions, teams vary the fuel level as they please. For the second practice session (P2), it’s common to begin with a low fuel level and run with more fuel in the latter part of the session. During qualifying, teams use minimal fuel levels in order to record the fastest lap time. The impact of fuel on lap time varies from circuit to circuit, depending on how many straights the circuit has and how long these straights are.

Tires also play a significant role in an F1 car’s performance. During each GP event, the tire supplier brings various tire types with varying compounds suitable for different racing conditions. Two of these are for wet circuit conditions: intermediate tires for light standing water and wet tires for heavy standing water. The remaining dry running tires can be categorized into three compound types: hard, medium, and soft. These tire compounds provide different grips to the circuit surface. The more grip the tire provides, the faster the car can run.

Past racing results showed that car performance dropped significantly when wet tires were used. For example, in the 2018 Italy GP, because the P1 session was wet and the qualifying session was dry, the fastest lap time in P1 was more than 10 seconds slower than the qualifying session.

Among the dry running types, the hard tire provides the least grip but is the most durable, whereas the soft tire has the most grip but is the least durable. Tires degrade over the course of a race, which reduces the tire grip and slows down the car. Track temperature and moisture affects the progression of degradation, which in turn changes the tire grip. As in the case with fuel level, tire impact on lap time changes from circuit to circuit.

Data and attempted approaches

Given this understanding of factors that can impact lap time, we can use fuel level and tire grip data to estimate the final qualifying lap time based on known practice session performance. However, as of this writing, data records to directly infer fuel level and tire grip during the race are not available. Therefore, we take an alternative approach with data we can currently obtain.

The data we used in the modeling were records of fastest lap times for each GP since 1950 and partial years of weather data for the corresponding sessions. The lap times data included the fastest lap time for each session (P1, P2, P3, and Q) of each GP with the driver, car and team, and circuit name (publicly available on F1’s website). Track wetness and temperature for each corresponding session was available in the weather data.

We explored two implicit methods with the following model inputs: the team and driver name, and the circuit name. Method one was a rule-based empirical model that attributed observed to circuits and teams. We estimated the latent parameter values (fuel level and tire grip differences specific to each team and circuit) based on their known lap time sensitivities. These sensitivities were provided by F1 and calculated through simulation runs on each circuit track. Method two was a regression model with driver and circuit indicators. The regression model learned the sensitivity of ∆t for each driver on each circuit without explicitly knowing the fuel level and tire grip exerted. We developed and compared deterministic models using XGBoost and AutoGluon, and probabilistic models using PyMC3.

We built models using race data from 2014 to 2019, and tested against race data from 2020. We excluded data from before 2014 because there were significant car development and regulation changes over the years. We removed races in which either the practice or qualifying session was wet because ∆t for those sessions were considered outliers.

Managed model training with Amazon SageMaker

We trained our regression models on Amazon SageMaker.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. Specifically for model training, it provides many features to assist with the process.

For our use case, we explored multiple iterations on the choices of model feature sets and hyperparameters. Recording and comparing the model metrics of interest was critical to choosing the most suitable model. The Amazon SageMaker API allowed customized metrics definition prior to launching a model training job, and easy retrieval after the training job was complete. Using the automatic model tuning feature reduced the mean squared error (MSE) metric on the test data by 45% compared to the default hyperparameter choice.

We trained an XGBoost model using the Amazon SageMaker’s built-in implementation. Its built-in implementation allowed us to run model training through a general estimator interface. This approach provided better logging, superior hyperparameter validation, and a larger set of metrics than the original implementation.

Rule-based model

In the rule-based approach, we reason that the differences of lap times ∆t primarily come from systematic variations of tire grip for each circuit and fuel level for each team between practice and qualifying sessions. After accounting for these known variations, we assume residuals are random small numbers with a mean of zero. ∆t can be modeled with the following equation:

∆t_f(c) and ∆t_g(c) are known sensitivities of fuel mass and tire grip, and is the residual. A hierarchy exists among the factors contained in the equation. We assume grip variations for each circuit (g(c)) are at the top level. Under each circuit, there are variations of fuel level across teams (f(t,c)).

To further simplify the model, we neglect because we assume it is small. We further assume fuel variation for each team across all circuits is the same (i.e., f(t,c) = f(t)). We can simplify the model to the following:

Because ∆t_f(c) and ∆t_g(c) are known, f(t) and g(c), we can estimate team fuel variations and tire grip variations from the data.

The differences in the sensitivities depend on the characteristics of circuits. From the following track maps, we can observe that the Italian GP circuit has fewer corner turns and the straight sections are longer compared to the Singapore GP circuit. Additional tire grip gives a larger advantage in the Singapore GP circuit.

ML regression model

For the ML regression method, we don’t directly model the relation between and fuel level and grip variations. Instead, we fit the following regression model with just the circuit, team, and driver indicator variables:

I_c, I_t, and I_d represent the indicator variables for circuits, teams, and drivers.

Hierarchical Bayesian model

Another challenge with modeling the race pace was due to noisy measurements in lap times. The magnitude of random effect (ϵ) of ∆t could be non-negligible. Such randomness might come from drivers’ accidental drift from their normal practice at the turns or random variations of drivers’ efforts during practice sessions. With deterministic approaches, such random effect wasn’t appropriately captured. Ideally, we wanted a model that could quantify uncertainty about the predictions. Therefore, we explored Bayesian sampling methods.

With a hierarchical Bayesian model, we account for the hierarchical structure of the error sources. As with the rule-based model, we assume grip variations for each circuit (g(c))) are at the top level. The additional benefit of a hierarchical Bayesian model is that it incorporates individual-level variations when estimating group-level coefficients. It’s a middle ground between two extreme views of data. One extreme is to pool data for every group (circuit and driver) without considering the intrinsic variations among groups. The other extreme is to train a regression model for each circuit or driver. With 21 circuits, this amounts to 21 regression models. With a hierarchical model, we have a single model that considers the variations simultaneously at the group and individual level.

We can mathematically describe the underlying statistical model for the hierarchical Bayesian approach as the following varying intercepts model:

Here, i represents the index of each data observation, j represents the index of each driver, and k represents the index of each circuit. μ_jk represents the varying intercept for each driver under each circuit, and θ_k represents the varying intercept for each circuit. w_p and w_q represent the wetness level of the track during practice and qualifying sessions, and ∆T represents the track temperature difference.

Test models in the 2020 races

After predicting ∆t, we added it into the practice lap times to generate predictions of qualifying lap times. We determined the final ranking based on the predicted qualifying lap times. Finally, we compared predicted lap times and rankings with the actual results.

The following figure compares the predicted rankings and the actual rankings for all three practice sessions for the Austria, Hungary, and Great Britain GPs in 2020 (we exclude P2 for the Hungary GP because the session was wet).

For the Bayesian model, we generated predictions with an uncertainty range based on the posterior samples. This enabled us to predict the ranking of the drivers relatively with the median while accounting for unexpected outcomes in the drivers’ performances.

The following figure shows an example of predicted qualifying lap times (in seconds) with an uncertainty range for selected drivers at the Austria GP. If two drivers’ prediction profiles are very close (such as MAG and GIO), it’s not surprising that either driver might be the faster one in the upcoming qualifying session.

Metrics on model performance

To compare the models, we used mean squared error (MSE) and mean absolute error (MAE) for lap time errors. For ranking errors, we used rank discounted cumulative gain (RDCG). Because only the top 10 drivers gain points during a race, we used RDCG to apply more weight to errors in the higher rankings. For the Bayesian model output, we used median posterior value to generate the metrics.

The following table shows the resulting metrics of each modeling approach for the test P2 and P3 sessions. The best model by each metric for each session is highlighted.

MODEL	MSE		MAE		RDCG
	P2	P3	P2	P3	P2	P3
Practice raw	2.822	1.053	1.544	0.949	0.92	0.95
Rule-based	0.349	0.186	0.462	0.346	0.88	0.95
XGBoost	0.358	0.141	0.472	0.297	0.91	0.95
AutoGluon	0.567	0.351	0.591	0.459	0.90	0.96
Hierarchical Bayesian	0.431	0.186	0.521	0.332	0.87	0.92

All models reduced the qualifying lap time prediction errors significantly compared to directly using the practice session results. Using practice lap times directly without considering pace correction, the MSE on the predicted qualifying lap time was up to 2.8 seconds. With machine learning methods which automatically learned pace variation patterns for teams and drivers on different circuits, we brought the MSE down to smaller than half a second. The resulting prediction was a more accurate representation of the pace in the qualifying session. In addition, the models improved the prediction of rankings by a small margin. However, there was no one single approach that outperformed all others. This observation highlighted the effect of random errors on the underlying data.

Summary

In this post, we described a new Insight developed by the Amazon ML Solutions Lab in collaboration with Formula 1 (F1).

This work is part of the six new F1 Insights powered by AWS that are being released in 2020, as F1 continues to use AWS for advanced data processing and ML modeling. Fans can expect to see this new Insight unveiled at the 2020 Turkish GP to provide predictions for the upcoming qualifying races at practice sessions.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab .

About the Author

Guang Yang is a data scientist at the Amazon ML Solutions Lab where he works with customers across various verticals and applies creative problem solving to generate value for customers with state-of-the-art ML/AI solutions.

Amazon Textract recognizes handwriting and adds five new languages

Documents are a primary tool for communication, collaboration, record keeping, and transactions across industries, including financial, medical, legal, and real estate. The format of data can pose an extra challenge in data extraction, especially if the content is typed, handwritten, or embedded in a form or table. Furthermore, extracting data from your documents is manual, error-prone, time-consuming, expensive, and does not scale. Amazon Textract is a machine learning (ML) service that extracts printed text and other data from documents as well as tables and forms.

We’re pleased to announce two new features for Amazon Textract: support for handwriting in English documents, and expanding language support for extracting printed text from documents typed in Spanish, Portuguese, French, German, and Italian.

Handwriting recognition with Amazon Textract

Many documents, such as medical intake forms or employment applications, contain both handwritten and printed text. The ability to extract text and handwriting has been a need our customers have asked us for. Amazon Textract can now extract printed text and handwriting from documents written in English with high confidence scores, whether it’s free-form text or text embedded in tables and forms. Documents can also contain a mix of typed text or handwritten text.

The following image shows an example input document containing a mix of typed and handwritten text, and its converted output document.

You can log in to the Amazon Textract console to test out the handwriting feature, or check out the new demo by Amazon Machine Learning Hero Mike Chambers.

Not only can you upload documents with both printed text and handwriting, you can also use Amazon Augmented AI (Amazon A2I), which makes it easy to build workflows for a human review of the ML predictions. Adding in Amazon A2I can help you get to market faster by having your employees or AWS Marketplace contractors review the Amazon Textract output for sensitive workloads. For more information about implementing a human review, see Using Amazon Textract with Amazon Augmented AI for processing critical documents. If you want to use one of our AWS Partners, take a look at how Quantiphi is using handwriting recognition for their customers.

Additionally, we’re pleased to announce our language expansion. Customers can now extract and process documents in more languages.

New supported languages in Amazon Textract

Amazon Textract now supports processing printed documents in Spanish, German, Italian, French, and Portuguese. You can send documents in these languages, including forms and tables, for data and text extraction, and Amazon Textract automatically detects and extracts the information for you. You can simply upload the documents on the Amazon Textract console or send them using either the AWS Command Line Interface (AWS CLI) or AWS SDKs.

AWS customer success stories

AWS customers like yourself are always looking for ways to overcome document processing. In this section, we share what our customers are saying about Amazon Textract.

Intuit

Intuit is a provider of innovative financial management solutions, including TurboTax and QuickBooks, to approximately 50 million customers worldwide.

“Intuit’s document understanding technology uses AI to eliminate manual data entry for our consumer, small business, and self-employed customers. For millions of Americans who rely on TurboTax every year, this technology simplifies tax filing by saving them from the tedious, time-consuming task of entering data from financial documents. Textract is an important element of Intuit’s document understanding capability, improving data extraction accuracy by analyzing text in the context of complex financial forms.”

– Krithika Swaminathan, VP of AI, Intuit

Veeva

Veeva helps cosmetics, consumer goods and chemical companies bring innovative, high quality products to market faster without compromising compliance.

“Our customers are processing millions of documents per year and have a critical need to extract the information stored within the documents to make meaningful business decisions. Many of our customers are multinational organizations which means the documents are submitted in various languages like Spanish or Portuguese. Our recent partnership with AWS allowed us early access to Amazon Textract’s new feature that supports additional languages like Spanish, and Portuguese. This partnership with Textract has been key to work closely, iterate and deliver exceptional solutions to our customers.”

– Ali Alemdar, Sr Product Manager, Veeva Industries

Baker Tilly

Baker Tilly is a leading advisory, tax and assurance firm dedicated to building long-lasting relationships and helping customers with their most pressing problems — and enabling them to create new opportunities.

“Across all industries, forms are one of the most popular ways of collecting data. Manual efforts can take hours or days to “read” through digital forms. Leveraging Amazon Textract’s Optical Character Recognition (OCR) technology we can now read through these digital forms quicker and effortlessly. We now leverage handwriting as part of Textract to parse out handwritten entities. This allows our customers to upload forms with both typed and handwritten text and improve their ability to make key decisions through data quickly and in a streamlined process. Additional, Textract easily integrates with Amazon S3 and RDS for instantaneous access to processed forms and near real-time analytics.”

-Ollie East – Director of Advanced Analytics and Data Engineering

ARG Group

ARG Group is the leading end-to-end provider of digital solutions for the corporate and government market.

“At ARG Group, we work with different transportation companies and their physical asset maintenance teams. Their processes have been refined over many years. Previous attempts to digitize the process caused too much disruption and consequently failed to be adopted. Textract allowed us to provide a hybrid solution to gain the benefits of predictive insights coming from digitizing maintenance data, whilst still allowing our customer workforce to continue following their preferred handwritten process. This is expected to result in a 22% reduction in downtime and 18% reduction in maintenance cost, as we can now predict when parts are likely to fail and schedule for maintenance to happen outside of production hours. We are also expecting the lifespan of our customer assets to increase, now that we are preventing failure scenarios.”

– Daniel Johnson, Business Segment Director, ARG Group

Belle Fleur

Belle Fleur believes the ML revolution is altering the way we live, work, and relate to one another, and will transform the way every business in every industry operates.

“We use Amazon Textract to detect text for our clients that have the three Vs when it pertains to data: Variety, Velocity, and Volume, and particularly our clients that have different document formats to process information and data properly and efficiently. The feature designed to recognize the various different formats, whether it’s tables or forms and now with handwriting recognition, is an AI dream come true for our medical, legal, and commercial real estate clients. We are so excited to roll out this new handwriting feature to all of our customers to further enhance their current solution, especially those with lean teams. We are able to allow the machine learning to handle the heavy lifting via automation to read thousands of documents in a fraction of the time and allow their teams to focus on higher-order assignments.”

– Tia Dubuisson, President, Belle Fleur

Lumiq

Lumiq is a data analytics company, holding the deep domain and technical expertise to build and implement AI- and ML-driven products and solutions. Their data products are built like building blocks and run on AWS, which helps their customers scale the value of their data and drive tangible business outcomes.

“With thousands of documents being generated and received across different stages of the consumer engagement lifecycle every day, one of our customers (a leading insurance service provider in India) had to invest several manual hours for data entry, data QC, and validation. The document sets consisted of proposal forms, supporting documents for identity, financials, and medical reports, among others. These documents were in different, non-standardized formats and some of them were handwritten, resulting in an increased average lag in lead to policy issuance and impacted customer experience.

“We leveraged Amazon’s machine learning-powered Textract to extract information and insights from various types of documents, including handwritten text. Our custom solution built on top of Amazon Textract and other AWS services helped in achieving a 97% reduction in human labor for PII redaction and a projected 70% reduction in work hours for data entry. We are excited to further deep-dive into Textract to enable our customers with an E2E paperless workflow and enhance their end-consumer experience with significant time savings.”

– Mohammad Shoaib, Founder and CEO, Lumiq (Crisp Analytics)

QL Resources

QL is among Asean’s largest egg producers and surimi manufacturers, and is building a presence in the sustainable palm oil sector with activities including milling, plantations, and biomass clean energy.

“We have a large amount of handwritten documents that are generated daily in our factories, where it is challenging to ubiquitously install digital capturing devices. With the custom solution developed by our AWS partner Axrail using Amazon Textract and various AWS services, we are able to digitize documents for both printed and handwritten hard copy forms that we generated on the production floor daily, especially in production areas where digital capturing tools are not available or economical. This is a sensible solution and completes the missing link for full digitization of our production data.”

– Chia Lik Khai, Director, QL Resources

Summary

We continually make improvements to our products based on your feedback, and we encourage you to log in to the Amazon Textract console and upload a sample document and use the APIs available. You can also talk with your account manager about how best to incorporate these new features. Amazon Textract has many resources to help you get started, like blog posts, videos, partners, and getting started guides. Check out the Textract resources page for more information.

You have millions of documents, which means you have a ton of meaningful and critical data within those documents. You can extract and process your data in seconds rather than days, and keep it secure by using Amazon Textract. Get started today.

About the Author

Andrea Morton-Youmans is a Product Marketing Manager on the AI Services team at AWS. Over the past 10 years she has worked in the technology and telecommunications industries, focused on developer storytelling and marketing campaigns. In her spare time, she enjoys heading to the lake with her husband and Aussie dog Oakley, tasting wine and enjoying a movie from time to time.

Extracting handwritten information through Amazon Textract

Over the past few years, businesses have experienced a steep rise in the volume of documents they have to deal with. These include a wide range of structured and unstructured text spread across different document formats. Processing these documents and extracting data is labor-intensive and costly. It involves complex operations that can easily go wrong, leading to regulatory breaches and hefty fines. Many digitally mature organizations, therefore, have started using intelligent document processing solutions.

Quantiphi has been a part of this transformation and witnessed higher adoption of QDox, our document processing solution built on top of Amazon Textract. It extracts information to gain business insights and automate downstream business processes. We have helped customers across insurance, healthcare, financial services, manufacturing automate loans processing, patient onboarding, and compliance management, to name a few.

Although these solutions have solved some crucial problems for businesses in reducing their manual efforts, extracting information from handwritten text has been a challenge. This is primarily because handwritten texts come with their own set of complexities, such as:

Differences in handwriting styles
Poor quality or illegible handwriting
Joined or cursive handwritten text
Compression or expansion of text

These challenges make it difficult to capture data correctly and gain meaningful insights for companies.

Use case: Insurance provider

Recently, one of our customers, a large supplemental insurance provider based in the US, was facing a similar challenge in extracting vital information from a doctor’s handwritten notes that accounted for 20% of the total documents. Initially, they manually sifted through the documents to decide on the claims payout and asked to automate the process, because it took 5–6 days to process the claim. As part of the process, we built a solution to extract printed and handwritten text from several supporting documents to verify the claim. To ease the process for policyholders, we built a user interface that could interact with users using a conversational agent, and the agent could request the necessary supporting documents to process the claim. The solution extracted information from the supporting documents, such as claim application, doctor notes, and invoices to validate the claim.

The following diagram illustrates the process flow.

The solution reduced manual intervention by over 70%, but extracting and validating information from a doctor’s handwritten note was still a task. The accuracy was low and required human intervention to validate the information, which impacted process efficiency.

Solution: Amazon Textract

As an AWS partner, we reached out to the Amazon Textract product team with a need to support handwriting recognition. They assured us they were developing a solution to address such challenges. When Amazon Textract came out with a beta version of the product for handwritten text, we were among the first to get private access. The Textract team worked closely with us and iterated quickly to improve the accuracy for a wide variety of documents. Below is an example of one of our documents that Textract recognized. In fact, our customers are also happy that it does even better than other handwriting recognition services we tested for them.

We used the Amazon Textract handwriting beta version with a few sample customer documents, and we saw it improved the accuracy of the entire process by over 90%, while reducing manual efforts significantly. This enabled us to expand the scope of our platform to additional offices of our client.

Armed with the success of our customers, we’re planning to implement the Amazon Textract handwriting solution into different processes across industries. As the product is set to launch, we believe that the implementation will become much easier and the results will improve considerably.

Summary

Overall, our partnership with AWS has helped us solve some challenging business problems to bring value to our customers. We plan to continue working with AWS to solve more challenging problems to bring real value to our customers.

There are many ways to get started with Amazon Textract: reach out to our AWS Partner Quantiphi, reach out to your account manager or solutions architect, or visit our Amazon Textract product page and learn more about the resources available.

About the Author

Vibhav Sangam Gupta is a Client Solutions Partner at Quantiphi, Inc, an Applied AI and Machine Learning software and services company focused on helping organizations translate the big promise of Big Data & Machine Learning technologies into quantifiable business impact.

Amazon Personalize now supports dynamic filters for applying business rules to your recommendations on the fly

We’re excited to announce dynamic filters in Amazon Personalize, which allow you to apply business rules to your recommendations on the fly, without any extra cost. Dynamic filters create better user experiences by allowing you to tailor your business rules for each user when you generate recommendations. They save you time by removing the need to define all possible permutations of your business rules in advance and enable you to use your most recent information to filter recommendations. You control the recommendations for your users in real time while responding to their individual needs, preferences, and changing behavior to improve engagement and conversion.

For online retail use cases, you can use dynamic filters in Amazon Personalize to generate recommendations within the criteria specified by the shopper, such as brand choices, shipping speed, or ratings. For video-on-demand users, you can make sure you include movies or television shows from their favorite directors or actors based on each individual user’s preferences. When users subscribe to premium services, or their subscription expires, you can ensure that their content recommendations are based on their subscription status in real time.

Based on over 20 years of personalization experience, Amazon Personalize enables you to improve customer engagement by powering personalized product and content recommendations and targeted marketing promotions. Amazon Personalize uses machine learning (ML) to create higher-quality recommendations for your websites and applications. You can get started without any prior ML experience, using simple APIs to easily build sophisticated personalization capabilities in just a few clicks. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

Setting up and applying filters to your recommendations is simple; it only takes only a few minutes to define and deploy your business rules. You can use the Amazon Personalize console or API to create a filter with your logic using the Amazon Personalize DSL (Domain Specific Language).

To create a filter that is customizable in real-time, you define your filter expression criteria using a placeholder parameter instead of a fixed value. This allows you to choose the values to filter by applying a filter to a recommendation request, rather than when you create your expression. You provide a filter when you call the GetRecommendations or GetPersonalizedRanking API operations, or as a part of your input data when generating recommendations in batch mode through a batch inference job.

This post walks you through the process of defining and applying item and user metadata-based recommendation filters with statically or dynamically defined filter values in Amazon Personalize.

Prerequisites

To define and apply filters, you first need to set up some Amazon Personalize resources on the Amazon Personalize console. For full instructions, see Getting Started (Console).

Create a dataset group.

Create an Interactions dataset using the following schema :

{
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        },
        {
            "name": "EVENT_VALUE",
            "type": ["null","float"]
        },
        {
            "name": "IMPRESSION",
            "type": [“null”,"string"]
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

Import the data using the following data file.

Create an Items dataset using the following schema:

{
	"type": "record",
	"name": "Items",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "ITEM_ID",
			"type": "string"
		},
		{
			"name": “TITLE”,
			"type": "string"
		},
		{
			"name": "GENRE",
			"type": [
				"null",
				"string"
			],
			"categorical": true
		},
		{
            		"name": "CREATION_TIMESTAMP",
            		"type": "long"
        		}
	],
	"version": "1.0"
}

Import data using the following data file.

Create a Users dataset using the following schema:

{
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "AGE",
            "type": ["null", "int"]
        },
        {
            "name": “SUBSCRIPTION”,
            "type": [“null”,”string”]
        }
    ],
    "version": "1.0"
}

Import the data using the following data file.
Create a solution using any recipe. In this post, we use the aws-user-personalization
Create a campaign.

Creating your filter

Now that you have set up your Amazon Personalize resources, you can define and test your custom filters.

Filter expression language

Amazon Personalize uses its own DSL called filter expressions to determine which items to exclude or include in a set of recommendations. Filter expressions can only filter recommendations based on datasets in the same dataset group, and you can only use them to filter results for solution versions (an Amazon Personalize model trained using your datasets in the dataset group) or campaigns (a deployed solution version for real-time recommendations). Amazon Personalize can filter items based on user-item interactions, item metadata, or user metadata datasets. For filter expression values, you can specify fixed values or add placeholder parameters, which allow you to set the filter criteria when you get recommendations from Amazon Personalize.

Dynamically passing values in filters is only supported for IN and = operations. For range queries, you need to continue to use static filters. Range queries use the following operations: NOT IN, <, >, <=, and >=.

The following are some examples of filter expressions.

Filtering by item

To remove all items in a genre chosen when you make your inference call with a filter applied, use the following filter expression:

EXCLUDE ItemId WHERE items.genre in ($GENRE)

To remove all items in the Comedy genre, use the following filter expression:

EXCLUDE ItemId WHERE items.genre in ("Comedy")

To include items with a number of downloads less than 20, use the following filter expression:

INCLUDE ItemId WHERE items.number_of_downloads < 20

This last filter expression includes a range query (items.number_of_downloads < 20). To perform this query in an Amazon Personalize recommendation filter, it needs to be predefined as a static filter.

Filtering by interaction

To remove items that have been clicked or streamed by a user, use the following filter expression:

EXCLUDE ItemId WHERE interactions.event_type in (“click”, “stream”)

To include items that a user has interacted with in any way, use the following filter expression:

INCLUDE ItemId WHERE interactions.event_type in ("*")

Filtering by item based on user properties

To exclude items where the number of downloads is less than 20 if the current user’s age is greater than 18 but less than 30, use the following filter expression:

EXCLUDE ItemId WHERE items.number_of_downloads < 20 IF CurrentUser.age > 18 AND CurrentUser.age < 30

You can also combine multiple expressions together using a pipe ( | ) to separate them. Each expression is evaluated independently and the recommendations you receive is the union of those expressions.

The following filter expression includes two expressions. The first expression includes items in the Comedy genre; the second expression excludes items with a description of classic, and the results are the union of the results of both filters.

INCLUDE Item.ID WHERE items.genre IN (“Comedy”) | EXCLUDE ItemID WHERE items.description IN ("classic”)

Filters can also use multiple filter expressions with dynamically passed values to be applied to your recommendations. The first expression includes a genre defined in the request for recommendations with the $GENRE value, and second expression excludes a description defined in the request for recommendations with the $DESC value. The result of the filter is the union of the results of both expressions that includes the $GENRE but excludes items with a description defined in the request for recommendations with the $DESC value:

INCLUDE Item.ID WHERE items.GENRE IN ($GENRE) | EXCLUDE ItemID WHERE item.DESCRIPTION IN ("$DESC”)

For more information, see Datasets and Schemas. For further details on filter definition DSL, see our documentation.

Creating a filter on the Amazon Personalize console

You can use the preceding DSL to create a customizable filter on the Amazon Personalize console. Complete the following steps:

On the Amazon Personalize console, on the Filters page, choose Create filter.
For Filter name, enter the name for your filter.
Select Build expression or add your expression manually to create your custom filter.
Enter a value of $ plus a parameter name that is similar to your property name and easy to remember (for example, $GENRE).
Optionally, to chain additional expressions with your filter choose, +.
To add additional filter expressions, choose Add expression.
Choose Finish.

Creating a filter takes you to a page containing detailed information about your filter. Here you can view more information about your filter, including the filter ARN and the corresponding filter expression you created (see the following screenshot). You can also delete filters on this page or create more filters from the summary page.

You can also create filters via the createFilter API in Amazon Personalize. For more information, see CreateFilter.

Range queries are not supported when dynamically passing values to recommendations filters. For example, filters excluding items with less than 20 views in the items metadata must be defined as static filters.

Applying your filter via the Amazon Personalize console

The Amazon Personalize console allows you to spot-check real-time recommendations from the Campaigns page. On this page, you can test your filters while retrieving recommendations for a specific user on demand. To do so, navigate to the Campaign tab; this should be in the same dataset group that you used to create the filter. You can then test the impact of applying the filter on the recommendations.

The following screenshot shows results when you pass the value Action as the parameter to a filter based on the items’ genre.

When applying filters to your recommendations in real time via the console, the Filter parameters section auto-populates with the filter parameter and expects a value to be passed to the filter when you choose Get recommendations button.

Applying your filter via the SDK

You can also apply filters to recommendations that are served through your SDK or APIs by supplying the filterArn as an additional and optional parameter to your GetRecommendations calls. Use filterArn as the parameter key and supply the filterArn as a string for the value. filterArn is a unique identifying key that the CreateFilter API call returns. You can also find a filter’s ARN on the filter’s detailed information page.

The following example code is a request body for GetRecommendations API that applies a filter to a recommendation:

{
    "campaignArn": "arn:aws:personalize:us-west-2:000000000000:campaign/test-campaign",
    "userId": "1",
    "itemId": "1",
    "numResults": 5,
    "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/test-filter"
    "filter-values": ‘[{
	"GENRE": ““ACTION”, “HORROR””
	}]

Summary

Recommendation filters in Amazon Personalize allow you to further customize your recommendations for each user in real time to provide an even more tailored experience that improves customer engagement and conversion. For more information about optimizing your user experience with Amazon Personalize, see What Is Amazon Personalize?

About the Author

Vaibhav Sethi is the Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys hiking and reading.

Samuel Ashman is a Technical Writer for Amazon Personalize. His goal is to make it easier for developers to use machine learning and AI in their applications. His studies in computer science allow him to understand the challenges developers face. In his free time, Samuel plays guitar and exercises.

Parth Pooniwala is a Senior Software Engineer with Amazon Personalize focused on building AI-powered recommender systems at scale. Outside of work he is a bibliophile, film enthusiast, and occasional armchair philosopher.

List of Accepted Papers

Keynote

Walking the Boundary of Learning and Interaction (Dorsa Sadigh)

Main Conference

Findings of EMNLP

Workshops and Co-Located Conferences

Learning and Influencing Latent Intent

Modeling Agent Strategies

Influencing by Optimizing for Long-Term Rewards

Experiments

2D Navigation

Robotic Air Hockey

Final Thoughts

Error sources of ∆t

Data and attempted approaches

Managed model training with Amazon SageMaker

Rule-based model

ML regression model

Hierarchical Bayesian model

Test models in the 2020 races

Metrics on model performance

Summary

About the Author

Handwriting recognition with Amazon Textract

New supported languages in Amazon Textract

AWS customer success stories

Intuit

Veeva

Baker Tilly

ARG Group

Belle Fleur

Lumiq

QL Resources

Summary

About the Author

Use case: Insurance provider

Solution: Amazon Textract

Summary

About the Author

Prerequisites

Creating your filter

Filter expression language

Filtering by item

Filtering by interaction

Filtering by item based on user properties

Creating a filter on the Amazon Personalize console

Applying your filter via the Amazon Personalize console

Applying your filter via the SDK

Summary

About the Author

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.