To balance these competing demands, policymakers need analytical tools that can evaluate the tradeoffs between mobility and COVID-19 infections. Furthermore, such tools should be fine-grained, able to test out heterogeneous plans—for example, allowing one level of mobility at essential retail, another level at gyms, and yet another at restaurants—so that policymakers can tailor restrictions to the specific risks and needs of each sector. At the same time, the tool also needs to be scalable, supporting analyses for a massive number of potential policies so that policymakers can find the best option for their jurisdiction.
To fulfill these needs, we developed a novel computational tool, which we built in collaboration with the Biocomplexity Institute & Initiative at UVA to support the Virginia Department of Health (VDH). Described in our award-winning KDD 2021 paper, our tool enables policymakers to assess the costs and benefits of thousands of different mobility measures, based on millions of simulations from our underlying epidemiological model. We designed our tool to fulfill VDH’s desire to have a quantitative and comprehensive analysis of a range of reopening policies. With their guidance, we developed an interactive dashboard, where policymakers can select various proposed changes in mobility and observe their predicted impacts on COVID-19 infections over time and across regions.
Our dashboard focuses on mobility to five key categories of places: Restaurants, Gyms, Religious Organizations, Essential Retail (grocery stores, pharmacies, convenience stores), and Retail (clothing stores, book stores, hardware stores, etc.). For each category, the user can use sliders to choose a target level of mobility (e.g., 50% of normal levels, based on pre-pandemic mobility), or they can choose to continue current levels of mobility at these places. The other panels on the dashboard then visualize predicted COVID-19 infections under the selected mobility plan, and compare these outcomes to what would happen if all categories remained at their current levels of mobility.
Our tool enables policymakers to comprehensively analyze pandemic tradeoffs, by quantifying visits lost under each mobility plan as well as predicted infections. The sliders for each category allow them to test fine-grained, heterogeneous policies. Furthermore, the flexibility of our approach (i.e., allowing any combination of mobility levels) results in an exponential number of scenarios to test. To scale our modeling efforts, our tool features a robust computational infrastructure that compresses 2 years of compute time into the span of a few days.
Our mobility networks encode the hourly movements of people from census block groups (CBGs) to points of interest (POIs), which are non-residential locations such as restaurants, grocery stores, and churches. Using iterative proportional fitting, we infer these networks from aggregated, anonymized location data provided by SafeGraph. In this work, we infer hourly networks for the Washington DC, Virginia Beach, and Richmond metropolitan areas, three of the largest metropolitan areas in Virginia. From November 1 to December 31, 2020, their resulting networks contain 3.4 billion hourly edges between CBGs and POIs.
We integrate the mobility networks, along with other data sources such as daily mask use, into our model. The key to our model is that it maintains the number of people in each CBG who are susceptible (S), exposed (E), infectious (I), or removed (R).
These CBG states are updated in each hour of the simulation, based on transmission dynamics that capture both household transmission and transmission occurring at POIs. That is, if there are susceptible and infectious individuals visiting a POI at the same time, then we model some probability of new infection occurring. That probability depends on the POI’s area in square feet, its median dwell time, the percentage of people wearing masks, and the number of susceptible and infectious visitors. Based on all of these factors, our model realistically captures who was infected where and when, down to the individual POI and hour.
To validate our models, we compare its predictions against actual daily COVID-19 cases and deaths, as reported by The New York Times. In our initial work 3, published in Nature 2020, we showed that our dynamic mobility networks enable even these relatively simple SEIR models with minimal free parameters to accurately fit real case trajectories and predict case counts in held-out time periods, despite substantial changes in population behavior during the pandemic. Integrating these networks furthermore allows us to capture the fine-grained spread of the virus, enabling analyses of the riskiest venues to reopen and the most at-risk populations.
In this work, we sought to translate our model into a tool that can directly support COVID-19 decision-makers, motivated by our interactions with the Virginia Department of Health. This goal required many extensions to our computational pipeline, including fitting the model to new regions and time periods, and improving our computational infrastructure to deploy the model at scale. Furthermore, to keep pace with developments in the pandemic, we introduced new real-world features to the model such as daily mask use, time-varying case and death detection rates, and model initialization based on historical reported cases/deaths. These additions allowed us to accurately fit real COVID-19 trajectories in Virginia, and we showed that the inclusion of our new features contributed substantially toward reducing model loss. Most importantly, we worked with VDH to design use cases of our model that were most relevant to their needs, and developed a new dashboard to effectively communicate thousands of results from our model. Our full pipeline—the extended model, the computational infrastructure, and the new dashboard—constitutes advancements in this work that allowed us to truly transform our scientific model into a tool for real-world impact.
Using our model
Our fitted model can be applied to a wide variety of use cases. First, we can use it for retrospective analyses, by leveraging the model’s ability to capture who got infected where and when.
For example, we can use the model to compare the learned infection rates of lower-income and higher-income CBGs. What’s striking is that our model correctly predicts disparities from mobility data alone, even though we did not give our model any CBG demographics during runtime (only during analysis). In our prior work, we showed that two mechanisms in the mobility data explained these predicted disparities: lower-income CBGs were not able to reduce their mobility as much during the pandemic, and the POIs that they go to (even in the same category) tend to be more crowded with longer visits, and thus riskier. In this work, we show that this trend extends to both waves of the pandemic and to new metropolitan areas.
We can also use the model for forward-facing experiments. Essentially, the model has many different interpretable inputs, so we can simply modify one of those inputs, run the model, and observe what happens to the model’s predicted infections. For example, to generate data for our dashboard, we modify the mobility networks to reflect the user’s selected levels of mobility for each category, and run the model forward to produce predicted infections. We can also use our model to analyze vaccination strategies; for example, by reducing transmission rates per CBG based on the percentage of the CBG that is vaccinated.
Discussion & next steps
Our approach is not without its limitations, which we have discussed with policymakers. For instance, the mobility data from SafeGraph does not cover all POIs (e.g., limited coverage of nursing homes) or populations (e.g., children), and our model makes necessary but simplifying assumptions about the dynamics of disease transmission. Furthermore, in this work, we focused on how changes in mobility impact transmission, but where do these changes in mobility come from and how can we effect them? In future work, we plan to develop new models to answer these questions, to analyze and predict how complex mobility networks change in response to policy interventions and other pandemic events.
That said, in this work we’ve addressed a significant part of the puzzle, by introducing a tool that provides a quantitative and comprehensive near real-time assessment of the effects of mobility on transmission. Our underlying model is furthermore capable of many more types of analyses, from informing inequities to evaluating future vaccination strategies. In fact, we are now supporting the Virginia Department of Health on their vaccination efforts and extending our model to evaluate different vaccination policies. As the pandemic evolves, we will continue building decision-support tools and advancing the capabilities of our model, so that we can best support the needs of policymakers.
Special thanks to the SAIL blog editors, Emma Pierson, and Pang Wei Koh for their helpful feedback on this post. This blog post is based on our paper in KDD 2021:
Supporting COVID-19 policy response with large-scale mobility-based modeling. Serina Chang, Mandy L. Wilson, Bryan Lewis, Zakaria Mehrab, Komal K. Dudakiya, Emma Pierson, Pang Wei Koh, Jaline Gerardin, Beth Redbird, David Grusky, Madhav Marathe, and Jure Leskovec. KDD 2021 (Applied Data Science Track, Best Paper Award).
S. Gao, J. Rao, Y. Kang, et al. Association of mobile phone location data indications of travel and stay-at-home mandates with COVID-19 infection rates in the US. JAMA Netw Open (2020). ↩
J. Oh, HY. Lee, Q. Khuong, et al. Mobility restrictions were associated with reductions in COVID-19 incidence early in the pandemic: evidence from a real-time evaluation in 34 countries. Sci Rep 11, 13717 (2021). ↩
S. Chang, E. Pierson, P.W. Koh, et al. Mobility network models of COVID-19 explain inequities and inform reopening. Nature 589, 82–87 (2020). ↩
Imitation Learning is a promising approach to endow robots with various complex manipulation capabilities. By allowing robots to learn from datasets collected by humans, robots can learn to perform the same skills that were demonstrated by the human. Typically, these datasets are collected by having humans control robot arms, guiding them through different tasks. While this paradigm has proved effective, a lack of open-source human datasets and reproducible learning methods make assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation.
Based on the study, we derive several lessons to understand the challenges in learning from human demonstrations, including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available.
We have open-sourced our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Please see the robomimic website for more information.
Why is learning from human-labeled datasets difficult?
We explore five challenges in learning from human-labeled datasets.
(C1) Unobserved Factors in Human Decision Making. Humans are not perfect Markovian agents. In addition to what they currently see, their actions may be influenced by other external factors – such as the device they are using to control the robot and the history of the actions that they have provided.
(C2) Mixed Demonstration Quality. Collecting data from multiple humans can result in mixed quality data, since some people might be better quality supervisors than others.
(C3) Dependence on dataset size. When a robot learns from an offline dataset, it needs to understand how it should act (action) in every scenario that it might encounter (state). This is why the coverage of states and actions in the dataset matters. Larger datasets are likely to contain more situations, and are therefore likely to train better robots.
(C4) Train Objective ≠ Eval Objective. Unlike traditional supervised learning, where validation loss is a strong indicator of how good a model is, policies are usually trained with surrogate losses. Consider an example where we train a policy via Behavioral Cloning from a set of demonstrations on a block lifting task. Here, the policy is trained to replicate the actions taken by the demonstrator, but this is not necessarily equivalent to optimizing the block lifting success rate (see the Dagger paper for a more precise explanation). This makes it hard to know which trained policy checkpoints are good without trying out each and every model directly on the robot – a time consuming process.
(C5) Sensitivity to Agent Design Decisions. Performance can be very sensitive to important agent design decisions, like the observation space and hyperparameters used for learning.
In this section, we summarize the tasks (5 simulated and 3 real), datasets (3 different variants), algorithms (6 offline methods, including 3 imitation and 3 batch reinforcement), and observation spaces (2 main variants) that we explored in our study.
Task Reset Distributions
When measuring the task success rate of a policy, the policy is evaluated across several trials. At the start of each trial, the initial placement of all objects in the task are randomized from a task reset distribution. The videos below show this distribution for each task. This gives an impression of the range of different scenarios that a trained policy is supposed to be able to handle.
These datasets consist of rollouts from a series of SAC agent checkpoints trained on Lift and Can, instead of humans. As a result, they contain random, suboptimal, and expert data due to the varied success rates of the agents that generated the data. This kind of mixed quality data is common in offline RL works (e.g. D4RL, RLUnplugged).
These datasets consist of 200 demonstrations collected from a single proficient human operator using RoboTurk.
These datasets consist of 300 demonstrations collected from six human operators of varied proficiency using RoboTurk. Each operator falls into one of 3 groups – “Worse”, “Okay”, and “Better” – each group contains two operators. Each operator collected 50 demonstrations per task. As a result, these datasets contain mixed quality human demonstration data. We show videos for a single operator from each group.
We evaluated 6 different offline learning algorithms in this study, including 3 imitation learning and 3 batch (offline) reinforcement learning algorithms.
BC: standard Behavioral Cloning, which is direct regression from observations to actions.
BC-RNN: Behavioral Cloning with a policy network that’s a recurrent neural network (RNN), which allows modeling temporal correlations in decision-making.
HBC: Hierarchical Behavioral Cloning, where a high-level subgoal planner is trained to predict future observations, and a low-level recurrent policy is conditioned on a future observation (subgoal) to predict action sequences (see Mandlekar*, Xu* et al. (2020) and Tung*, Wong* et al. (2021) for more details).
CQL: Conservative Q-Learning, a batch reinforcement learning method proposed in Kumar et al. (2020).
IRIS: Implicit Reinforcement without Interaction, a batch reinforcement learning method proposed in Mandlekar et al. (2020).
We study two different observation spaces in this work – low-dimensional observations and image observations.
We provide examples of the image observations used in each task below.
Summary of Lessons Learned
In this section, we briefly highlight the lessons we learned from our study. See the paper for more thorough results and discussion.
Lesson 1: History-dependent models are extremely effective.
We found that there is a substantial performance gap between BC-RNN and BC, which highlights the benefits of history-dependence. This performance gap is larger for longer-horizon tasks (e.g. ~55% for the Transport (PH) dataset compared to ~5% for the Square (PH) dataset)) and also larger for multi-human data compared to single-human data (e.g.~25% for Square (MH) compared to ~5% for Square (PH)).
Lesson 2: Batch (Offline) RL struggles with suboptimal human data.
Recent batch (offline) RL algorithms such as BCQ and CQL have demonstrated excellent results in learning from suboptimal and multi-modal machine-generated datasets. Our results confirm the capacity of such algorithms to work well – BCQ in particular performs strongly on our agent-generated MG datasets that consist of a diverse mixture of good and poor policies (for example, BCQ achieves 91.3% success rate on Lift (MG) compared to BC which achieves 65.3%).
Surprisingly though, neither BCQ nor CQL performs particularly well on these human-generated datasets. For example, BCQ and CQL achieve 62.7% and 22.0% success respectively on the Can (MH) dataset, compared to BC-RNN which achieves 100% success. This puts the ability of such algorithms to learn from more natural dataset distributions into question (instead of those collected via RL exploration or pre-trained agents). There is an opportunity for future work in batch RL to resolve this gap.
Lesson 3: Improving offline policy selection is important.
Lesson 4: Observation space and hyperparameters play a large role in policy performance.
Lesson 5: Using human data for manipulation is promising.
Lesson 6: Study results transfer to real world.
We collected 200 demonstrations per task, and trained a BC-RNN policy using identical hyperparameters to simulation, with no hyperparameter tuning. We see that in most cases, performance and insights on what works in simulation transfer well to the real world.
Below, we present examples of policy failures on the Tool Hang task, which illustrate its difficulty, and the large room for improvement.
We also show that results from our observation space study hold true in the real world – visuomotor policies benefit strongly from wrist observations and pixel shift randomization.
Learning from large multi-human datasets can be challenging.
Large multi-human datasets hold promise for endowing robots with dexterous manipulation capabilities.
Studying this setting in simulation can enable reproducible evaluation and insights can transfer to real world.
[July 20, 2021]Our work was recently covered by the New York Times here. You can also find a technical preprint here.
With the rise of large online computer science courses, there
is an abundance of high-quality content. At the same time, the sheer
size of these courses makes high-quality feedback to student work more
and more difficult. Talk to any educator, and they will tell you how
instrumental instructor feedback is to a student’s learning process.
Unfortunately, giving personalized feedback isn’t cheap: for a large
online coding course, this could take months of labor. Today, large
online courses either don’t offer feedback at all or take shortcuts that
sacrifice the quality of the feedback given.
Several computational approaches have been proposed to automatically
produce personalized feedback, but each falls short: they either require
too much upfront work by instructors or are limited to very simple
assignments. A scalable algorithm for feedback to student code that
works for university-level content remains to be seen. Until now, that
is. In a recent paper, we proposed a new AI system based on
meta-learning that trains a neural network to ingest student code and
output feedback. Given a new assignment, this AI system can quickly
adapt with little instructor work. On a dataset of student solutions to
Stanford’s CS106A exams, we found the AI system to match human
instructors in feedback quality.
To test the approach in a real-world setting, we deployed the AI system
at Code in Place 2021, a large online computer science course spun out
of Stanford with over 12,000 students, to provide feedback to an
end-of-course diagnostic assessment. The students’ reception to the
feedback was overwhelmingly positive: across 16,000 pieces of feedback
given, students agreed with the AI feedback 97.9% of the time, compared
to 96.7% agreement to feedback provided by human instructors. This is,
to the best of our knowledge, the first successful deployment of machine
learning based feedback to open-ended student work.
In the middle of the pandemic, while everyone is forced to social
distance in the confines of their own homes, thousands of people across
the world were hard at work figuring out why their code was stuck in an
infinite loop. Stanford CS106A, one of the university’s most popular
courses and its largest introductory programming offering with nearly
1,600 students every year, grew even bigger. Dubbed Code in Place,
CS106A instructors Chris Piech, Mehran Sahami and Julie Zelenski wanted
to make the curriculum and teaching philosophy of CS106A publicly
available as an uplifting learning experience for students and adults
alike during a difficult time. In its inaugural showing in April ‘20,
Code in Place pulled together 908 volunteer teachers to run an online
course for 10,428 students from around the world. One year later, with
the pandemic still in full force in many areas of the world, Code in
Place kicked off again, growing to over 12,000 students and 1,120
While crowd-sourcing a teaching team did make a lot of things possible
for Code in Place that usual online courses lack, there are still limits
to what can be done with a class of this scale. In particular, one of
the most challenging hurdles was providing high-quality feedback to
What is feedback?
Everyone knows high quality content is an important
ingredient for learning, but another equally important but more subtle
ingredient is getting high quality feedback. Knowing the breakdown of
what you did well and what the areas for improvement are, is fundamental
to understanding. Think back to when you first got started programming:
for me, small errors that might be obvious to someone more experienced,
cause a lot of frustration. This is where feedback comes in, helping
students overcome this initial hurdle with instructor guidance.
Unfortunately, feedback is something online code education has struggled
with. With popular “massively open online courses” (MOOCs), feedback on
student code boils down to compiler error messages, standardized
tooltips, or multiple-choice quizzes.
You can find an example of each below. On the left, multiple choice
quizzes are simple to grade and can easily assign numeric scores to
student work. However, feedback is limited to showing the right answer,
which does little to help students understand their underlying
misconceptions. The middle picture shows an example of an opaque
compiler error complaining about a syntax issue. As a beginner learning
to code, error messages are very intimidating and difficult to
interpret. Finally, on the right, we see an example of a standardized
tooltip: upon making a mistake, a pre-specified message is shown.
Pre-specified messages tend to be very vague: here, the tooltip just
tells us our solution is wrong and to try something different.
It makes a lot of sense why MOOCs settle for subpar feedback: it’s
really difficult to do otherwise! Even for Stanford CS106A, the teaching
team is constantly fighting the clock in office hours in an attempt to
help everyone. Outside of Stanford, where classes may be more
understaffed, instructors are already unable to provide this level of
individualized support. With large online courses, the sheer size makes
any hope of providing feedback unimaginable. Last year, Code in Place
gave a diagnostic assessment during the course for students to summarize
what they have learned. However, there was no way to give feedback
scalably to all these student solutions. The only option was to release
the correct solutions online for students to compare to their own work,
displacing the burden of feedback onto the students.
Code in Place and its MOOC cousins are examples of a trend of education
moving online, which might only grow given the lasting effects of the
pandemic. This shift surfaces a very important challenge: can we provide
feedback at scale?
The feedback challenge.
In 2014, Code.org, one
of the largest online platforms for code education, launched an
initiative to crowdsource thousands of instructors to provide feedback
to student solutions [1,2]. The hope of the initiative was to tag enough
student solutions with feedback so that for a new student, Code.org
could automatically provide feedback by matching the student’s solution
to a bank of solutions already annotated with feedback by an instructor.
Unfortunately, Code.org quickly found that even after thousands of
aggregate hours spent providing feedback, instructors were only
scratching the surface. New students were constantly coming up with new
mistakes and new strategies. The initiative was cancelled after two
years and has not been reproduced since.
We might ask: why did this happen? What is it about feedback that makes
it so difficult to scale? In our research, we came up with two parallel
First, providing feedback to student code is hard work. As an
instructor, every student solution requires me to reason about the
student’s thought process to uncover what misconceptions they might have
had. If you have ever had to debug someone else’s code, providing
feedback is at least as hard as that. In a previous research paper, we found that producing
feedback for only 800 block-based programs took a teaching team a
collective 24.9 hours. If we were to do that for all of Code in Place,
it would take 8 months of work.
Second, students approach the same programming problem in an exponential
number of ways. Almost every new student solution will be unique, and a
single misconception can manifest itself in seemingly infinite ways. As
a concrete example, even after seeing a million solutions to a Code.org
problem, there is still a 15% chance that a new student generates a
solution never seen before. Perhaps not coincidentally, it turns out the
distribution of student code closely follows the famous Zipf
distribution, which reveals an extremely “long tail” of rare solutions
that only one student will ever submit. Moreover, this close
relationship to Zipf doesn’t just apply to Code.org; it is a much more
general phenomenon. We see similar patterns for student work for
university level programming assignments in Python and Java, as well as
free response solutions to essay-like prompts.
So, if asking instructors to manually provide feedback at scale is
nearly impossible, what else can we do?
“If humans can’t do it, maybe machines
can” (famous last words). After all, machines process information a lot
faster than humans do. There have been several approaches applying
computational techniques to provide feedback, the simplest of which is
unit tests. An instructor can write a collection of unit tests for the
core concepts and use them to evaluate student solutions. However, unit
tests expect student code to compile and, often, student code does not
due to errors. If we wish to give feedback on partially complete
solutions, we need to be able to handle non-compiling code. Given the
successes of AI and deep learning in computer vision and natural
language, there have been attempts of designing AI systems to
automatically provide feedback, even when student code does not compile.
Given a dataset of student code, we can ask an
instructor to provide feedback for each of the solutions, creating a
labeled dataset. This can be used to train a deep learning model to
predict feedback for a new student solution. While this is great in
theory, in practice, compiling a sufficiently large and diverse dataset
is difficult. In machine learning, we are accustomed to datasets with
millions of labeled examples since annotating an image is both cheap and
requires no domain knowledge. On the other hand, annotating student
code with feedback is both time-consuming and needs expertise, limiting
datasets to be a few thousand examples in size. Given the Zipf-like
nature of student code, it is very unlikely that a dataset of this size
can capture all the different ways students approach a problem. This is
reflected in practice as supervised attempts perform poorly on new
While annotating student code is difficult work,
instructors are really good at thinking about how students would tackle
a coding problem and what mistakes they might make along the way.
Generative grading [2,3] asks instructors to distill this intuition
about student cognition into an algorithm called a probabilistic
grammar. Instructors specify what misconceptions a student might make
and how that translates to code. For example, if a student forgets a
stopping criterion resulting in an infinite loop, their program likely
contains a “while” statement with no “break” condition. Given such an
algorithm, we can run it forward to generate a full student solution
with all misconceptions already labeled. Doing this repeatedly, we
curate a large dataset to train a supervised model. This approach was
very successful on block-based code, where performance rivaled human
instructors. However, the success of it hinges on a good algorithm.
While tractable for block-based programs, it became exceedingly
difficult to build a good algorithm for university level assignments
where student code is much more complex.
The supervised approach requires the instructor to curate a dataset of
student solutions with feedback where as the generative grading approach
requires the instructor to build an algorithm to generate annotated
data. In contrast, the meta-learning approach requires the instructor to
annotate feedback for K examples across N programming problems. K is
typically very small (~10) and N not much larger (~100).
Meta-learning how to give feedback.
So far, neither approach is quite
right. In different ways, supervised learning and generative grading
both expect too much from the instructor. As they stand, for every new
coding exercise, the instructor would have to put in days, if not weeks
to months of effort. In an ideal world, we would shift more of the
burden of feedback onto the AI system. While we would still like
instructors to play a role, the AI system should bear the onus of
quickly adapting to every new exercise. To accomplish this, we built an
AI system to “learn how to learn” to give feedback.
Meta-learning is an old idea from the 1990s [9, 10] that has seen a
resurgence in the last five years. Recall that in supervised learning a
model is trained to solve a single task; in meta-learning, we solve many
tasks at once. The catch is that we are limited to a handful of labeled
examples for every task. Whereas supervised learning gets lots of labels
for one task, we spread the annotation effort evenly across many tasks,
leaving us with a few labels per task. In research literature, this is
called the few-shot classification problem. The upside to meta-learning
is that after training, if your model is presented with a new task that
it has not seen before, it can quickly adapt to solve it with only a
“few shots” (i.e., a few annotations from the new task).
So, what does meta-learning for feedback look like? To answer that, we
first need to describe what composes a “task” in the world of
educational feedback. Last year, we compiled a dataset of student
solutions from eight CS106A exams collected over the last three academic
years. Each exam consists of four to six programming exercises in which
the student must write code (but is unable to run or compile it for
testing). Every student solution is annotated by an instructor using a feedback rubric containing a list of misconceptions tailored to a single
problem. As an example, consider a coding exercise that asks the student
to write a Python program that requires string insertion. A potential
feedback rubric is shown in the left image: possible misconceptions are
inserting at the wrong location or inserting the wrong string. So, we
can treat every misconception as its own task. The string insertion
example would comprise of four tasks.
One of the key ideas of this approach is to frame
the feedback challenge as a few-shot classification problem. Remember
that the reasons why previous methods for automated feedback struggled
were the (1) high cost of annotation and (2) diversity of student
solutions. Casting feedback as a few-shot problem cleverly circumvents
both challenges. First, meta-learning can leverage previous data on old
exams to learn to provide feedback to a new exercise with very little
upfront cost. We only need to label a few examples for the new exercise
to adapt the meta-learner and importantly, do not need to train a new
model from scratch. Second, there are two ways to handle diversity: you
can go for “depth” by training on a lot of student solutions for a
single problem to see different strategies, or you can go for “breadth”
and get sense of diverse strategies through student solutions on a lot
of different problems. Meta-learning focuses its efforts on capturing
“breadth”, accumulating more generalizable knowledge that can be shared
We will leave the details of the meta-learner to the technical report.
In short, we propose a new deep neural network called a ProtoTransformer
Network that combines the strengths of BERT from natural language
processing and Prototypical Networks from few-shot learning literature.
This architecture, in tandem with technical innovations – creating
synthetic tasks for code, self-supervised pretraining on unlabeled code,
careful encoding of variable and function names, and adding question and
rubric descriptions as side information – together produce a highly
performant AI system for feedback. To help ground this in context, we
include three examples on the bottom of the last page of the AI system
predicting feedback to student code. The predictions were taken from
actual model output on student submissions.
Aside from looking at qualitative examples, we can
measure its performance quantitatively by evaluating the correctness of
the feedback an AI system gave on exercises not used in training. A
piece of feedback is considered correct if a human instructor annotated
the student solution with it.
We consider two experimental settings for evaluation:
Held-out Questions: we randomly pick 10% of questions across all exams
to evaluate the meta-learner. This simulates instructors providing
feedback for part of every exam, leaving a few questions for the AI to
give feedback for.
Held-out Exams: we hold out an entire exam for evaluation. This is a
much harder setting as we know nothing about the new exam but also most
faithfully represents an autonomous feedback system.
We measure the performance of human instructors by asking several
teaching assistants to grade the same student solution and recording
agreement. We also compare the meta-learner to a supervised baseline. As
shown in the graph on the previous page, the meta-learner outperforms
the supervised baseline by up to 24 percentage points, showcasing the
utility of meta-learning. More surprisingly, we find that the
meta-learner surpasses human performance by 6% in held-out questions.
However, there is still room for improvement as we fall short 8% to
human performance on held-out exams – a harder challenge. Despite this,
we find these results encouraging: previous methods for feedback could
not handle the complexity of university assignments, let alone approach,
or match the performance of instructors.
Automated feedback for Code in Place.
Taking a step back, we began
with the challenge of feedback, an important ingredient to a student’s
learning process that is frustratingly difficult to scale, especially
for large online courses. Many attempts have been made towards this,
some based on crowdsourcing human effort and others based on
computational approaches with and without AI, but all of which have
faced roadblocks. In late May ‘21, we built and tested an approach based
on meta-learning, showing surprisingly strong results on university
level content. But admittedly, the gap between ML research and
deployment can be large, and it remained to be shown that our approach
can give high quality feedback at scale in a live application. Come
June, Code in Place ‘21 was gearing up for its diagnostic assessment.
In an amazing turnout, Code in Place ‘21 had 12,000 students. But
grading 12,000 students each solving 5 problems would be beyond
intractable. To put it into numbers, it would take 8 months of human
labor, or more than 400 teaching assistants working standard
nine-to-five shifts to manually grade all 60,000 solutions.
The Code in Place ‘21 diagnostic contained five new questions that were
not in the CS106A dataset used to train the AI system. However, the
questions were similar in difficulty and scope, and correct solutions
were roughly the same length as those in CS106A. Because the AI system
was trained with meta-learning, it could quickly adapt to these new
questions. Volunteers from the teaching team helped annotate a small
portion of the student solutions that the AI meta-learning algorithm
To showcase feedback to students, we were joined by Alan Chang and together we built an application for students
to see their solutions and AI feedback (see image above). We were
transparent in informing students that an AI was providing feedback. For
each predicted misconception, we associated it with a message (shown in
the blue box) to the student. We carefully crafted the language of these
messages to be helpful and supportive of the student’s learning. We also
provided finer grained feedback by highlighting portions of the code
that the AI system weighted more strongly in making its prediction. In
the image above, the student forgot to cast the height to an integer. In
fact, the highlighted line should be height = int(input(…)), which the
AI system picked up on.
Human versus AI Feedback
For each question, we asked the student to
rate the correctness of the feedback provided by clicking either a
“thumbs up” or a “thumbs down” before they can proceed to the next
question (see lower left side of the image above). Additionally, after a
student reviewed all their feedback, we asked them to rate the AI system
holistically on a five-point scale. As part of the deployment, some of
the student solutions were given feedback by humans but students did not
know which ones. So, we can compare students’ holistic and per-question
rating when given AI feedback versus instructor feedback.
Here’s what we found:
1,096 students responded to a survey after receiving 15,134 pieces
of feedback. The reception was overwhelmingly positive: Across all
15k pieces of feedback, students agreed with AI suggestions 97.9% ±
0.001 of the time.
We compared student agreement with AI feedback against agreement
with instructor feedback, where we surprisingly found the AI system
surpass human instructors: 97.9% > 96.7% (p-value 0.02). The
improvement was driven by higher student ratings on constructive
feedback – times when the algorithm suggested an improvement.
On the five-point scale, the average holistic rating of usefulness
by students was 4.6 ± 0.018 out of 5.
Given the wide diversity of students participating in Code in Place,
we segmented the quality of AI feedback by gender and country, where
we found no statistically significant difference across
To the best of our knowledge, this was both the first successful
deployment of AI-driven feedback to open-ended student work and the
first successful deployment of prototype networks in a live application.
With promising results in both a research and a real-world setting, we
are optimistic about the future of artificial intelligence in code
education and beyond.
How could AI feedback impact teaching?
A successful deployment of an
automated feedback system raises several important questions about the
role of AI in education and more broadly, society.
To start, we emphasize that what makes Code in Place so successful is
its amazing teaching team made up of over 1,000 section leaders. While
feedback is an important part of the learning experience, it is one
component of a larger ecosystem. We should not incorrectly conclude from
our results that AI can automate teaching or replace instructors – nor
should the system be used for high-stakes grading. Instead, we should
view AI feedback as another tool in the toolkit for instructors to
better shape an amazing learning experience for students.
Further, we should evaluate our AI systems with a double bottom line of
both performance and fairness. Our initial experiments suggest that the
AI is not biased but our initial results are being supplemented by a
more thorough audit. To minimize the chance of providing incorrect
feedback to student work, future research should encourage AI systems to
learn to say: “I don’t know”.
Third, we find it important that progress in education research be
public and available for others to critique and build upon.
Finally, this research opens so many directions moving forward. We hope
to use this work to enable teachers to better reach their potential.
Moreover, an AI feedback makes it scalable to study not just students’
final solutions, but the process of how students solve their
assignments. Finally, there is a novel opportunity for computational
approaches towards unraveling the science of how students learn.
Many thanks to Chelsea Finn, Chris Piech, and Noah
Goodman for their guidance. Special thanks to Chris for his support the
last three years through the successes and failures towards AI feedback
prediction. Also, thanks to Alan Cheng, Milan Mosse, Ali Malik, Yunsung
Kim, Juliette Woodrow, Vrinda Vasavada, Jinpeng Song, and John Mitchell
for great collaborations. Thank you to Mehran Sahami, Julie Zelenki,
Brahm Capoor and the Code in Place team who supported this project.
Thank you to the section leaders who provided all the human feedback
that the AI was able to learn from. Thank you to the Stanford Institute
for Human-Centered Artificial Intelligence (in particular the Hoffman-Yee Research Grant) and the Stanford
Interdisciplinary Graduate Fellowship for their support.
One of the most common assumptions in machine learning (ML) is that the training and test data are independently and identically distributed (i.i.d.). For example, we might collect some number of data points and then randomly split them, assigning half to the training set and half to the test set.
However, this assumption is often broken in ML systems deployed in the wild. In real-world applications, distribution shifts— instances where a model is trained on data from one distribution but then deployed on data from a different distribution— are ubiquitous. For example, in medical applications, we might train a diagnosis model on patients from a few hospitals, and then deploy it more broadly to hospitals outside the training set 1; and in wildlife monitoring, we might train an animal recognition model on images from one set of camera traps and then deploy it to new camera traps 2.
A large body of prior work has shown that these distribution shifts can significantly degrade model performance in a variety of real-world ML applications: models can perform poorly out-of-distribution, despite achieving high in-distribution performance 3. To be able to reliably deploy ML models in the wild, we urgently need to develop methods for training models that are robust to real-world distribution shifts.
The WILDS benchmark
To facilitate the development of ML models that are robust to real-world distribution shifts, our ICML 2021 paper presents WILDS, a curated benchmark of 10 datasets that reflect natural distribution shifts arising from different cameras, hospitals, molecular scaffolds, experiments, demographics, countries, time periods, users, and codebases.
The WILDS datasets cover two common types of distribution shifts: domain generalization and subpopulation shift.
In domain generalization, the training and test distributions comprise data from related but distinct domains. The figure shows an example from the OGB-MolPCBA dataset 4 in WILDS, where the task is to predict the biochemical properties of molecules, and the goal is to generalize to molecules with different molecular scaffolds that have not been seen in the training set.
In subpopulation shift, we consider test distributions that are subpopulations of the training distribution, and seek to perform well even on the worst-case subpopulation. As an example, consider the CivilComments-WILDS dataset 5, where the task is toxicity classification on online text comments. Standard models perform well on average but poorly on comments that mention certain minority demographic groups (e.g., they might be likely to erroneously flag innocuous comments mentioning Black people as toxic), and we seek to train models that can perform equally well on comments that correspond to different demographic subpopulations.
Finally, some datasets exhibit both types of distribution shifts. For example, the second example in the figure above is from the FMoW-WILDS dataset 6, where there is both a domain generalization problem over time (the training set consists of satellite images taken before 2013, while the test images were taken after 2016) as well as a subpopulation shift problem over different geographical regions (we seek to do well over all regions).
Selection criteria for WILDS datasets
WILDS builds on extensive data collection efforts by domain experts working on applying ML methods in their application areas, and who are often forced to grapple with distribution shifts to make progress in their applications. To design WILDS, we worked with these experts to identify, select, and adapt datasets that fulfilled the following criteria:
Real-world relevance. The training/test splits and evaluation metrics are motivated by real-world scenarios and chosen in conjunction with domain experts. By focusing on realistic distribution shifts, WILDS complements existing distribution shift benchmarks, which have largely studied shifts that are cleanly characterized but are not likely to arise in real-world deployments. For example, many recent papers have studied datasets with shifts induced by synthetic transformations, such as changing the color of MNIST digits 7. Though these are important testbeds for systematic studies, model robustness need not transfer across shifts—e.g., a method that improves robustness on a standard vision dataset can consistently harm robustness on real-world satellite imagery datasets 8. So, in order to evaluate and develop methods for real-world distribution shifts, benchmarks like WILDS that capture shifts in the wild serve as an important complement to more synthetic benchmarks.
Distribution shifts with large performance gaps. The train/test splits reflect shifts that substantially degrade model performance, i.e., with a large gap between in-distribution and out-of-distribution performance. Measuring the in-distribution versus out-of-distribution gap is an important but subtle problem, as it relies on carefully constructing an appropriate in-distribution setting. We discuss its complexities and our approach in more detail in the paper.
Apart from the 10 datasets in WILDS, we also survey distribution shifts that occur in other application areas—algorithmic fairness and policing, medicine and healthcare, genomics, natural language and speech processing, education, and robotics—and discuss examples of datasets from these areas that we considered but did not include in WILDS. We investigated datasets in autonomous driving, fairness in policing, and computational biology, but either did not observe substantial performance drops or found that performance disparities arose from factors beyond distribution shifts.
To make it easy to work with WILDS and to enable systematic comparisons between approaches, we developed an open-source Python package that fully automates data loading and evaluation. This package also contains default models and hyperparameters that can easily reproduce all of the baseline numbers we have in our paper. The package is simple to install—just run pip install wilds—and straightforward to use with any PyTorch-based algorithms and models:
We are also hosting a public leaderboard at https://wilds.stanford.edu/leaderboard/ to track the state of the art in algorithms for learning robust models. In our paper, we benchmarked several existing algorithms for learning robust models, but found that they did not consistently improve upon standard models trained with empirical risk minimization (i.e., minimizing the average loss). We thus believe that there is substantial room for developing algorithms and model architectures that can close the gaps between in-distribution and out-of-distribution performance on the WILDS datasets.
Just in the past few months, WILDS has been used to develop methods for domain generalization—such as Fish, which introduces an inter-domain gradient matching objective and is currently state-of-the-art on our leaderboard for several datasets 9, and a Model-Based Domain Generalization (MBDG) approach that uses generative modeling 10—as well as for subpopulation shift settings through environment inference 11 or a variant of distributionally robust optimization 12. WILDS has also been used to develop methods for out-of-distribution calibration 13, uncertainty measurement 14, gradual domain adaptation 15, and self-training 16.
Finally, it has also been used to study out-of-distribution selective classification 17, and to investigate the relationship between in-distribution and out-of-distribution generalization 18.
However, we have only just begun to scratch the surface of how we can train models that are robust to the distribution shifts that are unavoidable in real-world applications, and we’re excited to see what the ML research community will come up with. If you’re interested in trying WILDS out, please check out https://wilds.stanford.edu, and let us know if you have any questions or feedback.
We’ll be presenting WILDS at ICML at 6pm Pacific Time on Thursday, July 22, 2021, with the poster session from 9pm to 11pm Pacific Time on the same day. If you’d like to find out more, please drop by https://icml.cc/virtual/2021/poster/10117! (The link requires ICML registration.)
WILDS is a large collaborative effort by researchers from Stanford, UC Berkeley, Cornell, INRAE, the University of Saskatchewan, the University of Tokyo, Recursion, Caltech, and Microsoft Research. This blog post is based on the WILDS paper:
WILDS: A Benchmark of in-the-Wild Distribution Shifts. Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. ICML 2021.
We are grateful to the many people who generously volunteered their time and expertise to advise us on WILDS.
J. R. Zech, M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. In PLOS Medicine, 2018. ↩
S. Beery, G. V. Horn, and P. Perona. Recognition in terra incognita. In European Conference on Computer Vision (ECCV), pages 456–473, 2018. ↩
J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009. ↩
W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open Graph Benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2020. ↩
D. Borkan, L. Dixon, J. Sorensen, N. Thain, and L. Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In WWW, pages 491–500, 2019. ↩
G. Christie, N. Fendley, J. Wilson, and R. Mukherjee. Functional map of the world. In Computer Vision and Pattern Recognition (CVPR), 2018. ↩
B. Kim, H. Kim, K. Kim, S. Kim, and J. Kim, 2019. Learning not to learn: Training deep neural networks with biased data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9012-9020). ↩
S. M. Xie, A. Kumar, R. Jones, F. Khani, T. Ma, and P. Liang. In-N-Out: Pre-training and self-training using auxiliary information for out-of-distribution robustness. In International Conference on Learning Representations (ICLR), 2021. ↩
Y. Shi, J. Seely, P. H. Torr, N. Siddharth, A. Hannun, N. Usunier, and G. Synnaeve. Gradient Matching for Domain Generalization. arXiv preprint arXiv:2104.09937, 2021. ↩
A Robey, H. Hassani, and G. J. Pappas. Model-Based Robust Deep Learning. arXiv preprint arXiv:2005.10247, 2020. ↩
E. Creager, J. H. Jacobsen, and R. Zemel. Environment inference for invariant learning. In International Conference on Machine Learning, 2021. ↩
E. Liu, B. Haghgoo, A. Chen, A. Raghunathan, P. W. Koh, S. Sagawa, P. Liang, and C. Finn. Just Train Twice: Improving group robustness without training group information. In International Conference on Machine Learning (ICML), 2021. ↩
Y. Wald, A. Feder, D. Greenfeld, and U. Shalit. On calibration and out-of-domain generalization. arXiv preprint arXiv:2102.10395, 2021. ↩
E. Daxberger, A., Kristiadi, A., Immer, R., Eschenhagen, M., Bauer, and P. Hennig. Laplace Redux–Effortless Bayesian Deep Learning. arXiv preprint arXiv:2106.14806, 2021. ↩
S. Abnar, R. V. D. Berg, G. Ghiasi, M. Dehghani, N., Kalchbrenner, and H. Sedghi. Gradual Domain Adaptation in the Wild: When Intermediate Distributions are Absent. arXiv preprint arXiv:2106.06080, 2021. ↩
J. Chen, F. Liu, B. Avci, X. Wu, Y. Liang, and S. Jha. Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles. arXiv preprint arXiv:2106.15728, 2021. ↩
E. Jones, S. Sagawa, P. W. Koh, A. Kumar, and P. Liang. Selective classification can magnify disparities across groups. In International Conference on Learning Representations (ICLR), 2021. ↩
J. Miller, R. Taori, A. Raghunathan, S. Sagawa, P. W. Koh, V. Shankar, P. Liang, Y. Carmon, and L. Schmidt. Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International Conference on Machine Learning (ICML), 2021. ↩
The International Conference on Machine Learning (ICML) 2021 is being hosted virtually from July 18th – July 24th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!
List of Accepted Papers
Deep Reinforcement Learning amidst Continual Structured Non-Stationarity
Authors: Annie Xie, James Harrison, Chelsea Finn
Keywords: deep reinforcement learning, non-stationarity
Just Train Twice: Improving Group Robustness without Training Group Information
Authors: Evan Zheran Liu*, Behzad Haghgoo*, Annie S. Chen*, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, Chelsea Finn
Authors: Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang