SAIL and Stanford Robotics at ICRA 2020

SAIL and Stanford Robotics at ICRA 2020

The International Conference on Robotics and Automation (ICRA) 2020 is being hosted virtually from May 31 – Jun 4.
We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Design of a Roller-Based Dexterous Hand for Object Grasping and Within-Hand Manipulation

Authors: Shenli Yuan, Austin D. Epps, Jerome B. Nowak, J. Kenneth Salisbury


Award nominations: Best Paper, Best Student Paper, Best Paper Award in Robot Manipulation, Best Paper in Mechanisms and Design

Links: Paper | Video

Keywords: dexterous manipulation, grasping, grippers and other end-effectors

Distributed Multi-Target Tracking for Autonomous Vehicle Fleets

Authors: Ola Shorinwa, Javier Yu, Trevor Halsted, Alex Koufos, and Mac Schwager


Award nominations: Best Paper

Links: Paper | Video

Keywords: mulit-target tracking, distributed estimation, multi-robot systems

Efficient Large-Scale Multi-Drone Delivery Using Transit Networks

Authors: Shushman Choudhury, Kiril Solovey, Mykel J. Kochenderfer, Marco Pavone


Award nominations: Best Multi-Robot Systems Paper

Links: Paper | Video

Keywords: multi-robot, optimization, task allocation, route planning

Form2Fit: Learning Shape Priors for Generalizable Assembly from Disassembly

Authors: Kevin Zakka, Andy Zeng, Johnny Lee, Shuran Song


Award nominations: Best Automation Paper

Links: Paper | Blog Post | Video

Keywords: perception for grasping, assembly, robotics

Human Interface for Teleoperated Object Manipulation with a Soft Growing Robot

Authors: Fabio Stroppa, Ming Luo, Kyle Yoshida, Margaret M. Coad, Laura H. Blumenschein, and Allison M. Okamura


Award nominations: Best Human-Robot Interaction Paper

Links: Paper | Video

Keywords: soft robot, growing robot, manipulation, interface, teleoperation

6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints

Authors: Chen Wang, Roberto Martín-Martín, Danfei Xu, Jun Lv, Cewu Lu, Li Fei-Fei, Silvio Savarese, Yuke Zhu


Links: Paper | Blog Post | Video

Keywords: category-level 6d object pose tracking, unsupervised 3d keypoints

A Stretchable Capacitive Sensory Skin for Exploring Cluttered Environments

Authors: Alexander Gruebele, Jean-Philippe Roberge, Andrew Zerbe, Wilson Ruotolo, Tae Myung Huh, Mark R. Cutkosky


Links: Paper

Keywords: robot sensing systems , skin , wires , capacitance , grasping

Accurate Vision-based Manipulation through Contact Reasoning

Authors: Alina Kloss, Maria Bauza, Jiajun Wu, Joshua B. Tenenbaum, Alberto Rodriguez, Jeannette Bohg


Links: Paper | Video

Keywords: manipulation planning, contact modeling, perception for grasping and manipulation

Assistive Gym: A Physics Simulation Framework for Assistive Robotics

Authors: Zackory Erickson, Vamsee Gangaram, Ariel Kapusta, C. Karen Liu, and Charles C. Kemp


Links: Paper

Keywords: assistive robotics; physics simulation; reinforcement learning; physical human robot interaction

Controlling Assistive Robots with Learned Latent Actions

Authors: Dylan P. Losey, Krishnan Srinivasan, Ajay Mandlekar, Animesh Garg, Dorsa Sadigh


Links: Paper | Blog Post | Video

Keywords: human-robot interaction, assistive control

Distal Hyperextension is Handy: High Range of Motion in Cluttered Environments

Authors: Wilson Ruotolo, Rachel Thomasson, Joel Herrera, Alex Gruebele, Mark R. Cutkosky


Links: Paper | Video

Keywords: dexterous manipulation, grippers and other end-effectors, multifingered hands

Dynamically Reconfigurable Discrete Distributed Stiffness for Inflated Beam Robots

Authors: Brian H. Do, Valory Banashek, Allison M. Okamura


Links: Paper | Video

Keywords: soft robot materials and design; mechanism design; compliant joint/mechanism

Dynamically Reconfigurable Tactile Sensor for Robotic Manipulation

Authors: Tae Myung Huh, Hojung Choi, Simone Willcox, Stephanie Moon, Mark R. Cutkosky


Links: Paper | Video

Keywords: robot sensing systems , electrodes , force , force measurement , capacitance

Enhancing Game-Theoretic Autonomous Car Racing Using Control Barrier Functions

Authors: Gennaro Notomista, Mingyu Wang, Mac Schwager, Magnus Egerstedt


Links: Paper

Keywords: autonomous driving

Evaluation of Non-Collocated Force Feedback Driven by Signal-Independent Noise

Authors: Zonghe Chua, Allison Okamura, Darrel Deo


Links: Paper | Video

Keywords: haptics and haptic interfaces; prosthetics and exoskeletons; brain-machine interface

From Planes to Corners: Multi-Purpose Primitive Detection in Unorganized 3D Point Clouds

Authors: Christiane Sommer, Yumin Sun, Leonidas Guibas, Daniel Cremers, Tolga Birdal


Links: Paper | Code | Video

Keywords: plane detection, corner detection, orthogonal, 3d geometry, computer vision, point pair, slam

Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning

Authors: Michelle A. Lee, Carlos Florensa, Jonathan Tremblay, Nathan Ratliff, Animesh Garg, Fabio Ramos, Dieter Fox


Links: Paper | Video

Keywords: deep learning in robotics and automation, perception for grasping and manipulation, learning and adaptive systems

IRIS: Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data

Authors: Ajay Mandlekar, Fabio Ramos, Byron Boots, Silvio Savarese, Li Fei-Fei, Animesh Garg, Dieter Fox


Links: Paper | Video

Keywords: imitation learning, reinforcement learning, robotics

Interactive Gibson Benchmark: A Benchmark for Interactive Navigation in Cluttered Environments

Authors: Fei Xia, William B. Shen, Chengshu Li, Priya Kasimbeg, Micael Tchapmi, Alexander Toshev, Roberto Martín-Martín, Silvio Savarese


Links: Paper | Video

Keywords: visual navigation, deep learning in robotics, mobile manipulation

KETO: Learning Keypoint Representations for Tool Manipulation

Authors: Zengyi Qin, Kuan Fang, Yuke Zhu, Li Fei-Fei, Silvio Savarese


Links: Paper | Blog Post | Video

Keywords: manipulation, representation, keypoint, interaction, self-supervised learning

Learning Hierarchical Control for Robust In-Hand Manipulation

Authors: Tingguang Li, Krishnan Srinivasan, Max Qing-Hu Meng, Wenzhen Yuan, Jeannette Bohg


Links: Paper | Blog Post | Video

Keywords: in-hand manipulation, robotics, reinforcement learning, hierarchical

Learning Task-Oriented Grasping from Human Activity Datasets

Authors: Mia Kokic, Danica Kragic, Jeannette Bohg


Links: Paper

Keywords: perception, grasping

Learning a Control Policy for Fall Prevention on an Assistive Walking Device

Authors: Visak CV Kumar, Sehoon Ha, Gregory Sawicki, C. Karen Liu


Links: Paper

Keywords: assistive robotics; human motion modeling; physical human robot interaction; reinforcement learning

Learning an Action-Conditional Model for Haptic Texture Generation

Authors: Negin Heravi, Wenzhen Yuan, Allison M. Okamura, Jeannette Bohg


Links: Paper | Blog Post | Video

Keywords: haptics and haptic interfaces

Learning to Collaborate from Simulation for Robot-Assisted Dressing

Authors: Alexander Clegg, Zackory Erickson, Patrick Grady, Greg Turk, Charles C. Kemp, C. Karen Liu


Links: Paper

Keywords: assistive robotics; physical human robot interaction; reinforcement learning; physics simulation; cloth manipulation

Learning to Scaffold the Development of Robotic Manipulation Skills

Authors: Lin Shao, Toki Migimatsu, Jeannette Bohg


Links: Paper | Video

Keywords: learning and adaptive systems, deep learning in robotics and automation, intelligent and flexible manufacturing

Map-Predictive Motion Planning in Unknown Environments

Authors: Amine Elhafsi, Boris Ivanovic, Lucas Janson, Marco Pavone


Links: Paper

Keywords: motion planning deep learning robotics

Motion Reasoning for Goal-Based Imitation Learning

Authors: De-An Huang, Yu-Wei Chao, Chris Paxton, Xinke Deng, Li Fei-Fei, Juan Carlos Niebles, Animesh Garg, Dieter Fox


Links: Paper | Video

Keywords: imitation learning, goal inference

Object-Centric Task and Motion Planning in Dynamic Environments

Authors: Toki Migimatsu, Jeannette Bohg


Links: Paper | Blog Post | Video

Keywords: control of systems integrating logic, dynamics, and constraints

Optimal Sequential Task Assignment and Path Finding for Multi-Agent Robotic Assembly Planning

Authors: Kyle Brown, Oriana Peltzer, Martin Sehr, Mac Schwager, Mykel Kochenderfer


Links: Paper | Video

Keywords: multi robot systems, multi agent path finding, mixed integer programming, automated manufacturing, sequential task assignment

Refined Analysis of Asymptotically-Optimal Kinodynamic Planning in the State-Cost Space

Authors: Michal Kleinbort, Edgar Granados, Kiril Solovey, Riccardo Bonalli, Kostas E. Bekris, Dan Halperin


Links: Paper

Keywords: motion planning, sampling-based planning, rrt, optimality

Retraction of Soft Growing Robots without Buckling

Authors: Margaret M. Coad, Rachel P. Thomasson, Laura H. Blumenschein, Nathan S. Usevitch, Elliot W. Hawkes, and Allison M. Okamura


Links: Paper | Video

Keywords: soft robot materials and design; modeling, control, and learning for soft robots

Revisiting the Asymptotic Optimality of RRT*

Authors: Kiril Solovey, Lucas Janson, Edward Schmerling, Emilio Frazzoli, and Marco Pavone


Links: Paper | Video

Keywords: motion planning, rapidly-exploring random trees, rrt*, sampling-based planning

Sample Complexity of Probabilistic Roadmaps via Epsilon Nets

Authors: Matthew Tsao, Kiril Solovey, Marco Pavone


Links: Paper | Video

Keywords: motion planning, sampling-based planning, probabilistic roadmaps, epsilon nets

Self-Supervised Learning of State Estimation for Manipulating Deformable Linear Objects

Authors: Mengyuan Yan, Yilin Zhu, Ning Jin, Jeannette Bohg


Links: Paper | Video

Keywords: self-supervision, deformable objects

Spatial Scheduling of Informative Meetings for Multi-Agent Persistent Coverage

Authors: Ravi Haksar, Sebastian Trimpe, Mac Schwager


Links: Paper | Video

Keywords: distributed systems, multi-robot systems, multi-robot path planning

Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction

Authors: Bingbin Liu, Ehsan Adeli, Zhangjie Cao, Kuan-Hui Lee, Abhijeet Shenoi, Adrien Gaidon, Juan Carlos Niebles


Links: Paper | Video

Keywords: spatiotemporal graphs, forecasting, graph neural networks, autonomous-driving.

TRASS: Time Reversal as Self-Supervision

Authors: Suraj Nair, Mohammad Babaeizadeh, Chelsea Finn, Sergey Levine, Vikash Kumar


Links: Paper | Blog Post | Video

Keywords: visual planning; reinforcement learning; self-supervision

UniGrasp: Learning a Unified Model to Grasp with Multifingered Robotic Hands

Authors: Lin Shao, Fabio Ferreira, Mikael Jorda, Varun Nambiar, Jianlan Luo, Eugen Solowjow, Juan Aparicio Ojea, Oussama Khatib, Jeannette Bohg


Links: Paper | Video

Keywords: deep learning in robotics and automation; grasping; multifingered hands

Vine Robots: Design, Teleoperation, and Deployment for Navigation and Exploration

Authors: Margaret M. Coad, Laura H. Blumenschein, Sadie Cutler, Javier A. Reyna Zepeda, Nicholas D. Naclerio, Haitham El-Hussieny, Usman Mehmood, Jee-Hwan Ryu, Elliot W. Hawkes, and Allison M. Okamura


Links: Paper | Video

Keywords: soft robot applications; field robots

We look forward to seeing you at ICRA!

Read More

DADS: Unsupervised Reinforcement Learning for Skill Discovery

DADS: Unsupervised Reinforcement Learning for Skill Discovery

Posted by Archit Sharma, AI Resident, Google Research

Recent research has demonstrated that supervised reinforcement learning (RL) is capable of going beyond simulation scenarios to synthesize complex behaviors in the real world, such as grasping arbitrary objects or learning agile locomotion. However, the limitations of teaching an agent to perform complex behaviors using well-designed task-specific reward functions are also becoming apparent. Designing reward functions can require significant engineering effort, which becomes untenable for a large number of tasks. For many practical scenarios, designing a reward function can be complicated, for example, requiring additional instrumentation for the environment (e.g., sensors to detect the orientation of doors) or manual-labelling of “goal” states. Considering that the ability to generate complex behaviors is limited by this form of reward-engineering, unsupervised learning presents itself as an interesting direction for RL.

In supervised RL, the extrinsic reward function from the environment guides the agent towards the desired behaviors, reinforcing the actions which bring the desired changes in the environment. With unsupervised RL, the agent uses an intrinsic reward function (such as curiosity to try different things in the environment) to generate its own training signals to acquire a broad set of task-agnostic behaviors. The intrinsic reward functions can bypass the problems of the engineering extrinsic reward functions, while being generic and broadly applicable to several agents and problems without any additional design. While much research has recently focused on different approaches to unsupervised reinforcement learning, it is still a severely under-constrained problem — without the guidance of rewards from the environment, it can be hard to learn behaviors which will be useful. Are there meaningful properties of the agent-environment interaction that can help discover better behaviors (“skills”) for the agents?

In this post, we present two recent publications that develop novel unsupervised RL methods for skill discovery. In “Dynamics-Aware Unsupervised Discovery of Skills” (DADS), we introduce the notion of “predictability” to the optimization objective for unsupervised learning. In this work we posit that a fundamental attribute of skills is that they bring about a predictable change in the environment. We capture this idea in our unsupervised skill discovery algorithm, and show applicability in a broad range of simulated robotic setups. In our follow-up work “Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning”, we improve the sample-efficiency of DADS to demonstrate that unsupervised skill discovery is feasible in the real world.

The behavior on the left is random and unpredictable, while the behavior on the right demonstrates systematic motion with predictable changes in the environment. Our goal is to learn potentially useful behaviors such as those on the right, without engineered reward functions.

Overview of DADS
DADS designs an intrinsic reward function that encourages discovery of “predictable” and “diverse” skills. The intrinsic reward function is high if (a) the changes in the environment are different for different skills (encouraging diversity) and (b) changes in the environment for a given skill are predictable (predictability). Since DADS does not obtain any rewards from the environment, optimizing the skills to be diverse enables the agent to capture as many potentially useful behaviors as possible.

In order to determine if a skill is predictable, we train another neural network, called the skill-dynamics network, to predict the changes in the environment state when given the current state and the skill being executed. The better the skill-dynamics network can predict the change of state in the environment, the more “predictable” the skill is. The intrinsic reward defined by DADS can be maximized using any conventional reinforcement learning algorithm.

An overview of DADS.

The algorithm enables several different agents to discover predictable skills purely from reward-free interaction with the environment. DADS, unlike prior work, can scale to high-dimensional continuous control environments such as Humanoid, a simulated bipedal robot. Since DADS is environment agnostic, it can be applied to both locomotion and manipulation oriented environments. We show some of the skills discovered by different continuous control agents.

Ant discovers galloping (top left) and skipping (bottom left), Humanoid discovers different locomotive gaits (middle, sped up 2x), and D’Claw from ROBEL (right) discovers different ways to rotate an object, all using DADS. More sample videos are available here.

Model-Based Control Using Skill-Dynamics
Not only does DADS enable the discovery of predictable and potentially useful skills, it allows for an efficient approach to apply the learned skills to downstream tasks. We can leverage the learned skill-dynamics to predict the state-transitions for each skill. The predicted state-transitions can be chained together to simulate the complete trajectory of states for any learned skill without executing it in the environment. Therefore, we can simulate the trajectory for different skills and choose the skill which gets the highest reward for the given task. The model-based planning approach described here can be very sample-efficient as no additional training is required for the skills. This is a significant step up from the prior approaches, which require additional training on the environment to combine the learned skills.

Using the skills discovered by the agents, we can traverse an arbitrary sequence of checkpoints without any additional training. The plot on the right follows the agent’s traversal from one checkpoint to another.

Real-World Results
The demonstration of unsupervised learning in real-world robotics has been fairly limited, with results being restricted to simulation environments. In “Emergent Real-World Robotic Skills via Unsupervised Off-Policy Reinforcement Learning”, we develop a sample-efficient version of our earlier algorithm, called off-DADS, through algorithmic and systematic improvements in an off-policy learning setup. Off-policy learning enables the use of data collected from different policies to improve the current policy. In particular, reusing the previously collected data can dramatically improve the sample-efficiency of reinforcement learning algorithms. Leveraging the improvement from off-policy learning, we train D’Kitty (a quadruped from ROBEL) in the real-world starting from random policy initialization without any rewards from the environment or hand-crafted exploration strategies. We observe the emergence of complex behaviors with diverse gaits and directions by optimizing the intrinsic reward defined by DADS.

Using off-DADS, we train D’Kitty from ROBEL to acquire diverse locomotion behaviors, which can then be used for goal-navigation through model-based control.

Future Work
We have contributed a novel unsupervised skill discovery algorithm with broad applicability that is feasible to be executed in the real-world. This work provides a foundation for future work, where robots can solve a broad range of tasks with minimal human effort. One possibility is to study the relationship between the state-representation and the skills discovered by DADS in order to learn a state-representation that encourages discovery of skills for a known distribution of downstream tasks. Another interesting direction for exploration is provided by the formulation of skill-dynamics that separates high-level planning and low-level control, and study its general applicability to reinforcement learning problems.

We would like to thank our coauthors, Michael Ahn, Sergey Levine, Vikash Kumar, Shixiang Gu and Karol Hausman. We would also like to acknowledge the support and feedback provided by various members of the Google Brain team and the Robotics at Google team.

TensorFlow User Groups: Updates from Around the World

TensorFlow User Groups: Updates from Around the World

Posted by Soonson Kwon, Biswajeet Mallik, and Siddhant Agarwal, Program Managers

TensorFlow User Groups (or TFUGs, for short) are a community of curious, passionate machine learning developers and researchers around the world. TFUGs play an important role in helping developers share their knowledge and experience in machine learning, and the latest TensorFlow updates. Google and the TensorFlow team are proud of (and grateful for) the many TFUGs around the world, and we’re excited to see them grow and become active developer communities.

Currently, there are more than 75 TFUGs around the globe, on 6 continents, with events in more than 15 languages, engaged in many creative ways to bring developers together. In this article, we wanted to share some global updates from around the world, and information on how you can get involved. If you would like to start a TFUG, please check this page or email us.

Here are a few examples of the many activities they are running around the world.


From March 23th to April 3rd, TensorFlow User Group Mumbai hosted “10 days of ML challenge” to help developers learn about Machine Learning. (Check out this blog from a participant as well.)
TensorFlow User Group Kolkata organized TweetsOnTF a fun twitter contest from March 27th – April 17th to celebrate TensorFlow Dev Summit 2020.
TensorFlow User Group Ahmedabad conducted their first event around Machine Learning & Data Science in Industry & Research with over 100+ students and developers.


TensorFlow User Group Ibadan has been running a monthly meetup. On May 14th, they hosted an online meetup about running your models on browsers with JavaScript.

(Photo from Dec, 2019)

Mainland China

TensorFlow User Group Shanghai, TensorFlow User Group Zhuhai, and many China TFUGs hosted a TensorFlow Dev Summit 2020 viewing party.


TensorFlow User Group Turkey has been hosting an online event series. On May 17th, they hosted a session called: “NLP and Its Applications in Healthcare”. (YouTube Channel)


On May 20th, TensorFlow User Group Tokyo hosted a ”Reading Neural Network Paper meetup” with 110 researchers via online covering Explainable AI.


On May 14th, TensorFlow User Group Korea hosted an online interview with Laurence Moroney celebrating its 50K members.


On May 30th, TensorFlow User Group Melbourne will host TensorFlow.js Show & Tell to share latest creations on Machine Learning in JavaScript together with Jason Mayes, TF.js Developer Advocate.


TensorFlow User Group Vietnam organized a Webinar led by Ba Ngoc (Machine Learning GDE) and Khanh (TensorFlow Developer Advocate) on how to prepare for the recently announced TensorFlow Certification.


Also, welcome to our newest TensorFlow User Group in Casablanca (Twitter, Facebook), newly created and in the process of ramping up.

How to get involved

Those are just a few of the many of the activities TFUG are having around the world. If you would like to start a TFUG in your region, please visit this page. To find a user group near you, check out this list. And, if you have any questions regarding TFUG, email us. Thank you!
Read More

Responding to the European Commission’s AI white paper

In January, our CEO Sundar Pichai visited Brussels to talk about artificial intelligence and how Google could help people and businesses succeed in the digital age through partnership. Much has changed since then due to COVID-19, but one thing hasn’t—our commitment to the potential of partnership with Europe on AI, especially to tackle the pandemic and help people and the economy recover. 

As part of that effort, we earlier today filed our response to the European Commission’s Consultation on Artificial Intelligence, giving our feedback on the Commission’s initial proposal for how to regulate and accelerate the adoption of AI. 

Excellence, skills, trust

Our filing applauds the Commission’s focus on building out the European “ecosystem of excellence.” European universities already boast renowned leaders in dozens of areas of AI research—Google partners with some of them via our machine learning research hubs in Zurich, Amsterdam, Berlin, Paris and London—and many of their students go on to make important contributions to European businesses.  

We support the Commission’s plans to help businesses develop the AI skills they need to thrive in the new digital economy. Next month, we’ll contribute to those efforts by extending our machine learning check-up tool to 11 European countries to help small businesses implement AI and grow their businesses. Google Cloud already works closely with scores of businesses across Europe to help them innovate using AI.  

We also support the Commission’s goal of building a framework for AI innovation that will create trust and guide ethical development and use of this widely applicable technology. We appreciate the Commission’s proportionate, risk-based approach. It’s important that AI applications in sensitive fields—such as medicine or transportation—are held to the appropriate standards. 

Based on our experience working with AI, we also offered a couple of suggestions for making future regulation more effective. We want to be a helpful and engaged partner to policymakers, and we have provided more details in our formal response to the consultation.

Definition of high-risk AI applications

AI has a broad range of current and future applications, including some that involve significant benefits and risks.  We think any future regulation would benefit from a more carefully nuanced definition of “high-risk” applications of AI. We agree that some uses warrant extra scrutiny and safeguards to address genuine and complex challenges around safety, fairness, explainability, accountability, and human interactions. 

Assessment of AI applications

When thinking about how to assess high-risk AI applications, it’s important to strike a balance. While AI won’t always be perfect, it has great potential to help us improve over the performance of existing systems and processes. But the development process for AI must give people confidence that the AI system they’re using is reliable and safe. That’s especially true for applications like new medical diagnostic techniques, which potentially allow skilled medical practitioners to offer more accurate diagnoses, earlier interventions, and better patient outcomes. But the requirements need to be proportionate to the risk, and shouldn’t unduly limit innovation, adoption, and impact. 

This is not an easy needle to thread. The Commission’s proposal suggests “ex ante” assessment of AI applications (i.e., upfront assessment, based on forecasted rather than actual use cases). Our contribution recommends having established due diligence and regulatory review processes expand to include the assessment of AI applications. This would avoid unnecessary duplication of efforts and likely speed up implementation.

For the (probably) rare instances when high-risk applications of AI are not obviously covered by existing regulations, we would encourage clear guidance on the “due diligence” criteria companies should use in their development processes. This would enable robust upfront self-assessment and documentation of any risks and their mitigations, and could also include further scrutiny after launch.

This approach would give European citizens confidence about the trustworthiness of AI applications, while also fostering innovation across the region. And it would encourage companies—especially smaller ones—to launch a range of valuable new services. 

Principles and process

Responsible development of AI presents new challenges and critical questions for all of us. In 2018 we published our own AI Principles to help guide our ethical development and use of AI, and also established internal review processes to help us avoid bias, test rigorously for safety, design with privacy top of mind.  Our principles also specify areas where we will not design or deploy AI, such as to support mass surveillance or violate human rights. Look out for an update on our work around these principles in the coming weeks. 

AI is an important part of Google’s business and our aspirations for the future. We share a common goal with policymakers—a desire to build trust in AI through responsible innovation and thoughtful regulation, so that European citizens can safely enjoy the full social and economic benefits of AI. We hope that our contribution to the consultation is useful, and we look forward to participating in the discussion in coming months.

Read More

Federated Analytics: Collaborative Data Science without Data Collection

Federated Analytics: Collaborative Data Science without Data Collection

Posted by Daniel Ramage, Research Scientist and Stefano Mazzocchi, Software Engineer, Google Research

Federated learning, introduced in 2017, enables developers to train machine learning (ML) models across many devices without centralized data collection, ensuring that only the user has a copy of their data, and is used to power experiences like suggesting next words and expressions in Gboard for Android and improving the quality of smart replies in Android Messages. Following the success of these applications, there is a growing interest in using federated technologies to answer more basic questions about decentralized data — like computing counts or rates — that often don’t involve ML at all. Analyzing user behavior through these techniques can lead to better products, but it is essential to ensure that the underlying data remains private and secure.

Today we’re introducing federated analytics, the practice of applying data science methods to the analysis of raw data that is stored locally on users’ devices. Like federated learning, it works by running local computations over each device’s data, and only making the aggregated results — and never any data from a particular device — available to product engineers. Unlike federated learning, however, federated analytics aims to support basic data science needs. This post describes the basic methodologies of federated analytics that were developed in the pursuit of federated learning, how we extended those insights into new domains, and how recent advances in federated technologies enable better accuracy and privacy for a growing range of data science needs.

Origin of Federated Analytics
The first exploration into federated analytics was in support of federated learning: how can engineers measure the quality of federated learning models against real-world data when that data is not available in a data center? The answer was to re-use the federated learning infrastructure but without the learning part. In federated learning, the model definition can include not only the loss function that is to be optimized, but also code to compute metrics that indicate the quality of the model’s predictions. We could use this code to directly evaluate model quality on phones’ data.

As an example, Gboard engineers measured the overall quality of next word prediction models against raw typing data held on users’ phones. Participating phones downloaded a candidate model, locally computed a metric of how well the model’s predictions matched the words that were actually typed, and then uploaded the metric without any adjustment to the model’s weights or any change to the Gboard typing experience. By averaging the metrics uploaded by many phones, engineers learned a population-level summary of model performance. The technique also easily extended to estimate basic statistics like dataset sizes.

Federated Analytics for Song Recognition Measurement
Beyond model evaluation, federated analytics is used to support the Now Playing feature on Google’s Pixel phones, a tool that shows you what song is playing in the room around you. Under the hood, Now Playing uses an on-device database of song fingerprints to identify music playing near the phone without the need for a network connection. The architecture is good for privacy and for users — it is fast, works offline, and no raw or processed audio data leaves the phone. Because every phone in a region receives the same database, and only songs in the database can be recognized, it’s important for the database to hold the right songs.

To measure and improve each regional database quality, engineers needed to answer a basic question: which of its songs are most often recognized? Federated analytics provides an answer without revealing which songs are heard by any individual phone. It is enabled for users who agreed to send device related usage and diagnostics information to Google.

When Now Playing recognizes a song, it records the track name into the on-device Now Playing history, where users can see recently recognized songs and add them to a music app’s playlist. Later, when the phone is idle, plugged in, and connected to WiFi, Google’s federated learning and analytics server may invite the phone to join a “round” of federated analytics computation, along with several hundred other phones. Each phone in the round computes the recognition rate for the songs in its Now Playing History, and uses the secure aggregation protocol to encrypt the results. The encrypted rates are sent to the federated analytics server, which does not have the keys to decrypt them individually. But when combined with the encrypted counts from the other phones in the round, the final tally of all song counts (and nothing else) can be decrypted by the server.

The result enables Google engineers to improve the song database (for example, by making sure the database contains truly popular songs), without any phone revealing which songs were heard. In its first improvement iteration, this resulted in a 5% increase in overall song recognition across all Pixel phones globally.

Protecting Federated Analytics with Secure Aggregation
Secure aggregation can enable stronger privacy properties for federated analytics applications. For intuition about the secure aggregation protocol, consider a simpler version of the song recognition measurement problem. Let’s say that Rakshita wants to know how often her friends Emily and Zheng have listened to a particular song. Emily has heard it SEmily times and Zheng SZheng times, but neither is comfortable sharing their counts with Rakshita or each other. Instead, the trio could perform a secure aggregation: Emily and Zheng meet to decide on a random number M, which they keep secret from Rakshita. Emily reveals to Rakshita the sum SEmily + M, while Zheng reveals the difference SZheng M. Rakshita sees two numbers that are effectively random (they are masked by M), but she can add them together (SEmily + M) + (SZheng M) = SEmily + SZheng to reveal the total number of times that the song was heard by both Emily and Zheng.

The privacy properties of this approach can be strengthened by summing over more people or by adding small random values to the counts (e.g. in support of differential privacy). For Now Playing, song recognition rates from hundreds of devices are summed together, before the result is revealed to the engineers.

An illustration of the secure aggregation protocol, from the federated learning comic book.

Toward Learning and Analytics with Greater Privacy
The methods of federated analytics are an active area of research and already go beyond analyzing metrics and counts. Sometimes, training ML models with federated learning can be used for obtaining aggregate insights about on-device data, without any of the raw data leaving the devices. For example, Gboard engineers wanted to discover new words commonly typed by users and add them to dictionaries used for spell-checking and typing suggestions, all without being able to see any words that users typed. They did it by training a character-level recurrent neural network on phones, using only the words typed on these phones that were not already in the global dictionary. No typed words ever left the phones, but the resulting model could then be used in the datacenter to generate samples of frequently typed character sequences – the new words!

We are also developing techniques for answering even more ambiguous questions on decentralized datasets like “what patterns in the data are difficult for my model to recognize?” by training federated generative models. And we’re exploring ways to apply user-level differentially private model training to further ensure that these models do not encode information unique to any one user.

Google’s commitment to our privacy principles means pushing the state of the art in safeguarding user data, be it through differential privacy in the data center or advances in privacy during data collection. Google’s earliest system for decentralized data analysis, RAPPOR, was introduced in 2014, and we’ve learned a lot about making effective decisions even with a great deal of noise (often introduced for local differential privacy) since. Federated analytics continues this line of work.

It’s still early days for the federated analytics approach and more progress is needed to answer many common data science questions with good accuracy. The recent Advances and Open Problems in Federated Learning paper offers a comprehensive survey of federated research, while Federated Heavy Hitters Discovery with Differential Privacy introduces a federated analytics method for the discovery of most frequent items in the dataset. Federated analytics enables us to think about data science differently, with decentralized data and privacy-preserving aggregation in a central role. We welcome new contributions and extensions in this emerging field.

This post reflects the work of many people, including Blaise Agüera y Arcas, Galen Andrew, Sean Augenstein, Françoise Beaufays, Kallista Bonawitz, Mingqing Chen, Hubert Eichner, Úlfar Erlingsson, Christian Frank, Anna Goralska, Marco Gruteser, Alex Ingerman, Vladimir Ivanov, Peter Kairouz, Chloé Kiddon, Ben Kreuter, Alison Lentz, Wei Li, Xu Liu, Antonio Marcedone, Rajiv Mathews, Brendan McMahan, Tom Ouyang, Sarvar Patel, Swaroop Ramaswamy, Aaron Segal, Karn Seth, Haicheng Sun, Timon Van Overveldt, Sergei Vassilvitskii, Scott Wegner, Yuanbo Zhang, Li Zhang, and Wennan Zhu.

Pose Animator - An open source tool to bring SVG characters to life in the browser via motion capture

Pose Animator – An open source tool to bring SVG characters to life in the browser via motion capture

By Shan Huang, Creative Technologist, Google Partner Innovation


The PoseNet and Facemesh (from Mediapipe) TensorFlow.js models made real time human perception in the browser possible through a simple webcam. As an animation enthusiast who struggles to master the complex art of character animation, I saw hope and was really excited to experiment using these models for interactive, body-controlled animation.

The result is Pose Animator, an open-source web animation tool that brings SVG characters to life with body detection results from webcam. This blog post covers the technical design of Pose Animator, as well as the steps for designers to create and animate their own characters.

Using FaceMesh and PoseNet with TensorFlow.js to animate full body character

The overall idea of Pose Animator is to take a 2D vector illustration and update its containing curves in real-time based on the recognition result from PoseNet and FaceMesh. To achieve this, Pose Animator borrows the idea of skeleton-based animation from computer graphics and applies it to vector characters.
In skeletal animation a character is represented in two parts:

  • a surface used to draw the character, and
  • a hierarchical set of interconnected bones used to animate the surface.

In Pose Animator, the surface is defined by the 2D vector paths in the input SVG files. For the bone structure, Pose Animator provides a predefined rig (bone hierarchy) representation, based on the key points from PoseNet and FaceMesh. This bone structure’s initial pose is specified in the input SVG file, along with the character illustration, while the real time bone positions are updated by the recognition result from ML models.

Detection keypoints from PoseNet (blue) and FaceMesh (red)

Check out these steps to create your own SVG character for Pose Animator.

Animated bezier curves controlled by PoseNet and FaceMesh output/td>

Rigging Flow Overview

The full rigging (skeleton binding) flow requires the following steps:

  • Parse the input SVG file for the vector illustration and the predefined skeleton, both of which are in T-pose (initial pose).
  • Iterate through every segment in vector paths to compute the weight influence and transformation from each bone using Linear Blend Skinning (explained later in this post).
  • In real time, run FaceMesh and PoseNet on each input frame and use result keypoints to update the bone positions.
  • Compute new positions of vector segments from the updated bone positions, bone weights and transformations.

There are other tools that provide similar puppeteering functionality, however, most of them only update asset bounding boxes and do not deform the actual geometry of characters with recognition key points. Also, few tools provide full body recognition and animation. By deforming individual curves, Pose Animator is good at capturing the nuances of facial and full body movement and hopefully provides more expressive animation.

Rig Definition

The rig structure is designed according to the output key points from PoseNet and FaceMesh. PoseNet returns 17 key points for the full body, which is simple enough to directly include in the rig. FaceMesh however provides 486 keypoints, so I needed to be more selective about which ones to include. In the end I selected 73 key points from the FaceMesh output and together we have a full body rig of 90 keypoints and 78 bones as shown below:

The 90 keypoints, 78 bones full body rig

Every input SVG file is expected to contain this skeleton in default position. More specifically, Pose Animator will look for a group called ‘skeleton’ containing anchor elements named with the respective joint they represent. A sample rig SVG can be found here. Designers have the freedom to move the joints around in their design files to best embed the rig into the character. Pose Animator will compute skinning according to the default position in the SVG file, although extreme cases (e.g. very short leg / arm bones) may not be well supported by the rigging algorithm and may produce unnatural results.

The illustration with embedded rig in design software (Adobe Illustrator)

Linear Blend Skinning for vector paths

Pose Animator uses one of the most common rigging algorithms for deforming surfaces using skeletal structures – Linear Blend Skinning (LBS), which transforms a vertex on a surface by blending together its transformation controlled by each bone alone, weighted by each bone’s influence. In our case, a vertex refers to an anchor point on a vector path, and bones are defined by two connected keypoints in the above rig (e.g. the ‘leftWrist’ and ‘leftElbow’ keypoints define the bone ‘leftWrist-leftElbow’).
To put into math formula, the world space position of the vertex vi’ is computed as where
– wi is the influence of bone i on vertex i,
– vi describes vertex i’s initial position,
– Tj describes the spatial transformation that aligns the initial pose of bone j with its current pose.
The influence of bones can be automatically generated or manually assigned through weight painting. Pose Animator currently only supports auto weight assignment. The raw influence of bone j on vertex i is calculated as: Where d is the distance from vi to the nearest point on bone j. Finally we normalize the weight of all bones for a vertex to sum up to 1. Now, to apply LBS on 2D vector paths, which are composed of straight lines and bezier curves, we need some special treatment for bezier curve segments with in and out handles. We need to compute weights separately for curve points, in control point, and out control point. This produces better looking results because the bone influence for control points are more accurately captured.
There is one exception case. When the in control point, curve point, and out control point are collinear, we use the curve point weight for all three points to guarantee that they stay collinear when animated. This helps to preserve the smoothness of curves.

Collinear curve handles share the same weight to stay collinear

Motion stabilization

While LBS already gives us animated frames, there’s a noticeable amount of jittering introduced by FaceMesh and PoseNet raw output. To reduce the jitter and get smoother animation, we can use the confidence scores from prediction results to weigh each input frame unevenly, granting less influence to low-confidence frames.
Following this idea, Pose Animator computes the smoothed position of joint i at frame t as where
The smoothed confidence score of frame i is computed as Consider extreme cases. When two consecutive frames both have confidence score 1, position approaches the latest position at 50% speed, which looks responsive and reasonably smooth. (To further play with responsiveness, you can tweak the approach speed by changing the weight on the latest frame.) When the latest frame has confidence score 0, its influence is completely ignored, preventing low confidence results from introducing sudden jerkiness.

Confidence score based clipping

In addition to interpolating joint positions with confidence scores, we also introduce a minimum threshold to decide if a path should be rendered at all.
The confidence score of a path is the averaged confidence score of its segment points, which in turn is the weighted average of the influence bones’ scores. The whole path is hidden for a particular frame when its score is below a certain threshold.
This is useful for hiding paths in low confidence areas, which are often body parts out of the camera view. Imagine an upper body shot: PoseNet will always return keypoint predictions for legs and hips though they will have low confidence scores. With this clamping mechanism we can make sure lower body parts are properly hidden instead of showing up as strangely distorted paths.

Looking ahead

To mesh or not to mesh

The current rigging algorithm is heavily centered around 2D curves. This is because the 2D rig constructed from PoseNet and FaceMesh has a large range of motion and varying bone lengths – unlike animation in games where bones have relatively fixed length. I currently get smoother results from deforming bezier curves than deforming the triangulated mesh from input paths, because bezier curves preserve the curvature / straightness of input lines better.
I am keen to improve the rigging algorithm for meshes. Besides, I want to explore a more advanced rigging algorithm than Linear Blend Skinning, which has limitations such as volume thinning around the bent areas.

New editing features

Pose Animator delegates illustration editing to design softwares like Illustrator, which are powerful for editing vector graphics, but not tailored for animation / skinning requirements. I want to support more animation features through in-browser UI, including:

  • Skinning weight painting tool, to enable tweaking individual weights on keypoints manually. This will provide more precision than auto weight assignment.
  • Support raster images in the input SVG files, so artists may use photos / drawings in their design. Image bounding boxes can be easily represented as vector paths so it’s straightforward to compute its deformation using the current rigging algorithm.

Try it yourself!

Try out the live demos, where you can either play with existing characters, or add in your own SVG character and see them come to life.
I’m the most excited to see what kind of interactive animation the creative community will create. While the demos are human characters, Pose Animator will work for any 2D vector design, so you can go as abstract / avant-garde as you want to push its limits.
To create your own animatable illustration, please check out this guide! Don’t forget to share your creations with us using #PoseAnimator on social media. Feel free to reach out to me on twitter @yemount for any questions.
Alternatively if you want to view the source code directly, it is available to fork on github here. Happy hacking!
Read More

Evaluating Natural Language Generation with BLEURT

Evaluating Natural Language Generation with BLEURT

Posted by Thibault Sellam, Software Engineer, and Ankur P. Parikh, Research Scientist, Google Research

In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods calls for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on Github.

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.

An example questionnaire used for human evaluation in machine translation.

The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).

The advantages and weaknesses of automatic metrics are the opposite of those that come with human evaluation. Automatic metrics are convenient — they can be computed in real-time throughout the training process (e.g., for plotting with Tensorboard). However, they are often inaccurate due to their focus on surface-level similarities and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward those that resemble the reference in their surface form, even if they do not accurately capture meaning, and penalize other paraphrases.

BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.

Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.

Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.

Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT’s robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. The accuracy is likely to drop in the presence of domain drift, i.e., when the text used comes from a different domain than the training sentence pairs. It may also drop when there is a quality drift, when the ratings to be predicted are higher than those used during training — a feature which would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.

BLEURT’s data generation process combines random perturbations and scoring with pre-existing metrics and models.

Experiments reveal that pre-training significantly increases BLEURT’s accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both.The following figure illustrates BLEURT’s training procedure end-to-end.

We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.

Correlation between different metrics and human ratings on the WMT’19 Metrics Shared Task.

As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for the research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In the use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative, and only give an incomplete picture of NLG systems’ performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT’s representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.