Taking some of the guesswork out of drug discovery

In their quest to discover effective new medicines, scientists search for drug-like molecules that can attach to disease-causing proteins and change their functionality. It is crucial that they know the 3D shape of a molecule to understand how it will attach to specific surfaces of the protein.

But a single molecule can fold in thousands of different ways, so solving that puzzle experimentally is a time-consuming and expensive process akin to searching for a needle in a molecular haystack.

MIT researchers are using machine learning to streamline this complex task. They have created a deep learning model that predicts the 3D shapes of a molecule based solely on the 2D graph of its molecular structure, the form in which molecules are typically represented.

Their system, GeoMol, processes molecules in only seconds and performs better than other machine learning models, including some commercial methods. GeoMol could help pharmaceutical companies accelerate the drug discovery process by narrowing down the number of molecules they need to test in lab experiments, says Octavian-Eugen Ganea, a postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.

“When you are thinking about how these structures move in 3D space, there are really only certain parts of the molecule that are actually flexible, these rotatable bonds. One of the key innovations of our work is that we think about modeling the conformational flexibility like a chemical engineer would. It is really about trying to predict the potential distribution of rotatable bonds in the structure,” says Lagnajit Pattanaik, a graduate student in the Department of Chemical Engineering and co-lead author of the paper.

Other authors include Connor W. Coley, the Henri Slezynger Career Development Assistant Professor of Chemical Engineering; Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health in CSAIL; Klavs F. Jensen, the Warren K. Lewis Professor of Chemical Engineering; William H. Green, the Hoyt C. Hottel Professor in Chemical Engineering; and senior author Tommi S. Jaakkola, the Thomas Siebel Professor of Electrical Engineering in CSAIL and a member of the Institute for Data, Systems, and Society. The research will be presented this week at the Conference on Neural Information Processing Systems.

Mapping a molecule

In a molecular graph, a molecule’s individual atoms are represented as nodes and the chemical bonds that connect them are edges. 
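For illustration, here is a minimal sketch (not GeoMol's actual input pipeline) of how a small molecule such as ethanol's heavy-atom skeleton can be encoded as a graph, with atoms as nodes and bonds as edges:

```python
# Illustrative sketch: ethanol's heavy atoms (C-C-O) as a tiny graph.
atoms = ["C", "C", "O"]          # node index = atom position
bonds = [(0, 1), (1, 2)]         # single bonds: C-C and C-O

# Adjacency list: for each atom, the atoms it is bonded to.
adjacency = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

print(adjacency)  # {0: [1], 1: [0, 2], 2: [1]}
```

Real toolkits additionally attach features (element, charge, bond order) to each node and edge, but the node-and-edge skeleton is the same.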

GeoMol leverages a recent tool in deep learning called a message passing neural network, which is specifically designed to operate on graphs. The researchers adapted a message passing neural network to predict specific elements of molecular geometry.
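The core idea of message passing can be sketched in a few lines; the weights and update rule below are hypothetical stand-ins, and GeoMol's actual architecture differs, but each node updating its feature vector from its neighbors' features is the essential mechanic:

```python
import numpy as np

# One message-passing round on a 3-node path graph (hypothetical weights).
rng = np.random.default_rng(0)
n_nodes, dim = 3, 4
h = rng.normal(size=(n_nodes, dim))            # initial node features
edges = [(0, 1), (1, 2)]                       # undirected edges
W_self = rng.normal(size=(dim, dim))           # transform for a node itself
W_nbr = rng.normal(size=(dim, dim))            # transform for its neighbors

msgs = np.zeros_like(h)
for a, b in edges:                             # sum messages from neighbors
    msgs[a] += h[b]
    msgs[b] += h[a]
h_new = np.maximum(0.0, h @ W_self + msgs @ W_nbr)   # ReLU update
print(h_new.shape)  # (3, 4)
```

Stacking several such rounds lets information from distant parts of the graph reach each node.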

Given a molecular graph, GeoMol initially predicts the lengths of the chemical bonds between atoms and the angles of those individual bonds. The way the atoms are arranged and connected determines which bonds can rotate.

GeoMol then predicts the structure of each atom’s local neighborhood individually and assembles neighboring pairs of rotatable bonds by computing and aligning their torsion angles. A torsion angle describes the relative orientation of three connected segments, in this case the three chemical bonds that connect four atoms.
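The torsion (dihedral) angle of four bonded atoms can be computed from their 3D coordinates with a standard geometric formula; the sketch below is independent of GeoMol and just illustrates the quantity being predicted:

```python
import numpy as np

def torsion_angle(p0, p1, p2, p3):
    """Torsion (dihedral) angle in degrees for four 3D points,
    e.g., four atoms connected by three chemical bonds."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2     # the three bond vectors
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)  # normals of the two planes
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    x, y = n1 @ n2, m1 @ n2
    return np.degrees(np.arctan2(y, x))

# Four atoms in a fully staggered (anti) arrangement: torsion of 180 degrees.
pts = np.array([[1.0, 0, 0], [0, 0, 0], [0, 0, 1.0], [-1.0, 0, 1.0]])
print(torsion_angle(*pts))  # 180.0
```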

“Here, the rotatable bonds can take a huge range of possible values. So, the use of these message passing neural networks allows us to capture a lot of the local and global environments that influences that prediction. The rotatable bond can take multiple values, and we want our prediction to be able to reflect that underlying distribution,” Pattanaik says.

Overcoming existing hurdles

One major challenge to predicting the 3D structure of molecules is to model chirality. A chiral molecule can’t be superimposed on its mirror image, like a pair of hands (no matter how you rotate your hands, there is no way their features exactly line up). If a molecule is chiral, its mirror image won’t interact with the environment in the same way.

This could cause medicines to interact with proteins incorrectly, which could result in dangerous side effects. Current machine learning methods often involve a long, complex optimization process to ensure chirality is correctly identified, Ganea says.

Because GeoMol determines the 3D structure of each bond individually, it explicitly defines chirality during the prediction process, eliminating the need for optimization after the fact.

After performing these predictions, GeoMol outputs a set of likely 3D structures for the molecule.

“What we can do now is take our model and connect it end-to-end with a model that predicts this attachment to specific protein surfaces. Our model is not a separate pipeline. It is very easy to integrate with other deep learning models,” Ganea says.

A “super-fast” model

The researchers tested their model using a dataset of molecules and the likely 3D shapes they could take, which was developed by Rafael Gomez-Bombarelli, the Jeffrey Cheah Career Development Chair in Engineering, and graduate student Simon Axelrod.

They evaluated how many of these likely 3D structures their model was able to capture, compared with other machine learning models and methods.

GeoMol outperformed the other models in nearly every instance across the tested metrics.

“We found that our model is super-fast, which was really exciting to see. And importantly, as you add more rotatable bonds, you expect these algorithms to slow down significantly. But we didn’t really see that. The speed scales nicely with the number of rotatable bonds, which is promising for using these types of models down the line, especially for applications where you are trying to quickly predict the 3D structures inside these proteins,” Pattanaik says.

In the future, the researchers hope to apply GeoMol to the area of high-throughput virtual screening, using the model to determine small molecule structures that would interact with a specific protein. They also want to keep refining GeoMol with additional training data so it can more effectively predict the structure of long molecules with many flexible bonds.

“Conformational analysis is a key component of numerous tasks in computer-aided drug design, and an important component in advancing machine learning approaches in drug discovery,” says Pat Walters, senior vice president of computation at Relay Therapeutics, who was not involved in this research. “I’m excited by continuing advances in the field and thank MIT for contributing to broader learnings in this area.”

This research was funded by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium.

Read More

Stanford AI Lab Papers and Talks at NeurIPS 2021

The Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) 2021 is being hosted virtually from Dec 6th – 14th. We’re excited to share all the work from SAIL being presented at the main conference, in the Datasets and Benchmarks track, and at the various workshops; you’ll find links to papers, videos, and blogs below.

Some of the members in our SAIL community also serve as co-organizers of several exciting workshops that will take place on Dec 13-14, so we hope you will check them out!

Feel free to reach out to the contact authors and the workshop organizers directly to learn more about the work that’s happening at Stanford!

Main Conference

Improving Compositionality of Neural Networks by Decoding Representations to Inputs


Authors: Mike Wu, Noah Goodman, Stefano Ermon

Contact: wumike@stanford.edu

Links: Paper

Keywords: generative models, compositionality, decoder


Reverse engineering recurrent neural networks with Jacobian switching linear dynamical systems


Authors: Jimmy T.H. Smith, Scott W. Linderman, David Sussillo

Contact: jsmith14@stanford.edu

Links: Paper | Website

Keywords: recurrent neural networks, switching linear dynamical systems, interpretability, fixed points


Compositional Transformers for Scene Generation


Authors: Drew A. Hudson, C. Lawrence Zitnick

Contact: dorarad@cs.stanford.edu

Links: Paper | Github

Keywords: GANs, transformers, compositionality, scene synthesis


Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers


Authors: Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, Chris Ré

Contact: albertgu@stanford.edu

Links: Paper

Keywords: recurrent neural networks, rnn, continuous models, state space, long range dependencies, sequence modeling


Emergent Communication of Generalizations


Authors: Jesse Mu, Noah Goodman

Contact: muj@stanford.edu

Links: Paper | Video

Keywords: emergent communication, multi-agent communication, language grounding, compositionality


ELLA: Exploration through Learned Language Abstraction


Authors: Suvir Mirchandani, Siddharth Karamcheti, Dorsa Sadigh

Contact: suvir@cs.stanford.edu

Links: Paper | Video

Keywords: instruction following, reward shaping, reinforcement learning


CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation


Authors: Yusuke Tashiro, Jiaming Song, Yang Song, Stefano Ermon

Contact: ytashiro@stanford.edu

Links: Paper | Website

Keywords: score-based generative modeling, time series imputation


Confidence-Aware Imitation Learning from Demonstrations with Varying Optimality


Authors: Songyuan Zhang, Zhangjie Cao, Dorsa Sadigh, Yanan Sui

Contact: szhang21@mit.edu

Links: Paper | Video | Website

Keywords: imitation learning, learning from demonstration, learning from suboptimal demonstrations


Explaining heterogeneity in medial entorhinal cortex with task-driven neural networks


Authors: Aran Nayebi, Alexander Attinger, Malcolm G. Campbell, Kiah Hardcastle, Isabel I.C. Low, Caitlin S. Mallory, Gabriel C. Mel, Ben Sorscher, Alex H. Williams, Surya Ganguli, Lisa M. Giocomo, Daniel L.K. Yamins

Contact: anayebi@stanford.edu

Award nominations: Spotlight Presentation

Links: Paper | Website

Keywords: neural coding, medial entorhinal cortex, grid cells, biologically-inspired navigation, path integration, recurrent neural networks


On the theory of reinforcement learning with once-per-episode feedback


Authors: Niladri Chatterji, Aldo Pacchiano, Peter Bartlett, Michael Jordan

Contact: niladri@cs.stanford.edu

Keywords: theoretical reinforcement learning, binary rewards, non-markovian rewards


HyperSPNs: Compact and Expressive Probabilistic Circuits


Authors: Andy Shih, Dorsa Sadigh, Stefano Ermon

Contact: andyshih@stanford.edu

Links: Paper | Video | Website

Keywords: generative models, tractable probabilistic models, sum product networks, probabilistic circuits


COMBO: Conservative Offline Model-Based Policy Optimization


Authors: Tianhe Yu*, Aviral Kumar*, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn

Contact: tianheyu@cs.stanford.edu

Links: Paper

Keywords: offline reinforcement learning, model-based reinforcement learning, deep reinforcement learning


Conservative Data Sharing for Multi-Task Offline Reinforcement Learning


Authors: Tianhe Yu*, Aviral Kumar*, Yevgen Chebotar, Karol Hausman, Sergey Levine, Chelsea Finn

Contact: tianheyu@cs.stanford.edu

Links: Paper

Keywords: offline reinforcement learning, multi-task reinforcement learning, deep reinforcement learning


Autonomous Reinforcement Learning via Subgoal Curricula


Authors: Archit Sharma, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn

Contact: architsh@stanford.edu

Links: Paper | Website

Keywords: reinforcement learning, curriculum, autonomous learning, reset-free reinforcement learning


Lossy Compression for Lossless Prediction


Authors: Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, Chris J. Maddison

Contact: yanndubs@stanford.edu

Award nominations: Spotlight Presentation

Links: Paper | Video | Website

Keywords: compression, invariances, information theory, machine learning, self-supervised learning


Capturing implicit hierarchical structure in 3D biomedical images with self-supervised hyperbolic representations


Authors: Joy Hsu, Jeffrey Gu, Gong-Her Wu, Wah Chiu, Serena Yeung

Contact: joycj@stanford.edu

Links: Paper

Keywords: hyperbolic representations, hierarchical structure, biomedical


Estimating High Order Gradients of the Data Distribution by Denoising


Authors: Chenlin Meng, Yang Song, Wenzhe Li, Stefano Ermon

Contact: chenlin@stanford.edu

Keywords: score matching, langevin dynamics, denoising, generative modeling


Universal Off-Policy Evaluation


Authors: Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, Philip Thomas

Contact: ychandak@cs.umass.edu

Links: Paper | Website

Keywords: metrics, risk, distribution, cdf, off-policy evaluation, ope, reinforcement learning, counterfactuals, high-confidence bounds, confidence intervals


Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models


Authors: Phil Chen, Masha Itkina, Ransalu Senanayake, Mykel J. Kochenderfer

Contact: philhc@stanford.edu

Links: Paper

Keywords: deep learning or neural networks, sparsity and feature selection, variational inference, (application) natural language and text processing


Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss


Authors: Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, Tengyu Ma

Contact: jhaochen@stanford.edu

Links: Paper

Keywords: deep learning theory, unsupervised learning theory, representation learning theory


Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature


Authors: Kefan Dong, Jiaqi Yang, Tengyu Ma

Contact: kefandong@stanford.edu

Links: Paper | Video

Keywords: nonlinear bandits, online learning, deep reinforcement learning theory, sequential rademacher complexity


Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP


Authors: Joshua Rozner, Christopher Potts, Kyle Mahowald

Contact: rozner@stanford.edu

Links: Paper | Website

Keywords: compositionality in language, curriculum learning, meta-linguistics, systematicity, generalization


Design of Experiments for Stochastic Contextual Linear Bandits


Authors: Andrea Zanette*, Kefan Dong*, Jonathan Lee*, Emma Brunskill

Contact: zanette@berkeley.edu

Links: Paper

Keywords: linear bandits, design of experiments


Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning


Authors: Andrea Zanette, Martin J. Wainwright, Emma Brunskill

Contact: zanette@berkeley.edu

Links: Paper

Keywords: offline rl, mirror descent, bellman closure


A Topological Perspective on Causal Inference


Authors: Duligur Ibeling, Thomas Icard

Contact: icard@stanford.edu

Links: Paper

Keywords: causal inference, topological learning theory


Adversarial Training Helps Transfer Learning via Better Representations


Authors: Zhun Deng, Linjun Zhang, Kailas Vodrahalli, Kenji Kawaguchi, James Zou

Contact: jamesyzou@gmail.com

Links: Paper

Keywords: transfer learning, adversarial training


Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation


Authors: Lin Guan, Mudit Verma, Sihang Guo, Ruohan Zhang, Subbarao Kambhampati

Contact: zharu@stanford.edu

Award nominations: Spotlight

Links: Paper | Website

Keywords: human-in-the-loop reinforcement learning, evaluative feedback, saliency map, visual explanation


Machine versus Human Attention in Deep Reinforcement Learning Tasks


Authors: Sihang Guo, Ruohan Zhang, Bo Liu, Yifeng Zhu, Dana Ballard, Mary Hayhoe, Peter Stone

Contact: zharu@stanford.edu

Links: Paper

Keywords: deep reinforcement learning, interpretability, attention, eye tracking


Play to Grade: Testing Coding Games as Classifying Markov Decision Process


Authors: Allen Nie, Emma Brunskill, Chris Piech

Contact: anie@stanford.edu

Links: Paper | Website

Keywords: reinforcement learning, computational education, collaborative training, markov decision process


The Value of Information When Deciding What to Learn


Authors: Dilip Arumugam, Benjamin Van Roy

Contact: dilip@cs.stanford.edu

Links: Paper

Keywords: exploration, information theory, multi-armed bandits, reinforcement learning


[Diversity Matters When Learning From Ensembles](https://papers.nips.cc/paper/2021/hash/466473650870501e3600d9a1b4ee5d44-Abstract.html)


Authors: Giung Nam*, Jongmin Yoon*, Yoonho Lee, Juho Lee

Contact: yoonho@cs.stanford.edu

Links: [Paper](https://arxiv.org/abs/2110.14149) | Website

Keywords: deep ensembles, knowledge distillation, calibration, output diversified sampling, batchensemble


Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes


Authors: HyunJi Nam, Scott Fleming, Emma Brunskill

Contact: scottyf@stanford.edu

Links: Paper | Website

Keywords: reinforcement learning, observation cost, markov decision process, mdp, partially observable markov decision process, pomdp, probably approximately correct, pac, healthcare, health care


Meta-learning with an Adaptive Task Scheduler


Authors: Huaxiu Yao, Yu Wang, Ying Wei, Peilin Zhao, Mehrdad Mahdavi, Defu Lian, Chelsea Finn

Contact: huaxiu@cs.stanford.edu

Links: Paper

Keywords: adaptive task scheduler, meta-learning, sampling


Spatial-Temporal Super-Resolution of Satellite Imagery via Conditional Pixel Synthesis


Authors: Yutong He, Dingjie Wang, Nicholas Lai, William Zhang, Chenlin Meng, Marshall Burke, David B. Lobell, Stefano Ermon

Contact: kellyyhe@stanford.edu

Links: Paper | Video | Website

Keywords: remote sensing, super-resolution, generative models


Scatterbrain: Unifying Sparse and Low-rank Attention


Authors: Beidi Chen*, Tri Dao*, Eric Winsor, Zhao Song, Atri Rudra, Christopher Ré

Contact: trid@stanford.edu

Links: Paper

Keywords: efficient attention, sparse, low-rank


BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery


Authors: Chris Cundy, Aditya Grover, Stefano Ermon

Contact: cundy@stanford.edu

Keywords: causal inference, variational inference


Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration


Authors: Shengjia Zhao, Michael P Kim, Roshni Sahoo, Tengyu Ma, Stefano Ermon

Contact: sjzhao@stanford.edu

Links: Paper

Keywords: calibration, decision making under uncertainty


Beyond Pinball Loss: Quantile Methods for Calibrated Uncertainty Quantification


Authors: Youngseog Chung, Willie Neiswanger, Ian Char, Jeff Schneider

Contact: youngsec@andrew.cmu.edu, neiswanger@cs.stanford.edu

Links: Paper | Website

Keywords: uncertainty quantification, uq, quantile regression, pinball loss


Causal Abstractions of Neural Networks


Authors: Atticus Geiger*, Hanson Lu*, Thomas Icard, Christopher Potts

Contact: atticusg@stanford.edu

Links: Paper

Keywords: interpretability, analysis, nlp, causality


Generalized Shape Metrics on Neural Representations


Authors: Alex H Williams, Erin Kunz, Simon Kornblith, Scott Linderman

Contact: alex.h.willia@gmail.com

Keywords: representational similarity analysis, neural representations, shape analysis, metric space


D2C: Diffusion-Denoising Models for Few-shot Conditional Generation


Authors: Abhishek Sinha*, Jiaming Song*, Chenlin Meng, Stefano Ermon

Contact: tsong@cs.stanford.edu

Links: Paper | Website

Keywords: generative modeling, contrastive learning, conditional generation


Combiner: Full Attention Transformer with Sparse Computation Cost


Authors: Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai

Contact: hyren@cs.stanford.edu

Links: Paper

Keywords: efficient transformer


Maximum Likelihood Training of Score-Based Diffusion Models


Authors: Yang Song, Conor Durkan, Iain Murray, Stefano Ermon

Contact: yangsong@cs.stanford.edu

Award nominations: Spotlight presentation

Links: Paper

Keywords: score-based generative models, denoising score matching, diffusion models, maximum likelihood training


Contrastive Reinforcement Learning of Symbolic Reasoning Domains


Authors: Gabriel Poesia, WenXin Dong, Noah Goodman

Contact: poesia@stanford.edu

Keywords: reinforcement learning, education, contrastive learning, symbolic reasoning


Equivariant Manifold Flows


Authors: Isay Katsman, Aaron Lou, Derek Lim, Qingxuan Jiang, Ser Nam Lim, Christopher M. De Sa

Contact: aaronlou@stanford.edu

Links: Paper | Website

Keywords: manifold, normalizing flow, equivariant, invariant


Lower Bounds on Metropolized Sampling Methods for Well-Conditioned Distributions


Authors: Yin Tat Lee, Ruoqi Shen, Kevin Tian

Contact: kjtian@stanford.edu

Award nominations: Oral presentation

Links: Paper | Video

Keywords: sampling, lower bounds, langevin dynamics, hamiltonian monte carlo


List-Decodable Mean Estimation in Nearly-PCA Time


Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian

Contact: kjtian@stanford.edu

Award nominations: Spotlight presentation

Links: Paper

Keywords: robust statistics, semidefinite programming, mixture models


Robust Regression Revisited: Acceleration and Improved Estimation Rates


Authors: Arun Jambulapati, Jerry Li, Tselil Schramm, Kevin Tian

Contact: kjtian@stanford.edu

Links: Paper

Keywords: robust statistics, regression, generalized linear models, acceleration, sum of squares methods


Learning with User-Level Privacy


Authors: Daniel Levy*, Ziteng Sun*, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, Ananda Theertha Suresh

Contact: danilevy@stanford.edu

Links: Paper

Keywords: differential privacy user-level


Adapting to Function Difficulty and Growth Conditions in Private Optimization


Authors: Hilal Asi*, Daniel Levy*, John C. Duchi

Contact: asi@stanford.edu

Links: Paper

Keywords: differential privacy adaptivity optimization


Imitation with Neural Density Models


Authors: Kuno Kim, Akshat Jindal, Yang Song, Jiaming Song, Yanan Sui, Stefano Ermon

Contact: khkim@cs.stanford.edu

Links: Paper

Keywords: rl, imitation learning, density estimation


Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning


Authors: Colin Wei, Sang Michael Xie, Tengyu Ma

Contact: colinwei@stanford.edu

Links: Paper

Keywords: nlp pretraining, theoretical analysis


Safe Reinforcement Learning by Imagining the Near Future


Authors: Garrett Thomas, Yuping Luo, Tengyu Ma

Contact: gwthomas@stanford.edu

Links: Paper

Keywords: safe exploration, model-based rl


Pseudo-Spherical Contrastive Divergence


Authors: Lantao Yu, Jiaming Song, Yang Song, Stefano Ermon

Contact: lantaoyu@cs.stanford.edu

Links: Paper

Keywords: deep generative models, energy-based models, proper scoring rules


IQ-Learn: Inverse soft-Q Learning for Imitation


Authors: Divyansh Garg, Shuvam Chakraborty, Chris Cundy, Jiaming Song, Stefano Ermon

Contact: divgarg@stanford.edu

Award nominations: Spotlight

Links: Paper | Website

Keywords: reinforcement learning, imitation learning, inverse reinforcement learning, statistical learning, energy-based models


Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks


Authors: Tolga Birdal, Aaron Lou, Leonidas Guibas, Umut Simsekli

Contact: aaronlou@stanford.edu

Links: Paper | Website

Keywords: generalization, persistent homology, intrinsic dimension, deep networks


Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval


Authors: Omar Khattab, Christopher Potts, Matei Zaharia

Contact: okhattab@stanford.edu

Award nominations: Spotlight paper

Links: Paper | Blog Post

Keywords: neural retrieval, multi-hop question answering, claim verification, reasoning, colbert


Datasets and Benchmarks Track

Workshops

This year, multiple members of the SAIL community are also involved in great workshops that will take place on Dec 13-14. We hope you’ll check them out!

Machine Learning for Structural Biology Workshop (Dec 13)



Organizers: Namrata Anand, Bonnie Berger, Wouter Boomsma, Erika DeBenedictis, Stephan Eismann, John Ingraham, Sergey Ovchinnikov, Roshan Rao, Raphael Townshend and Ellen Zhong

Controllable Generative Modeling in Language and Vision (CtrlGen Workshop) (Dec 13)



Organizers: Steven Y. Feng, Drew A. Hudson, Anusha Balakrishnan, Varun Gangal, Dongyeop Kang, Tatsunori Hashimoto and Joel Tetreault

DistShift Workshop (Dec 13)



Organizers: Shiori Sagawa, Pang Wei Koh, Fanny Yang, Hongseok Namkoong, Jiashi Feng, Kate Saenko, Percy Liang, Sarah Bird and Sergey Levine

Data-centric AI Workshop (Dec 14)



Organizers: Andrew Ng, Lora Aroyo, Cody Coleman, Greg Diamos, Vijay Janapa Reddi, Joaquin Vanschoren, Carole-Jean Wu and Sharon Zhou

Physical Reasoning and Inductive Biases for the Real World Workshop (Dec 14)



Organizers: Krishna Murthy Jatavallabhula, Rika Antonova, Kevin Smith, Hsiao-Yu (Fish) Tung, Florian Shkurti, Jeannette Bohg and Josh Tenenbaum

Workshop Papers

  • How Does Contrastive Pre-training Connect Disparate Domains? by Kendrick Shen*, Robbie Jones*, Ananya Kumar*, Sang Michael Xie*, Percy Liang (DistShift Workshop)
  • Optimal Representations for Covariate Shifts by Yann Dubois, Yangjun Ruan, Chris J. Maddison (DistShift Workshop)
  • Correct-N-Contrast: a Contrastive Approach for Improving Robustness to Spurious Correlations by Michael Zhang, Nimit S. Sohoni, Hongyang R. Zhang, Chelsea Finn, Christopher Ré (DistShift Workshop)
  • Calibrated Ensembles: A Simple Way to Mitigate ID-OOD Accuracy Tradeoffs by Ananya Kumar, Aditi Raghunathan, Tengyu Ma, Percy Liang (DistShift Workshop)
  • Sharp Bounds for Federated Averaging (Local SGD) and Continuous Perspective by Margalit Glasgow*, Honglin Yuan*, Tengyu Ma (New Frontiers in Federated Learning)
  • What Matters in Learning from Offline Human Demonstrations for Robot Manipulation | Blog Post | Video | Website by Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, Roberto Martín-Martín (Offline Reinforcement Learning Workshop)
  • An Algorithmic Theory of Metacognition in Minds and Machines | Blog Post by Rylan Schaeffer (Metacognition in the Age of AI: Challenges and Opportunities)
  • Beyond Ads: Sequential Decision-Making Algorithms in Public Policy by Peter Henderson, Ben Chugg, Brandon Anderson, Daniel E. Ho (Workshop on Causal Inference Challenges in Sequential Decision Making)
  • Tracking Urbanization in Developing Regions with Remote Sensing Spatial-Temporal Super-Resolution by Yutong He*, William Zhang*, Chenlin Meng, Marshall Burke, David B. Lobell, Stefano Ermon (Workshop on Machine Learning for the Developing World (ML4D))
  • Likelihood-free Density Ratio Acquisition Functions are not Equivalent to Expected Improvements by Jiaming Song, Stefano Ermon (Bayesian Deep Learning Workshop)
  • Exploiting Proximity Search and Easy Examples to Select Rare Events by Daniel Kang, Alex Derhacobian, Kaoru Tsuji, Trevor Hebert, Peter Bailis, Tadashi Fukami, Tatsunori Hashimoto, Yi Sun, Matei Zaharia (Data Centric AI workshop)

We look forward to seeing you at NeurIPS 2021!

Read More

Improving the performance of large-scale applications via basic block reordering

What the research is:

At Meta, we develop compilers to optimize the performance of our large-scale applications running in data center environments. Profile-guided optimization (PGO) is an important step in modern compilers for improving the performance of large-scale applications based on their runtime behavior. The technique leverages program execution profiles, such as the execution frequencies of binary functions or individual instructions, to guide compilers to optimize critical paths of a program more selectively and effectively.

Basic block reordering is among the most impactful PGO techniques. As part of the compilation process, a binary is broken up into smaller blocks. Many of these basic blocks are executed one after the other, but some end in conditional branches (control-flow instructions, like if-then-else, while, and switch statements) where execution can jump to one of two or more blocks. Depending on the relative frequency of these jumps, some orderings of basic blocks lead to fewer CPU instruction cache misses and therefore faster execution.

The source code on the left is transformed into the control-flow graph in the middle. The code blocks that make up this graph are laid out in memory on the right.

Profiling is used to collect information about the typical execution of an application. It allows us to learn how many times each basic block has been executed and how many times each branch has been taken. Given this information, a compiler’s job is to produce the most “CPU-friendly” ordering of basic blocks that leads to the best binary performance.

Traditional compiler approaches for basic block reordering optimize a single dimension of CPU performance, such as instruction cache line utilization or branch prediction. However, we found that such orderings may yield suboptimal results. To overcome these shortcomings, we proposed a new model for block reordering that combines multiple effects and does a better job of predicting the performance of an application. Our model is based on a new optimization problem that we call the extended traveling salesman problem, or Ext-TSP for short.

How it works:

Given a collection of cities and distances between every pair of cities, the classical traveling salesman problem (TSP) involves finding an order in which to visit the cities to minimize the total distance traveled. There are many variations of this problem, like MAX-TSP, where we want to maximize the total distance traveled. Ext-TSP generalizes the latter problem: the objective counts the distance not only between adjacent cities but also between cities that are close enough together in the order, say, no more than a fixed number of positions apart.

In the context of basic block reordering, the blocks play the role of the cities, and the jump counts play the role of the distances between two cities. The ordering corresponds to how the basic blocks are laid out in memory. If two basic blocks are laid out close together, then there is a good chance that a jump from one to another will not incur an instruction cache miss. In a sense, the Ext-TSP objective aims to optimize the utilization of the instruction cache and thus the performance of the application.
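A simplified version of this objective can be sketched in a few lines. The scoring below is a stand-in (the actual Ext-TSP objective in the paper weights forward and backward jumps differently): it credits a jump's execution count whenever its two blocks land within a small window of each other in the layout.

```python
# Simplified Ext-TSP-style score: a jump's count is credited when its
# source and target blocks are within `window` positions in the layout.
def ext_tsp_score(layout, jumps, window=2):
    pos = {block: i for i, block in enumerate(layout)}
    return sum(count for (src, dst), count in jumps.items()
               if abs(pos[src] - pos[dst]) <= window)

# Hypothetical profile: jump counts between basic blocks A, B, C, D.
jumps = {("A", "B"): 100, ("B", "C"): 80, ("A", "D"): 5}
print(ext_tsp_score(["A", "B", "C", "D"], jumps))  # 180: A-D is too far apart
print(ext_tsp_score(["A", "D", "B", "C"], jumps))  # 185: all jumps in window
```

Even on this toy profile, reordering the blocks raises the score, which is exactly the kind of improvement the compiler's layout pass is searching for.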

Our paper “Improved basic block reordering” introduces the new optimization problem. It shows that finding the best ordering is NP-hard, but also that an efficient greedy heuristic produces good solutions for the instances typically arising from real-world binaries. In addition, we describe a mixed integer programming formulation for the problem that is capable of finding optimal solutions on small functions. Our experiments with this exact method demonstrate that the suggested heuristic finds an optimal ordering of basic blocks in over 98 percent of real-world instances. From the practical point of view, the new basic block reordering has been implemented in BOLT, an open source binary optimization and layout tool developed at Meta. An extensive evaluation on a variety of real-world data center applications indicates that the new method outperforms existing block reordering techniques, improving the performance of applications with large code size.

This image shows a control-flow graph and two block layouts, one maximizing the number of fall-through jumps and another maximizing the Ext-TSP objective.

Theory behind extended TSP

Given the notoriety of the original TSP problem, we wanted to understand how much more difficult Ext-TSP was compared to the classical TSP. In our paper “On the extended TSP problem,” we study Ext-TSP from a mathematical perspective and prove both negative and positive results about the problem.

On the negative side, it turns out that Ext-TSP is much harder than classical TSP. We prove that, conditional on the exponential time hypothesis, no efficient algorithm can solve the problem optimally, even for very simple treelike instances, such as those arising from simple programs without loops. This is very surprising, as most optimization problems (including the classical TSP) admit such efficient algorithms on trees.

On the positive side, we design so-called approximation algorithms that are efficient and return a solution that is guaranteed to be at most a given fixed factor worse than the optimal solution. Given the previous impossibility of having efficient optimal algorithms, such approximation algorithms are the best we can hope for.

Why it matters:

Developing new compiler technology to optimize the performance of our servers is an impactful area of research at Meta. Faster applications mean less computing power is needed to serve our users, which directly translates into lower electricity usage in our data centers and a smaller environmental footprint for our operations.

The post Improving the performance of large-scale applications via basic block reordering appeared first on Facebook Research.

PASS: Performance-Adaptive Sampling Strategy for Graph Convolutional Networks

Figure 1: On LinkedIn, people are commonly connected with members from the same field who are likely to share skills and/or job preferences. Graph Convolutional Networks (GCNs) leverage this feature of the LinkedIn network and make better job recommendations by aggregating information from a member’s connections. For instance, to recommend a job to Member A, GCNs will aggregate information from Members B, C, and D who worked/are working in the same companies or have the same major.

TL;DR: Graph Convolutional Networks (GCNs) complement each node's embedding with its neighboring nodes' embeddings under a ‘homophily’ assumption: “connected nodes are relevant.” This leads to two critical problems when applying GCNs to real-world graphs: 1) scalability: the number of neighboring nodes is sometimes too large to aggregate them all (e.g., Cristiano Ronaldo has 358 million connected accounts — his followers — on Instagram’s member-to-member network); 2) low accuracy: nodes are sometimes connected to irrelevant nodes (e.g., people make connections on LinkedIn with personal friends who work in completely different fields). Here, we introduce a performance-adaptive sampling strategy for GCNs that solves both the scalability and the accuracy problem at once.

Graphs are ubiquitous. Any set of entities and the interactions among them can be represented as a graph — nodes correspond to the individual entities, and an edge is generated between two nodes when the corresponding entities interact. For instance, there are who-follows-whom graphs in social networks, who-pays-whom transaction networks in banking systems, and who-buys-which-products graphs in online malls. Beyond such natively graph-structured data, other computer science fields have recently built new types of graphs by abstracting their core concepts (e.g., scene graphs in computer vision or knowledge graphs in NLP).

What are Graph Convolutional Networks?

As graphs contain rich contextual information — relationships among entities — various approaches have been proposed to incorporate graph information into deep learning models. One of the most successful is Graph Convolutional Networks (GCNs) [1]. Intuitively, GCNs complement each node's embedding with its neighboring nodes' embeddings, assuming that neighboring nodes are relevant (we call this ‘homophily’) and that their information therefore helps improve the target node's embedding. In Figure 1, on LinkedIn’s member-to-member network, we refer to Member A‘s previous/current colleagues to make a job recommendation for Member A, assuming their jobs or skills are related to Member A‘s. GCNs aggregate neighboring node embeddings by borrowing the convolutional filter concept from Convolutional Neural Networks (CNNs) and replacing it with a first-order graph spectral filter.

Figure 2. GCNs aggregate neighboring node embeddings to complement each node's embedding in the convolution operations. After 2 steps of convolution, each node carries information from its neighbors within 2 hops.

When \(h_i^{(l)}\) denotes the hidden embedding of node \(v_i\) in the \(l\)-th layer, one-step convolution (we also call it one-step aggregation or one-step message-passing) in GCNs is described as follows:

\[h^{(l+1)}_i = \alpha\left( \frac{1}{N(i)}\sum_{j=1}^{N}a(v_i, v_j)h^{(l)}_jW^{(l)} \right), \quad l = 0,\dots,L-1 \tag{1}\label{1}\]

where \(a(v_i, v_j) = 1\) when there is an edge from \(v_i\) to \(v_j\), and 0 otherwise; \(N(i) = \sum_{j=1}^{N} a(v_i, v_j)\) is the degree of node \(v_i\); \(\alpha(\cdot)\) is a nonlinear function; and \(W^{(l)}\) is the learnable transformation matrix. In short, GCNs average the neighboring nodes \(v_j\)'s embeddings \(h_j^{(l)}\), transform them with \(W^{(l)}\) and \(\alpha(\cdot)\), then update node \(v_i\)'s embedding \(h_i^{(l+1)}\) using the aggregated and transformed neighboring embeddings. In practice, \(h_i^{(0)}\) is set to the input node attributes, and \(h_i^{(L)}\) is passed to an output layer specialized for the given downstream task. By stacking graph convolutional layers \(L\) times, \(L\)-layered GCNs complement each node's embedding with its neighboring nodes within \(L\) hops (Figure 2).
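Concretely, Equation (1) is a normalized adjacency multiplication followed by a linear map and a nonlinearity. Here is a minimal NumPy sketch of one layer (our own illustration; production GCNs use sparse operations in frameworks such as PyTorch Geometric):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN aggregation step in the spirit of Equation (1).

    H: (N, d) node embeddings h^{(l)};  A: (N, N) 0/1 adjacency matrix;
    W: (d, d') learnable transform W^{(l)}.  Uses mean aggregation over
    neighbors and a ReLU nonlinearity (one common choice for alpha).
    """
    deg = A.sum(axis=1, keepdims=True)      # N(i), degree of each node
    deg = np.maximum(deg, 1)                # guard against isolated nodes
    aggregated = (A @ H) / deg              # mean of neighbor embeddings
    return np.maximum(aggregated @ W, 0.0)  # ReLU(mean(H_nbr) W)
```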

GCNs have garnered considerable attention as a powerful deep learning tool for representation learning on graph data. They demonstrate state-of-the-art performance on node classification, link prediction, and graph property prediction tasks. Currently, GCNs are among the hottest topics in the graph mining and deep learning fields.

GCNs do not scale to large-scale real-world graphs.

However, when we apply GCNs to million- or billion-scale real-world graphs (even trillion-scale graphs for Google or Facebook), they exhibit a scalability issue. The main challenge comes from neighborhood expansion — GCNs expand neighbors recursively in the aggregation operations (i.e., convolution operations), leading to high computation costs and memory footprints. For instance, given a graph with average degree \(d\), \(L\)-layer GCNs access \(d^L\) neighbors per node on average (Figure 2). If the graph is dense or has many high-degree nodes (e.g., Cristiano Ronaldo has 358 million followers on Instagram), GCNs need to aggregate a huge number of neighbors for most of the training/test examples.

The only way to alleviate this neighbor explosion problem is to sample a fixed number of neighbors in the aggregation operation, thereby bounding the computation time and memory usage. We first recast the original Equation \eqref{1} as follows:

\[h^{(l+1)}_i = \alpha\left( \mathbb{E}_{j\sim p(j|i)}[h^{(l)}_j]W^{(l)} \right), \quad l = 0,\dots,L-1 \tag{2}\label{2}\]

where \(p(j|i) = \frac{a(v_i, v_j)}{N(i)}\) defines the probability of sampling \(v_j\) given \(v_i\). Then we approximate the expectation by Monte Carlo sampling as follows [2]:

\[h^{(l+1)}_i = \alpha\left( \frac{1}{k}\sum_{j\sim p(j|i)}^{k}h^{(l)}_jW^{(l)} \right), \quad l = 0,\dots,L-1 \tag{3}\label{3}\]

where \(k\) is the number of sampled neighbors for each node. Now we can regulate the GCNs’ computation costs through the sampling number \(k\).
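Equation (3) replaces the full-neighborhood average with a \(k\)-sample Monte Carlo estimate. A NumPy sketch of this sampled aggregation, using the uniform \(p(j|i)\) of Equation (2) (our own illustration, not the paper's code):

```python
import numpy as np

def sampled_gcn_layer(H, neighbors, W, k, rng):
    """Sampled aggregation in the spirit of Equation (3): average k
    neighbors drawn from a uniform p(j|i) instead of the full neighborhood.

    H: (N, d) embeddings; neighbors: list of per-node neighbor-index arrays;
    W: (d, d') transform; k: samples per node; rng: numpy Generator.
    """
    out = []
    for nbrs in neighbors:
        # Draw k neighbor indices with replacement (Monte Carlo estimate).
        sampled = rng.choice(nbrs, size=k, replace=True)
        out.append(H[sampled].mean(axis=0))
    return np.maximum(np.stack(out) @ W, 0.0)  # ReLU, matching the sketch above
```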

GCN performance is affected by how neighbors are sampled — more specifically, by how the sampling policy \(q(j|i)\), the probability of sampling a neighboring node \(v_j\) given a source node \(v_i\), is defined. Various sampling policies [2-5] have been proposed to improve GCN performance. Most of them aim to minimize the variance caused by sampling (i.e., the variance of the estimator \(h^{(l+1)}_i\) in Equation \eqref{3}). Variance minimization makes the aggregation over the sampled neighborhood approximate the original aggregation over the full neighborhood. In other words, these sampling policies treat the full neighborhood as the optimum they should approximate. But is the full neighborhood really the optimum?

Are all neighbors really helpful?

Figure 3. In the real world, we make connections on LinkedIn not only with people working in similar fields but also with personal friends or family members who have different career paths. Which neighbors should we sample to make a better job recommendation?

To answer this question, let’s go back to the motivation for the convolution operation in GCNs. When two nodes are connected in a graph, we regard them as related to each other. Based on this ‘homophily’ assumption, GCNs aggregate neighboring nodes’ embeddings via the convolution operation to complement a target node’s embedding. The convolution operation in GCNs therefore shines only when the neighbors are informative for the task.

However, real-world graphs always contain unintended noisy neighbors. For example, in LinkedIn’s member-to-member network, a member might connect not only with colleagues working in the same field, but also with family members or personal friends who have completely different career paths (Figure 3). These family members and personal friends are uninformative for the job recommendation task. When their embeddings are aggregated into the target member’s embedding via the convolution operations, the recommendation quality degrades. Thus, to fully enjoy the benefits of the convolution operations, we need to filter out noisy neighbors.

How could we filter out noisy neighbors? We find the answer in the sampling policy: we sample only neighbors that are informative for a given task. How could we sample informative neighbors for the task? We train a sampler to maximize the target task’s performance (instead of minimizing sampling variance).

Figure 4. PASS is composed of three steps: (a) sampling, (b) feedforward propagation, and (c) backpropagation. In the backpropagation process, the GCN and the sampling policy are optimized jointly to minimize the GCN performance loss.

PASS: performance-adaptive sampling strategy for GCNs

We propose PASS, a performance-adaptive sampling strategy that optimizes a sampling policy directly for GCN performance. The key idea behind our approach is that we learn a sampling policy by propagating gradients of the GCN performance loss through the non-differentiable sampling operation. We first describe a learnable sampling policy function and how it operates in the GCN. Then we describe how to learn the parameters of the sampling policy by back-propagating gradients through the sampling operation.

Sampling policy: Figure 4 shows an overview of PASS. In the forward pass, PASS samples neighbors with its sampling policy (Figure 4(a)), then propagates their embeddings through the GCN (Figure 4(b)). Here, we introduce our parameterized sampling policy \(q^{(l)}(j|i)\), which estimates the probability of sampling node \(v_j\) given node \(v_i\) at the \(l\)-th layer. The policy \(q^{(l)}(j|i)\) is composed of two components, importance sampling \(q^{(l)}_{imp}(j|i)\) and random sampling \(q^{(l)}_{rand}(j|i)\), as follows:

\[\begin{aligned}
q^{(l)}_{imp}(j|i) &= (W_s\cdot h^{(l)}_i)\cdot(W_s\cdot h^{(l)}_j)\\
q^{(l)}_{rand}(j|i) &= \frac{1}{N(i)}\\
\tilde{q}^{(l)}(j|i) &= a_s\cdot\left[q^{(l)}_{imp}(j|i),\; q^{(l)}_{rand}(j|i)\right]\\
q^{(l)}(j|i) &= \tilde{q}^{(l)}(j|i) \Big/ \sum_{k=1}^{N(i)}\tilde{q}^{(l)}(k|i)
\end{aligned}\]

where \(W_s\) is a transformation matrix; \(h^{(l)}_i\) is the hidden embedding of node \(v_i\) at the \(l\)-th layer; \(N(i)\) is the degree of node \(v_i\); \(a_s\) is an attention vector; and \(q^{(l)}(\cdot|i)\) is normalized to sum to 1. \(W_s\) and \(a_s\) are the learnable parameters of our sampling policy, which are updated toward performance improvement.

When a graph is well-clustered (i.e., has few noisy neighbors), nodes are connected mostly to informative neighbors. In that case, random sampling becomes effective, since its randomness helps aggregate diverse informative neighbors, preventing the GCN from overfitting. By capitalizing on both importance and random sampling, our sampling policy generalizes better across various graphs. Since we do not know in advance whether a given graph is well-clustered, \(a_s\) learns which sampling methodology is more effective for the given task.
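Putting the policy equations together, the per-node computation looks roughly like the following sketch (our own simplified rendering; the clamping of negative scores before normalization is our assumption, not necessarily how the paper handles it):

```python
import numpy as np

def sampling_policy(h_i, H_nbrs, W_s, a_s):
    """PASS-style mixed sampling policy for one node (illustrative sketch).

    h_i: (d,) target-node embedding; H_nbrs: (n, d) neighbor embeddings;
    W_s: (d', d) policy transform; a_s: (2,) attention over the two
    sub-policies.  Returns a probability vector over the n neighbors.
    """
    n = H_nbrs.shape[0]
    # Importance score: inner product in the W_s-projected space.
    q_imp = (W_s @ h_i) @ (W_s @ H_nbrs.T)   # shape (n,)
    q_rand = np.full(n, 1.0 / n)             # uniform component
    q = a_s[0] * q_imp + a_s[1] * q_rand     # attention-weighted mix
    q = np.maximum(q, 0.0)                   # keep scores nonnegative (our choice)
    return q / q.sum()                       # normalize to sum to 1
```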

Training the Sampling Policy: after a forward pass with sampling, the GCN computes the performance loss (e.g., cross-entropy for node classification), then back-propagates gradients of the loss (Figure 4(c)). To learn a sampling policy that maximizes GCN performance, PASS trains the policy on the gradients of the performance loss passed back through the GCN. When \(\theta\) denotes the parameters \((W_s, a_s)\) of our sampling policy \(q^{(l)}_{\theta}\), we can write the sampling operation with \(q^{(l)}_{\theta}(j|i)\) as follows:

\[h^{(l+1)}_i = \alpha_{W^{(l)}}\left(\mathbb{E}_{j\sim q^{(l)}_{\theta}(j|i)}[h^{(l)}_j]\right), \quad l = 0,\dots,L-1\]

Before being fed to the GCN transformation \(\alpha_{W^{(l)}}(\cdot)\), the hidden embeddings \(h^{(l)}_j\) go through an expectation operation \(\mathbb{E}_{j\sim q^{(l)}_{\theta}(j|i)}[\cdot]\) under the sampling policy, which is non-differentiable. To pass gradients of the loss through the expectation, we apply the log-derivative trick [6], widely used in reinforcement learning to compute gradients of stochastic policies. The gradient \(\nabla_\theta \mathcal{L}\) of the loss \(\mathcal{L}\) with respect to the sampling policy \(q^{(l)}_{\theta}(j|i)\) is then computed as stated in Theorem 4.1.

Based on Theorem 4.1, we pass the gradients of the GCN performance loss to the sampling policy through the non-differentiable sampling operation and optimize the sampling policy for the GCN performance. You can find proof of the theorem in our original paper. PASS optimizes the sampling policy jointly with the GCN parameters to minimize the task performance loss, resulting in a considerable performance improvement.
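For readers unfamiliar with the log-derivative trick, here is a generic NumPy illustration of the score-function gradient estimator \(\nabla_\theta\,\mathbb{E}_{j\sim q_\theta}[f(j)] = \mathbb{E}_{j\sim q_\theta}[f(j)\,\nabla_\theta \log q_\theta(j)]\). This is the general technique, not the PASS-specific gradient of Theorem 4.1:

```python
import numpy as np

def score_function_gradient(grad_log_q, f_values, samples):
    """Monte Carlo estimate of grad_theta E_{j~q_theta}[f(j)] via the
    log-derivative trick: average f(j) * grad_theta log q_theta(j)
    over indices j sampled from q_theta.

    grad_log_q(j): gradient of log q_theta(j) w.r.t. theta (user-supplied);
    f_values: (n,) downstream values f(j); samples: indices drawn from q.
    """
    grads = [f_values[j] * grad_log_q(j) for j in samples]
    return np.mean(grads, axis=0)
```

For a softmax-parameterized categorical policy over two outcomes with equal probabilities, this estimator matches the exact gradient when the sample set is balanced, which makes it easy to sanity-check.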

Experimental Results

Table 1. PASS outperforms all baselines up to 10.4% on the benchmark datasets and up to 10.2% on LinkedIn production datasets (LnkIndustry, LnkTitle). Results on the benchmark datasets are presented in precision. Results on LinkedIn production datasets are presented in percentage points (pp) with respect to GraphSage (random sampling).

To examine the effectiveness of PASS, we run PASS on seven public benchmarks and two LinkedIn production datasets, comparing it with four state-of-the-art sampling algorithms. GraphSage [2] samples neighbors randomly, while FastGCN [3], AS-GCN [4], and GCN-BS [5] perform importance sampling with various sampling policy designs. Note that FastGCN, AS-GCN, and GCN-BS all aim to minimize the variance caused by neighborhood sampling. In Table 1, our proposed PASS method shows the highest accuracy among all baselines across all datasets on the node classification tasks. One interesting result is that GraphSage, which samples neighbors randomly, still performs well compared with carefully designed importance sampling algorithms. The seven public datasets are well-clustered, meaning most neighbors are relevant rather than noisy to a target node, so there is not much room for improvement through importance sampling.

In the following experiment, we add noise to graphs. We investigate two different noise scenarios: 1) fake connections among existing nodes, and 2) fake neighbors with random feature vectors. These two scenarios are common in real-world graphs. The first “fake connection” scenario simulates connections made by mistake or unfit for the purpose (e.g., connections between family members in LinkedIn). The second “fake neighbor” scenario simulates fake accounts with random attributes used for fraudulent activities. For each node, we generate five true neighbors and five fake neighbors.

Table 2. PASS maintains high accuracy in various graph noise scenarios, while the accuracy of all other baselines plummets. PASS is effective not only in sampling informative neighbors but also in removing irrelevant neighbors.

Table 2 shows that PASS consistently maintains high accuracy across all scenarios, while the performance of all other methods plummets. GraphSage, which assigns the same sampling probability to true and fake neighbors, shows a sharp drop in accuracy. The importance-sampling-based methods, FastGCN, AS-GCN, and GCN-BS, also see a sharp drop. They aim to minimize sampling variance; thus they are likely to sample high-degree or dense-feature nodes, which help stabilize the variance, regardless of their relationship to the target node — so they all fail to distinguish fake neighbors from true ones. PASS, on the other hand, learns which neighbors are informative or fake from the gradients of the performance loss. These results show that optimizing the sampling policy for performance brings robustness to graph noise.

How does PASS learn which neighbors to sample?

PASS demonstrates superior performance in sampling informative neighbors for a given task. How does PASS learn whether a neighbor is informative for the task? How does PASS decide on a particular sampling probability for each neighbor? To understand how PASS actually works, we dissect its back-propagation process. In Theorem 5.1, we show that, during the back-propagation phase, PASS measures the alignment between \(-d\mathcal{L}/dh^{(l)}_i\) and \(h^{(l)}_j\) and increases the sampling probability \(q^{(l)}(j|i)\) in proportion to this alignment. The proof of Theorem 5.1 can be found in the original paper.

This is an intuitively reasonable learning mechanism. GCNs train their parameters to move the node embeddings \(h^{(l)}_i\) in the direction that minimizes the performance loss \(\mathcal{L}\), i.e., the gradient \(-d\mathcal{L}/dh^{(l)}_i\). PASS promotes this process by sampling neighbors whose embeddings are aligned with the gradient \(-d\mathcal{L}/dh^{(l)}_i\). When \(h^{(l)}_i\) is aggregated with the embedding \(h^{(l)}_j\) of a sampled neighbor aligned with the gradient, it moves in the direction that reduces the loss \(\mathcal{L}\).

Figure 5. Interpretation of why PASS assigns a higher sampling probability to node \(v_3\) than to \(v_5\) given target node \(v_2\). Node \(v_3\)’s embedding \(h^{(l)}_3\) helps \(v_2\)’s embedding \(h^{(l)}_2\) move in the direction \(-d\mathcal{L}/dh^{(l)}_2\) that decreases the performance loss \(\mathcal{L}\), while aggregating with node \(v_5\)’s embedding would move \(h^{(l)}_2\) in the opposite direction.

Let’s consider a simple example. In Figure 5, \(h^{(l)}_3\) is better aligned with \(-d\mathcal{L}/dh^{(l)}_2\) than \(h^{(l)}_5\). PASS therefore considers \(v_3\) more informative than \(v_5\) for \(v_2\): node \(v_3\)’s embedding \(h^{(l)}_3\) helps \(v_2\)’s embedding \(h^{(l)}_2\) move in the direction \(-d\mathcal{L}/dh^{(l)}_2\) that decreases the performance loss \(\mathcal{L}\), while aggregating with node \(v_5\)’s embedding would move \(h^{(l)}_2\) in the opposite direction.
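The alignment test behind Figure 5 is just a dot product between the negative gradient and a candidate neighbor's embedding. A toy numerical illustration (the numbers are ours):

```python
import numpy as np

# Direction that would reduce the loss for target node v_2, and two
# candidate neighbor embeddings (toy values for illustration only).
neg_grad = np.array([1.0, 0.0])   # -dL/dh_2
h_3 = np.array([0.9, 0.1])        # roughly aligned with the gradient
h_5 = np.array([-0.8, 0.2])       # points the opposite way

align_3 = float(neg_grad @ h_3)   # positive -> sampling probability goes up
align_5 = float(neg_grad @ h_5)   # negative -> sampling probability goes down
assert align_3 > align_5
```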

This reasoning process leads to two important insights. First, it crystallizes our understanding of the aggregation operation in GCNs: aggregation enables a node’s embedding to move toward its informative neighbors’ embeddings to reduce the performance loss. Second, it shows the benefit of jointly optimizing the GCN and the sampling policy. Without joint optimization, the GCN depends solely on its own parameters to move node embeddings toward the minimum performance loss. Joint optimization with the sampling policy helps the GCN move node embeddings more efficiently, by aggregating informative neighbors’ embeddings, and thus reach the minimum loss faster.

PASS catches two birds, “accuracy” and “scalability”, with one stone. 

Figure 6. PASS achieves both accuracy and scalability using a performance-adaptive sampling strategy.

Today, we introduced PASS, a novel sampling algorithm for graph convolutional networks. By sampling neighbors that are informative for task performance, PASS improves both the accuracy and scalability of GCNs. On nine different real-world graphs, PASS consistently outperforms state-of-the-art samplers, being up to 10.4% more accurate. In the presence of graph noise, PASS shows up to 53.1% higher accuracy than the baselines, proving its ability to read the context and filter out noise. By dissecting its back-propagation process, PASS can also explain why a neighbor is considered informative and assigned a high sampling probability.

In this era of big data, new graphs and tasks are generated every day. Graphs are becoming bigger and bigger, and different tasks require different relational information within them. By adaptively sampling neighbors informative for a given task, PASS allows GCNs to be applied to larger-scale graphs and a more diverse range of tasks. We believe PASS will have even greater impact on a wide range of users across academia and industry in the future.

Links: paper, video, slides; code will be released at the end of 2021.

If you would like to reference this article in an academic publication, please use this BibTeX:

@inproceedings{yoon2021performance,
  title={Performance-Adaptive Sampling Strategy Towards Fast and Accurate Graph Neural Networks},
  author={Yoon, Minji and Gervet, Th{\'e}ophile and Shi, Baoxu and Niu, Sufeng and He, Qi and Yang, Jaewon},
  booktitle={Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining},
  pages={2046--2056},
  year={2021}
}

References

  1. Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  2. Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems.
  3. Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
  4. Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive sampling towards fast graph representation learning. In Advances in neural information processing systems. 4558–4567.
  5. Ziqi Liu, Zhengwei Wu, Zhiqiang Zhang, Jun Zhou, Shuang Yang, Le Song, and Yuan Qi. 2020. Bandit Samplers for Training Graph Neural Networks. arXiv preprint arXiv:2006.05806 (2020).
  6. Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8, 3-4 (1992), 229–256.

AWS BugBust sets the Guinness World Record for the largest bug fixing challenge

AWS BugBust is the first global bug-busting challenge for developers to eliminate 1 million software bugs and save $100 million in technical debt for their organizations. AWS BugBust allows you to create and manage private events that transform and gamify the process of finding and fixing bugs in your software. With automated code analysis, built-in leaderboards, custom challenges, and rewards, AWS BugBust helps foster team building and introduces some friendly competition into improving code quality and application performance.

AWS BugBust utilizes the machine learning (ML)-powered developer tools in Amazon CodeGuru—CodeGuru Reviewer and CodeGuru Profiler—to automatically scan your code to weed out gnarly bugs, and gamifies fixing and eliminating them.

Since launch in June 2021, thousands of Java and Python developers have participated in AWS BugBust events hosted internally by their organizations. They have used their coding skills to collectively fix over 33,000 software bugs and helped organizations save approximately $4 million in technical debt—all while earning points and exclusive prizes.

Take a look at how the students of Miami Dade College took on the challenge to improve code quality using AWS BugBust for Games for Love, a 501(c)(3) public charity dedicated to easing suffering, saving lives, and creating sustainable futures for children.

First annual AWS BugBust re:Invent challenge

To increase the impact of AWS BugBust events, we launched the first annual AWS BugBust re:Invent challenge — an open competition for Java and Python developers to help fix bugs in open-source code bases. Beginning November 29, 2021, at 10 AM PST, thousands of Java and Python developers — enthusiasts, students, and professionals — settled into a 76-hour session to fix bugs, earn points, and win an array of prizes such as hoodies and Amazon Echo Dots, as well as the coveted title of Ultimate AWS BugBuster, accompanied by a cash prize of $1,500. The mammoth challenge included 613 developers of all skill levels, participating on-site at AWS re:Invent and virtually from over 20 countries. The event captured the Guinness World Record for the largest bug-fixing challenge and helped the open-source community by providing 30,667 software bug fixes.

As part of the challenge, participants tackled and fixed a myriad of software bugs, ranging from security issues and duplicate code to resource leaks and more. For each bug a participant fixed, they received points based on the bug's complexity. Each submitted fix was verified by CodeGuru to determine whether the issue was resolved.

Join the mission to exterminate 1 million bugs today!

Get started by going head-to-head with your teammates in your own private AWS BugBust events. With just a few clicks, you can set up private AWS BugBust virtual events quickly and easily on the AWS Management Console, with built-in leaderboards, challenges, and rewards. BugBusters (your developers) from around the world can join your BugBust events to fix as many bugs as possible, score points, and contribute to busting 1 million bugs—and compete for prizes and prestige by fixing bugs in their applications.

Watch AWS customer Nextroll’s experience hosting an AWS BugBust event for their developers to eliminate software bugs and improve application reliability for their organization.

You can get started with AWS BugBust at no cost. When you create your first AWS BugBust event, all costs incurred by the underlying usage of CodeGuru Reviewer and CodeGuru Profiler are free of charge for 30 days per AWS account. See the CodeGuru pricing page for details.


About the Authors

Sama Bali is a Product Marketing Manager within the AWS AI Services team.

Jordan Gruber is a Product Manager-Technical within the AWS AI-DevOps team.

Evaluating Syntactic Abilities of Language Models

Posted by Jason Wei, AI Resident and Dan Garrette, Research Scientist, Google Research

In recent years, pre-trained language models, such as BERT and GPT-3, have seen widespread use in natural language processing (NLP). By training on large volumes of text, language models acquire broad knowledge about the world, achieving strong performance on various NLP benchmarks. These models, however, are often opaque in that it may not be clear why they perform so well, which limits further hypothesis-driven improvement of the models. Hence, a new line of scientific inquiry has arisen: what linguistic knowledge is contained in these models?

While there are many types of linguistic knowledge that one may want to investigate, a topic that provides a strong basis for analysis is the subject–verb agreement grammar rule in English, which requires that the grammatical number of a verb agree with that of the subject. For example, the sentence “The dogs run.” is grammatical because “dogs” and “run” are both plural, but “The dogs runs.” is ungrammatical because “runs” is a singular verb.

One framework for assessing the linguistic knowledge of a language model is targeted syntactic evaluation (TSE), in which minimally different pairs of sentences, one grammatical and one ungrammatical, are shown to a model, and the model must determine which one is grammatical. TSE can be used to test knowledge of the English subject–verb agreement rule by having the model judge between two versions of the same sentence: one where a particular verb is written in its singular form, and the other in which the verb is written in its plural form.

With the above context, in “Frequency Effects on Syntactic Rule-Learning in Transformers”, published at EMNLP 2021, we investigated how a BERT model’s ability to correctly apply the English subject–verb agreement rule is affected by the number of times the words are seen by the model during pre-training. To test specific conditions, we pre-trained BERT models from scratch using carefully controlled datasets. We found that BERT achieves good performance on subject–verb pairs that do not appear together in the pre-training data, which indicates that it does learn to apply subject–verb agreement. However, the model tends to predict the incorrect form when it is much more frequent than the correct form, indicating that BERT does not treat grammatical agreement as a rule that must be followed. These results help us to better understand the strengths and limitations of pre-trained language models.

Prior Work
Previous work used TSE to measure English subject–verb agreement ability in a BERT model. In this setup, BERT performs a fill-in-the-blank task (e.g., “the dog _ across the park”) by assigning probabilities to both the singular and plural forms of a given verb (e.g., “runs” and “run”). If the model has correctly learned to apply the subject–verb agreement rule, then it should consistently assign higher probabilities to the verb forms that make the sentences grammatically correct.

This previous work evaluated BERT using both natural sentences (drawn from Wikipedia) and nonce sentences, which are artificially constructed to be grammatically valid but semantically nonsensical, such as Noam Chomsky’s famous example “colorless green ideas sleep furiously”. Nonce sentences are useful when testing syntactic abilities because the model cannot just fall back on superficial corpus statistics: for example, while “dogs run” is much more common than “dogs runs”, “dogs publish” and “dogs publishes” will both be very rare, so a model is not likely to have simply memorized the fact that one of them is more likely than the other.

BERT achieves an accuracy of more than 80% on nonce sentences (far better than the random-chance baseline of 50%), which was taken as evidence that the model had learned to apply the subject–verb agreement rule. In our paper, we went beyond this previous work by pre-training BERT models under specific data conditions, allowing us to dig deeper into these results to see how certain patterns in the pre-training data affect performance.

Unseen Subject–Verb Pairs
We first looked at how well the model performs on subject–verb pairs that were seen during pre-training, versus examples in which the subject and verb were never seen together in the same sentence:

BERT’s error rate on natural and nonce evaluation sentences, stratified by whether a particular subject–verb (SV) pair was seen in the same sentence during training or not. BERT’s performance on unseen SV pairs is far better than simple heuristics such as picking the more frequent verb or picking the more frequent SV pair.

BERT’s error rate increases slightly for unseen subject–verb (SV) pairs, for both natural and nonce evaluation sentences, but it is still much better than naïve heuristics, such as picking the verb form that occurred more often in the pre-training data or picking the verb form that occurred more frequently with the subject noun. This tells us that BERT is not just reflecting back the things that it sees during pre-training: making decisions based on more than just raw frequencies and generalizing to novel subject–verb pairs are indications that the model has learned to apply some underlying rule concerning subject–verb agreement.

Frequency of Verbs
Next, we went beyond just seen versus unseen, and examined how the frequency of a word affects BERT’s ability to use it correctly with the subject–verb agreement rule. For this study, we chose a set of 60 verbs, and then created several versions of the pre-training data, each engineered to contain the 60 verbs at a specific frequency, ensuring that the singular and plural forms appeared the same number of times. We then trained BERT models from these different datasets and evaluated them on the subject–verb agreement task:

BERT’s ability to follow the subject–verb agreement rule depends on the frequency of verbs in the training set.

These results indicate that although BERT is able to model the subject–verb agreement rule, it needs to see a verb about 100 times before it can reliably use it with the rule.
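One way to realize this kind of frequency control is to subsample a corpus against per-form quotas. In the sketch below, each candidate sentence is assumed to come tagged with its target verb's lemma and number ('sg' or 'pl'); this data layout and the function itself are our own illustration, not the paper's pipeline:

```python
import random
from collections import Counter

def build_controlled_corpus(sentences, target_verbs, k, seed=0):
    """Subsample `sentences` so each lemma in `target_verbs` appears
    exactly k times, split evenly between its singular and plural forms.
    Each element of `sentences` is a (text, lemma, form) triple."""
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    quota = Counter()
    for lemma in target_verbs:
        quota[(lemma, "sg")] = k // 2
        quota[(lemma, "pl")] = k - k // 2
    corpus = []
    for text, lemma, form in pool:
        if lemma not in target_verbs:
            corpus.append(text)          # non-target sentences pass through
        elif quota[(lemma, form)] > 0:   # keep only until the quota is met
            quota[(lemma, form)] -= 1
            corpus.append(text)
    return corpus
```

Sweeping `k` (say, from 10 up to 1,000) then yields a family of pre-training sets that differ only in how often the target verbs appear.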

Relative Frequency Between Verb Forms
Finally, we wanted to understand how the relative frequencies of the singular and plural forms of a verb affect BERT’s predictions. For example, if one form of the verb (e.g., “combat”) appeared in the pre-training data much more frequently than the other (e.g., “combats”), then BERT might assign a high probability to the more frequent form even when it is grammatically incorrect. To evaluate this, we used the same 60 verbs, but this time created manipulated versions of the pre-training data in which the frequency ratio between the two verb forms varied from 1:1 to 100:1. The figure below shows BERT’s performance at these varying levels of frequency imbalance:

As the frequency ratio between verb forms in training data becomes more imbalanced, BERT’s ability to use those verbs grammatically decreases.

These results show that BERT achieves good accuracy at predicting the correct verb form when the two forms are seen the same number of times during pre-training, but the results become worse as the imbalance between the frequencies increases. This implies that even though BERT has learned how to apply subject–verb agreement, it does not necessarily use it as a “rule”, instead preferring to predict high-frequency words regardless of whether they violate the subject–verb agreement constraint.
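The ratio manipulation itself is easy to sketch: fix the total number of occurrences for each verb and split it between the two forms at the desired ratio. The function below is an illustration of that bookkeeping (which form is favored alternates across verbs here, an arbitrary choice for the example), not the paper's actual data-generation code:

```python
def imbalanced_counts(verbs, total, ratio):
    """Assign per-form occurrence counts so the favored form of each verb
    appears `ratio` times as often as the other, keeping the total fixed.
    `verbs` is a list of (singular_form, plural_form) pairs."""
    counts = {}
    rare = total // (ratio + 1)
    common = total - rare
    for i, (sg, pl) in enumerate(verbs):
        if i % 2 == 0:                    # alternate which form is favored
            counts[sg], counts[pl] = common, rare
        else:
            counts[sg], counts[pl] = rare, common
    return counts

print(imbalanced_counts([("combats", "combat")], total=101, ratio=100))
# {'combats': 100, 'combat': 1}
```

At ratio 1:1 the two forms get equal counts, recovering the balanced condition from the previous experiment; at 100:1 one form all but disappears, which is exactly where BERT's grammatical accuracy drops.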

Conclusions
Using TSE to evaluate BERT reveals its linguistic abilities on syntactic tasks. Moreover, studying its syntactic ability in relation to how often words appear in the training dataset reveals how BERT handles competing priorities: it knows that subjects and verbs should agree and that high-frequency words are more likely, but it does not understand that agreement is a rule that must be followed while frequency is only a preference. We hope this work provides new insight into how language models reflect properties of the datasets on which they are trained.

Acknowledgements
It was a privilege to collaborate with Tal Linzen and Ellie Pavlick on this project.

Read More

Announcing the recipients of Instagram research awards on safety and community health

At Meta, we conduct our own research and consult external experts and research to understand how we can improve our products and policies to better support our community. For the past several years, we’ve awarded external researchers around the world funding to study topics like well-being and polarization that may affect people’s experiences on our platforms.

In May 2021, Instagram launched a request for proposals (RFP) focused on research on safety and community health, especially as it relates to young people and underserved communities. Today, we’re announcing the winners of these awards.

“It’s important for us to collaborate with external researchers to identify ways we can improve Instagram’s policies and products. This round of research recipients is looking at critical issues like supporting trans people’s journeys with gender affirmation online and decreasing harassment among young people,” said Kristin Hendrix, Head of Instagram Research.

Specifically, we were interested in proposals for research that would help us (1) better understand equity and fairness issues in our community, (2) develop better policies, (3) assess possible improvements to protect our younger community, or (4) better understand the mechanisms (e.g., social support, social comparison) through which Instagram usage could impact the people that use our service.

Applications were initially reviewed by members of our internal research team with diverse subject matter expertise based on a broad range of criteria including anticipated impact, plan for conducting the research and more. Finalists were reviewed and winners were ultimately chosen by leaders from Meta’s research organization.

The RFP attracted more than 200 proposals from 172 universities and institutions around the world. The winners and their research proposals are listed below. The work all of the applicants are doing is critically important, and we’re grateful to everyone who applied. We look forward to sharing future RFPs.

Research award winners

Principal investigators are listed first unless otherwise noted.

Chatbots as Social Support Actors (CASSA)
Celeste Campos-Castillo, Linnea I. Laestadius (University of Wisconsin Milwaukee)

“We hope the inclusion of diverse voices from communities that have experienced structural racism will help promote equity in the design of technology-driven mental health interventions like this chatbot.” – Campos-Castillo and Laestadius

Enhancing trans people’s experiences of gender affirmation on Instagram
Denton Callander, Teddy Cook (University of New South Wales)

“Instagram and other social media are really important social spaces for many trans people, which is why we’re so excited to launch the #TransIsBeautiful project. This innovative social research will help us learn about how to maximize visual social media as a safe, healthy, and affirming space for trans people of all genders in Australia and around the world.” – Cook, #TransIsBeautiful Co-Lead Investigator, Manager of Trans Health Equity at ACON, and Vice President of the Australian Professional Association for Trans Health

Improving water safety at risky Instagram hotspots via targeted information
Amy Peden, Robert Brander, William Koon (University of New South Wales)

“Chasing the perfect selfie can result in temporary loss of concentration and self-awareness which can lead to injury and death. Drowning is the leading cause of selfie-related death, with many picturesque locations near to water or on rocky outcrops or clifftops. We are thrilled to be working with Instagram to develop, implement and evaluate the best way to get water safety information to people who are hashtagging or tagging themselves at known risky locations, starting with sites of concern across Australia and California in the United States. This research has the power to save lives and we’re so excited to get started.” – Peden

Mitigating cyberbullying experiences of younger users on Instagram
Jorge Goncalves, Louise La Sala, Senuri Wijenayake, Simon D’Alfonso (University of Melbourne)

“We aim to investigate the occurrence of cyberbullying from a novel socio-psychological perspective to further understand the rationale behind this behaviour. We hope that the outcomes of our work will extend Instagram’s recent endeavours to provide a safer environment for its young users by uncovering new approaches to mitigate bad experiences on the platform.” – Goncalves

Towards proactive moderation of coordinated harassment on Instagram
Gianluca Stringhini, Chen Ling (Boston University)

“Online harassment against social media users does not come out of the blue, but is often the result of coordination from hateful communities that pick their targets and coordinate hateful attacks against them. In this project we aim to understand what kind of Instagram content commonly receives coordinated harassment, with the goal of improving content moderation. Identifying content that is likely to receive harassment can help human moderators focus their efforts and improve the safety of Instagram users.” – Stringhini

Using IG to increase physical activity & support among BIPOC students
Olivia Johnson, Desmond Delk (University of Houston)

“From improving mental and physical health to enhancing our emotional well-being, the benefits of engaging in daily physical activity are limitless. However, many individuals do not meet the daily physical activity recommendations. As such, we are exploring the impact of IG social support communities in addressing physical fitness adherence of BIPOC college women.” – Johnson and Delk

Finalists

#LesbiansofInstagram: investigating Instagram’s role in queer women’s lives
Stefanie Duguay (Concordia University)

Cancer hoax victims on Instagram: learning from community responses
Lisbeth Klastrup (IT University of Copenhagen)

Effect of likes on diverse adolescent girls’ social comparison & body image
Jessica Faye Saunders, Asia Eaton (Clark University)

Imag(in)ing better bodies: investigating teens’ narratives of photo editing
Ysabel Gerrard, Ruth Holliday (University of Sheffield)

Instagram influence on teen food choice and equity in diverse communities
Darcy A. Freedman, Callie Ogland-Hand, Nora L. Nock (Case Western Reserve University)

Sharing family narratives including people with disabilities best practices
Renee Barnes, Gerard Goggin, Katie Ellis, Tama Leaver (University of the Sunshine Coast)

The post Announcing the recipients of Instagram research awards on safety and community health appeared first on Facebook Research.

Omniverse Creator Takes Viewers Down an Artistic Time Tunnel in OmniRacer Video

Movies like The Matrix and The Lord of the Rings inspired a lifelong journey in computer graphics for Piotr Skiejka, a senior visual effects artist at Ubisoft.

Born in Poland and based in Singapore, Skiejka turned his childhood passion of playing with motion design — as well as compositing, lighting and rendering effects — into a career.

He was recently named a winner of the #CreateYourRetroverse contest, in which creators using NVIDIA Omniverse shared scenes that flash back to when and where their love for graphics began.

Skiejka uses the Omniverse real-time collaboration and simulation platform for his 3D workflows when designing animation, film and video game scenes.

Over the past 13 years, he’s worked on visual effects for nearly two dozen titles, including Marvel’s Avengers and an episode of Game of Thrones, as well as several commercials and video games.

“I enjoy learning new workflows and expanding my creative knowledge every day, especially in such an evolving field,” he said.

A Lord of the Rende(rings)

Skiejka’s creative process begins with collecting references and creating a board full of pictures. Then, he blocks scenes in Omniverse, using simple shapes and meshes to fill the space before taking a first pass at lighting.

After experimenting with a scene’s different layers, Skiejka replaces his blocked prototypes with high-resolution assets — perfecting the lighting and tweaking material textures as final touches.

Watch a stunning example of his creative process in this video, which was recognized in the NVIDIA Omniverse #CreateYourRetroverse contest:

According to Skiejka, the main challenges he previously faced were long rendering times and slow software feedback when lighting and shading.

“Now, with NVIDIA RTX technology, render time is greatly decreased, and visual feedback occurs in real time,” he said. “The Omniverse Kit framework and the Omniverse Nucleus server are superb game-changers and perfect additions to my workflow.”

Skiejka’s favorite feature of the platform, however, is the Omniverse Create scene composition application. He said it’s “packed with valuable extensions, like PhysX and Flow,” which he used while designing the retroverse scene above.

“I hope I showed the spirit of childhood in my #CreateYourRetroverse video, and that this artistic time tunnel between the past and present will inspire others to showcase their experiences, too,” Skiejka said.

Learn more about NVIDIA Omniverse.

The post Omniverse Creator Takes Viewers Down an Artistic Time Tunnel in OmniRacer Video appeared first on The Official NVIDIA Blog.
