An Inferential Perspective on Federated Learning

An Inferential Perspective on Federated Learning

TL;DR: motivated to better understand the fundamental tradeoffs in federated learning, we present a probabilistic perspective that generalizes and improves upon federated optimization and enables a new class of efficient federated learning algorithms.

Thanks to deep learning, today we can train better machine learning models when given access to massive data. However, the standard, centralized training is impossible in many interesting use-cases—due to the associated data transfer and maintenance costs (most notably in video analytics), privacy concerns (e.g., in healthcare settings), or sensitivity of the proprietary data (e.g., in drug discovery). And yet, different parties that own even a small amount of data want to benefit from access to accurate models. This is where federated learning comes to the rescue!

Broadly, federated learning (FL) allows multiple data owners (or clients) to train shared models collaboratively under the orchestration of a central server without having to share any data. Typically, FL proceeds in multiple rounds of communication between the server and the clients: the clients compute model updates on their local data and send them to the server which aggregates and applies these updates to the shared model. While gaining popularity very quickly, FL is a relatively new subfield with many open questions and unresolved challenges.

Here is one interesting conundrum driving our work:

Client-server communication is often too slow and expensive. To speed up training (often x10-100) we can make clients spend more time at each round on local training (e.g., do more local SGD steps), thereby reducing the total number of communication rounds. However, because of client data heterogeneity (natural in practice), it turns out that increasing the amount of local computation per round results in convergence to inferior models!

This phenomenon is illustrated below in Figure 1 on a toy convex problem, where we see that more local steps lead the classical federated averaging (FedAvg) algorithm to converge to points that are much further away from the global optimum. But why does this happen?

Figure 1: A toy 2D setting with two clients and quadratic objectives that illustrates the convergence issues of FedAvg. Left: convergence trajectories in the parameter space. Right: convergence in terms of distance from the global optimum. Each drawing of the plot corresponds to a run of federated optimization from a different starting point in the parameter space. More local SGD steps per round speed up training, but the progress eventually stagnates at an inferior point further away from the global optimum.

In this post, we will present a probabilistic perspective on federated learning that will help us better understand this phenomenon and design new FL algorithms that can utilize local computation much more efficiently, converging faster, to better optima.

The classical approach: FL as a distributed optimization problem

Federated learning was originally introduced as a new setting for distributed optimization with a few distinctive properties such as a massive number of distributed nodes (or clients), slow and expensive communication, and unbalanced and non-IID data scattered across the nodes. The main goal of FL is to approximate centralized training (the gold-standard) and converge to the same optimum as the centralized optimization would have, at the fastest rate possible.

Mathematically, FL is formulated as minimization of a linear combination of local objectives, (f_i): $$min_{theta in mathbb{R}^d} left{F(theta) := sum_{i=1}^N q_i f_i(theta) right}$$ where the weights (q_i) are usually set proportional to the sizes (n_i) of the local datasets to make (F(theta)) match the centralized training objective. So, how can we solve this optimization problem within a minimal number of communication rounds?

The trick is simple: at each round (t), instead of asking clients to estimate and send gradients of their local objective functions (as done in conventional distributed optimization), let them optimize their objectives for multiple steps (or even epochs!) to obtain (theta^t_{i}) and send differences (or “deltas”) between the initial (theta^t) and updated states (theta^t_{i}) to the server as pseudo-gradients, which the server then averages, scales by a learning rate (alpha_t), and uses to update the model state: $$theta^{t+1} = theta^t + alpha_t sum_{i=1}^N q_i Delta_i^t, quad text{where} Delta_i^t := theta^t – theta_i^t$$ This approach, known as FedAvg or local SGD, allows clients to make more progress at each round. And since taking additional SGD steps locally is orders of magnitude faster than communicating with the server, the method converges much faster both in the number of rounds and in wall-clock time.

The problem (a.k.a. “client drift”): as we mentioned in the beginning, allowing multiple local SGD steps between client-server synchronization makes the algorithm converge to an inferior optimum in the non-IID setting (i.e., when clients have different data distributions) since the resulting pseudo-gradients turn out to be somehow biased compared to centralized training.

There are ways to overcome client drift using local regularization, carefully setting learning rate schedules, or using different control variate methods, but most of these mitigation strategies intentionally have to limit the optimization progress clients can make at each round.

Fundamentally, viewing FL as a distributed optimization problem runs into a tradeoff between the amount of local progress allowed and the quality of the final solution.

So, is there a way around this fundamental limitation?

An alternative approach: FL via posterior inference

Typically, client objectives (f_i(theta)) correspond to log-likelihoods of their local data. Therefore, statistically speaking, FL is solving a maximum likelihood estimation (MLE) problem. Instead of solving it using distributed optimization techniques, however, we can take a Bayesian approach: first, infer the posterior distribution, (P(theta mid D)), then identify its mode which will be the solution.

Why is posterior inference better than optimization? Because any posterior can be exactly decomposed into a product of sub-posteriors: $$P(theta mid D) propto prod_{i=1}^N P(theta mid D_i)$$

Thus, we are guaranteed to find the correct solution in three simple steps:

  1. Infer local sub-posteriors on each client and send their sufficient statistics to the server.
  2. Multiplicatively aggregate sub-posteriors on the server into the global posterior.
  3. Find and return the mode of the global posterior.

Wait, isn’t posterior inference intractable!? 😱

Indeed, there is a reason why posterior inference is not as popular as optimization: it is either intractable or often significantly more complex and computationally expensive. Moreover, posterior distributions rarely have closed form expressions and require various approximations.

For example, consider federated least squares regression, with quadratic local objectives: (f_i(theta) = frac{1}{2} |X_i^toptheta – y_i|^2.) In this case, the global posterior mode has a closed form expression: $$theta^star = left( sum_{i=1}^N q_i Sigma_i^{-1} right)^{-1} left( sum_{i=1}^N q_i Sigma_i^{-1} mu_i right)$$ where (mu_i) and (Sigma_i) are the means and covariances of the local posteriors. Even though in this simple case the posterior is Gaussian and inference is technically tractable, computing (theta^star) requires inverting multiple matrices and communicating local means and covariances from the clients to the server. In comparison to FedAvg, which requires only (O(d)) computation and (O(d)) communication per round, posterior inference seems like a very bad idea…

Approximate inference FTW! 😎

Turns out that we can compute approximately using an elegant distributed inference algorithm which we call federated posterior averaging (or FedPA):

  1. On the server, we can compute iteratively over multiple rounds: $$theta^{t+1} = theta^t – alpha_t sum_{i=1}^N q_i underbrace{Sigma_i^{-1}left( theta^t – mu_i right)}_{:= Delta_i^t}$$ where (alpha_t) is the server learning rate. This procedure avoids the outer matrix inverse and requires clients to send to the server only some delta vectors instead of full covariance matrices. Also, the summation can be substituted with a stochastic approximation, i.e., only a subset of clients must participate in each round. Note how similar it is to FedAvg!
  2. On the clients, we can compute (Delta_i^t := Sigma_i^{-1}left( theta^t – mu_i right)) very efficiently in two steps:
    1. Use stochastic gradient Markov chain Monte Carlo (SG-MCMC) to produce multiple approximate samples from the local posterior.
    2. Use an efficient dynamic programming procedure to compute the inverse covariance matrix multiplied by a vector in (O(d)) time and memory.

Note: in the case of arbitrary non-Gaussian likelihoods (which is the case for deep neural nets), FedPA essentially approximates the local and global posteriors with the best fitting Gaussians (a.k.a. the Laplace approximation).

What is the difference between FedAvg and FedPA? 🤔

FedPA has the same computation and communication complexity as FedAvg. In fact, the algorithms differ only in how the client updates (Delta_i^t) are computed. Since FedAvg computes (Delta_i^t approx theta^t – mu_i), we can also view it as an approximate posterior inference algorithm that estimates local covariances (Sigma_i) with identity matrices, which results in biased updates!

Figure 2: Bias and variance of the deltas computed by FedAvg and FedPA for 10-dimensional federated least squares. More local steps increase the bias of FedAvg; FedPA is able to utilize additional computation to reduce that bias.

Figure 2 illustrates the difference between FedAvg and FedPA in terms of the bias and variance of updates they compute at each round as functions of the number of SGD steps:

  • More local SGD steps increase the bias of FedAvg updates, leading the algorithm to converge to a point further away from the optimum.
  • FedPA uses local SGD steps to produce more posterior samples, which improves the estimates of the local means and covariances and reduces the bias of model updates.

Does FedPA actually work in practice? 🧐

The bias-variance tradeoff argument seems great in theory, but does it actually work in practice? First, let’s revisit our toy 2D example with 2 clients and quadratic objectives:

Figure 3: FedPA vs. FedAvg in our toy 2D setting with two clients and quadratic objectives.

We see that not only is FedPA as fast as FedAvg initially but it also converges to a point that is significantly closer to the global optimum. At the end of convergence, FedPA exhibits some oscillations that could be further eliminated by increasing the number of local posterior samples.

Next, let’s compare FedPA with FedAvg head-to-head on realistic and challenging benchmarks, such as the federated CIFAR100 and StackOverflow datasets:

Figure 4: CIFAR-100: Evaluation loss (left) and accuracy (right) for FedAvg and FedPA. Each algorithm used 20 clients per round and ran local SGD with momentum for 10 epochs (hence “-ME” suffixes, which stand for “multi-epoch”).
Figure 5: StackOverlfow LR: Evaluation loss (left) and macro-F1 (right) for FedAvg and FedPA. Each algorithm used 10 clients per round and ran local SGD with momentum for 5 epochs (hence “-ME” suffixes, which stand for “multi-epoch”).

For clients to be able to sample from local posteriors using SG-MCMC, their models have to be close enough to local optima in the parameter space. Therefore, we first “burn-in” FedPA for a few rounds by running it in the FedAvg regime (i.e., compute the deltas the same way as FedAvg). At some point, we switch to local SG-MCMC sampling. Figures 4 and 5 show the evaluation metrics over the course of training. We clearly see a significant jump in performance right at the point when the algorithm was essentially switched from FedAvg to FedPA.

Concluding thoughts & what’s next?

Viewing federated learning through the lens of probabilistic inference turned out to be fruitful. Not only were we able to reinterpret FedAvg as a biased approximate inference algorithm and explain the strange effect of multiple local SGD steps on its convergence, but this new perspective allowed us to design a new FL algorithm that blends together optimization with local MCMC-based posterior sampling and utilizes local computation efficiently.

We believe that FedPA is just the beginning of a new class of approaches to federated learning. One of the biggest advantages of the distributed optimization over posterior inference so far is a strong theoretical understanding of FedAvg’s convergence and its variations in different IID and non-IID settings, which was developed over the past few years by the optimization community. Convergence analysis of posterior inference in different federated settings is an important research avenue to pursue next.

While FedPA relies on a number of specific design choices we had to make (the Laplace approximation, MCMC-based local inference, the shrinkage covariance estimation, etc.), our inferential perspective connects FL to a rich toolbox of techniques from Bayesian machine learning literature: variational inference, expectation propagation, ensembling and Bayesian deep learning, privacy guarantees for posterior sampling, among others. Exploring application of these techniques in different FL settings may lead us to even more interesting discoveries!

Want to learn more?

ACKNOWLEDGEMENTS: Thanks to Jenny Gillenwater, Misha Khodak, Peter Kairouz, and Afshin Rostamizadeh for feedback on this blog post.

DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.

Read More

Representational Aspects of Depth and Conditioning in Normalizing Flows

Representational Aspects of Depth and Conditioning in Normalizing Flows

Top and Bottom Right: RealNVP [3] uses checkerboard and channel-wise partitioning schemes in order to factor out parameters and ensure that there aren’t redundant partitions from previous layers. GLOW [4] uses an invertible 1×1 convolution which allows the partitioned to be ‘learned’ by a linear layer. We show that arbitrary partitions can be simulated in a constant number of layers with a fixed partition, showing that these ideas increase representational power by at most a constant factor. Bottom Left: Random points are well-separated with high probability on a high-dimensional sphere, which allows us to construct a distribution that is challenging for flows.

The promise of unsupervised learning lies in its potential to take advantage of cheap and plentiful unlabeled data to learn useful representations or generate high-quality samples. For the latter task, neural network-based generative models have recently enjoyed a lot of success in producing realistic images and text. Two major paradigms in deep generative modeling are generative adversarial networks (GANs) and normalizing flows. When successfully scaled up and trained, both can generate high-quality and diverse samples from high-dimensional distributions. The training procedure for GANs involves min-max (saddle-point) optimization, which is considerably more difficult than standard loss minimization, leading to problems like mode dropping.

Samples from a GLOW [4] model trained on the CelebA Faces Dataset.

Normalizing flows [1] have been proposed as an alternative type of generative model which allows not only efficient sampling but also training via maximum likelihood through a closed-form computation of the likelihood function. They are written as pushforwards of a simple distribution (typically a Gaussian) through an invertible transformation (f), typically parametrized as a composition of simple invertible transformations. The main reason for this parametrization is the change-of-variables formula: if (z) is a random variable sampled from a known base distribution (P(z)) (typically a standard multivariate normal), (f: mathbb{R}^dto mathbb{R}^d) is invertible and differentiable, and (x = f^{-1}(z)) then $$p(x) = p(f(x))left|detleft(frac{partial f(x)}{partial x^T}right)right|.$$ Here, (frac{partial f(x)}{partial x^T}) is the Jacobian of (f).

Normalizing flows are trained by maximizing the likelihood using gradient descent. However, in practice, training normalizing flows runs into difficulties as well: models which produce good samples typically need to be extremely deep — which comes with accompanying vanishing/exploding gradient problems. A very related problem is that they are often poorly conditioned. Data like images are often inherently lower-dimensional than the ambient space, the map from low dimensional data to high dimensional latent variables can be difficult to invert and therefore train.

In our recent work [2], we tackle representational questions around depth and conditioning of normalizing flows—first for general invertible architectures, then for a particular common architecture—where the normalizing flow is a composition of so-called affine couplings.

Depth Bound on General Invertible Architectures

The most fundamental restriction of the normalizing flow paradigm is that each layer needs to be invertible. We ask whether this restriction has any ‘cost’ in terms of the size, and in particular the depth, of the model. Here we’re counting depth in terms of the number of the invertible transformations that make up the flow. A requirement for large depth would explain training difficulties due to exploding (or vanishing) gradients.

Since the Jacobian of a composition of functions is the product of the Jacobians of the functions being composed, the min (max) singular value of the Jacobian of the composition is the product of the min (max) singular value of the Jacobians of the functions. This implies that the smallest (largest) singular value of the Jacobian will get exponentially smaller (larger) with the number of compositions.

A natural way of formalizing this question is by exhibiting a distribution which is easy to model for an unconstrained generator network but hard for a shallow normalizing flow. Precisely, we ask: is there a probability distribution that can be represented by a shallow generator with a small number of parameters that could not be approximately represented by a shallow composition of invertible transformations?

We demonstrate that such a distribution exists. Specifically, we show that

Theorem: For every k, s.t. (k = o(exp(d))) and any parameterized family of compositions of Lipschitz invertible transformations with (p) parameters per transformation and at most (O(k / p)) transformations, there exists a generator (g:mathbb{R}^{d+1} to mathbb{R}) with depth (O(1)) and (O(k)) parameters s.t the pushforward of a Gaussian through (g) cannot be approximated in either KL or Wasserstein-1 distance by a network in this family.

The result above is extremely general: it only requires a bound on the number of parameters per transformation in the parametrization of the normalizing flow and Lipschitzness of these maps. As such it easily includes common choices used in practice like affine couplings with at most (p) parameters per layer or invertible feedforward networks, where each intermediate layer is of dimension (d) and the nonlinearity is invertible (e.g. leaky ReLU). On the flip side, for possible architectures with a large number of parameters per transformation, this theorem gives a (possibly loose) lower bound of a small number of transformations.

Proof Sketch: The generator for our construction approximates a mixture of (k) Gaussians with means placed uniformly randomly on a (d)-dimensional sphere in the ambient space. We will use the probabilistic method to show there is a family of such mixtures, s.t. each pair of members in this family are far apart (say, in Wasserstein distance). Furthermore, by an epsilon net discretization argument we can count how many “essentially” distinct invertible Lipschitz networks there are. If the number of mixtures in the family is much larger than the size of the epsilon net, at least one mixture must be far from all invertible networks.

We choose well-separated modes on a hypersphere in order to generate a large family of well separated mixtures of Gaussians.

The family of mixtures is constructed by choosing the (k) means for the components uniformly at random on a sphere. It’s well known that (exp(o(d))) randomly chosen points on a unit sphere will, with high probability, have constant pairwise distance. Similarly, coding-theoretic arguments (used to prove the so-called Gilbert-Varshamov bound) can be used to show that selecting (exp(o(d))) (k)-tuples of those means will, with high probability, ensure that each pair of (k)-tuples is such that the average pair of means is at constant distance. This suffices to ensure the Wasserstein distance between pairs of mixtures is large. ∎

Results for Affine Couplings

Affine Couplings [3] are one of the most common transformations in scalable architectures for normalizing flows. An affine coupling is a map (f: mathbb{R}^dto mathbb{R}^d) such that for some partition into a set containing approximately half of the coordinates (S) and it’s complement, $$f(x_S, x_{[d]setminus S}) = (x_S, x_{[d]setminus S} odot s(x_s) + t(x_s))$$ for some scaling and translation functions (typically parameterized by neural networks) (s) and (t). Clearly, an affine coupling block only transforms one partition of the coordinates at a time by an affine function while leaving the other partition intact. It’s easy to see that an affine coupling is invertible if each coordinate of (s) is invertible. Moreover, the Jacobian of this function is $$begin{bmatrix}I & 0\frac{partial t}{partial x_S^T} & text{diag}(s(x_S))end{bmatrix}$$ In particular it’s lower triangular, so we can calculate the determinant in linear time by multiplying the (d) diagonal elements (in general determinants take (O(d^3)) time to compute). This allows us to efficiently compute likelihoods and their gradients for SGD on large models via the change of variables formula.
These affine coupling blocks are stacked, often while changing the part of the partition that is updated or more generally, permuting the elements in between the application of the coupling.

Affine couplings consist of nonlinear affine transformations of half of the data dimensions at a time which end in a normal distribution. We show that the choice of which dimensions are in which half can be simulated in a constant number of couplings. Source

Effect of the Choice of Partition on Representational Power

Partitions in Real NVP. Source

The choice of partition is often somewhat ad-hoc and involves domain knowledge (e.g. for image datasets, a typical choice is a checkerboard pattern or a channel-wise partition). In fact, some recent approaches like GLOW [4] try to “learn” permutations to apply between each pair of affine couplings. (Technically, since a permutation is a discrete object, in [4] the authors learn a 1×1 convolutions instead.)

While ablation experiments provide definite evidence that including learned 1×1 convolutions is beneficial for modeling image data in practice, it’s unclear whether this effect is from increased modeling power or algorithmic effects — and even less so how to formally quantify it. In this section, we come to a clear understanding of the representational value of adding this flexibility in partitioning. We knew from GLOW that adding these partitions helped. Now we know why!

We formalize the representational question as follows: how many affine couplings with a fixed partition are needed to simulate an arbitrary linear map? Since a linear map is more general than a 1×1 convolution, if it’s possible to do so with a small (say constant) number of affine couplings, we can simulate any affine coupling-based normalizing flow including 1×1 convolutions by one that does not include them which is merely a constant factor larger.

Concretely, we consider linear functions of the form $$T = prod_{i=1}^Kbegin{bmatrix}I & 0\A_i & B_iend{bmatrix} begin{bmatrix}C_i & D_i\0 & Iend{bmatrix},$$ for matrices (A_i, D_i in mathbb{R}^{dtimes d}) and diagonal matrices (B_i, C_i in mathbb{R}^{dtimes d}). The right hand side is precisely a composition of affine coupling blocks with linear maps (s, t) with a fixed partition (the parts of the input that are being updated alternate). We show the following result:

Theorem: To represent an arbitrary invertible (T), (K) must be at most 24. Additionally, there exist invertible matrices (T) such that (K geq 3).

Proof sketch: The statement hopefully reminds the reader of the standard LU decomposition — the twist of course being that the matrices on the right-hand side have a more constrained structure than merely being triangular. Our proof starts with the existence of a (LUP) decomposition for every matrix.

We first show that we can construct an arbitrary permutation (up to sign) using at most 21 alternating matrices of the desired form. The argument is group theoretic: we use the fact that a permutation decomposes into a composition of two permutations of order 2, which must be disjoint products of swaps and show that swapping elements can be implemented “in parallel” using several partitioned matrices of the type we’re considering.

Next, we show that we can produce an arbitrary triangular matrix with our partitioned matrices. We use similar techniques as above to reduce the matrix to a regular system of block linear equations which we can then solve. Our upper bound comes from just counting the total number of matrices required for these operations: the 21 for the permutation and 13 for each triangular matrix (upper and lower), giving a total of 47 required matrices. ∎

To reiterate the takeaway: a GLOW-style linear layer in between affine couplings could in theory make your network between 5 and 47 times smaller while representing the same function. We now have a precise understanding of the value of that architectural choice!

We also verified empirically in the figure below how well these linear models would fit randomly chosen (i.e. with iid Gaussian entries) linear functions. It seems empirically that at least for this ensemble our upper bound is loose and we can fit the functions well without using the full 47 layers. Closing this gap is an interesting problem for future work.

We regress affine couplings with linear scaling and translation functions at a variety of depths on linear functions determined by random matrices. It seems like we can fit these functions arbitrarily well with 4-16 layers, suggesting that at least in random cases the true number of layers required is closer to our lower bound.

Universal Approximation with Poorly Conditioned Networks

In our earlier result on the depth of invertible networks, we assumed that our network was Lipschitz and therefore well-conditioned. A natural question is then, if we remove this requirement, how powerful is the resulting class of models? In particular, we ask: are poorly conditioned affine coupling-based normalizing flows universal approximators as they are used in practice?

Curiously, this question has in fact not been answered in prior work. In a very recent work [5], it was shown that if we allow for padding of our data with extra dimensions that take a constant value 0, affine couplings are universal approximators. (Note, this kind of padding clearly results in a singular Jacobian — as the value in the added dimensions is constant.) The idea for why padding helps is that these extra dimensions are used as a “scratch pad” for the computation the network is performing. Another recent work [6] gives a proof of universal approximation for affine couplings assuming arbitrary permutations in between the layers are allowed (ala Glow) and a partition separating (d -1) dimensions from the other. However, in practice, these models are trained using a roughly half-half split and often without linear layers in between couplings (which already works quite well). We prove that none of these architectural modifications to affine couplings are necessary for universal approximation and additionally suggest a trade-off between the conditioning of the model and the quality of its approximation. Concretely, we show:

Theorem: For any bounded and absolutely continuous distribution (Q) over (mathbb{R}^n) and any (epsilon > 0), there exists a 3-layer affine coupling (g) with maps (s, t) represented by feedforward ReLU networks such that (W_2(g_# P, Q) leq epsilon), where (g_# P) is the pushforward of a standard Gaussian through (g).

We note that the construction for the theorem trades off quality of approximation ((epsilon)) with conditioning: the smallest singular value of the Jacobian in the construction for our theorem above will scale like (1/epsilon) — thus suggesting that if we want to use affine couplings as universal approximators, conditioning may be an issue even if we don’t pad with a constant value for the added dimensions like prior works — which obviously results in a singular Jacobian.

Proof sketch: The proof is based on two main ideas.

The first is a deep result from optimal transport, Brenier’s theorem, which for sufficiently “regular” distributions (p) over (mathbb{R}^d) guarantees an invertible map (phi), s.t. the pushforward of the Gaussian through (phi) equals (p). This reduces our problem to approximating (phi) using a sequence of affine couplings.

The difficulty in approximating (phi) is the fact that affine couplings are only allowed to change one part of the input, and in a constrained way. The trick we use to do this without a “scratchpad” to store intermediate computation as in prior works is to instead hide information in the “low order bits” of the other partition. For details, refer to our paper. ∎

Finally, on the experimental front, we wanted to experiment with how padding affects the conditioning of a learned model. We considered synthetic 2d datasets (see figure below) and found that padding with zeros resulted in a very poorly conditioned model which produced poor samples, as might be expected. We also considered a type of padding which is reasonable but for which we have no theory — namely, to use iid Gaussian samples as values for the added dimensions (in this case, the resulting Jacobians are not prima facie singular, and the model can still use them as a “scratch pad”). While we have no result that this in any formal sense can result in better-conditioned networks, we found that in practice it frequently does and it also results in better samples. This seems like a very fruitful direction for future research. Finally, without padding, the model produces samples of middling quality and has a condition number in between that of zero and Gaussian padding.

In both examples, Gaussian padding of the data gives a sharper distribution and a better-conditioned model.


Normalizing flows are one of the most popular generative models across various domains, though we still have a relatively narrow understanding of their relative pros and cons compared to other models. We show in this work that there are fundamental tradeoffs between depth and conditioning and representational power of this type of function. Though we have cleared up considerably the representational aspects of these models, the algorithmic and statistical questions are still wide open. We hope that this work guides both users of flows and theoreticians as to the fine-grained properties of flows as compared to other generative models.


[1] Rezende and Mohamed, 2015, Variational Inference with Normalizing Flows, ICML 2015

[2] Koehler, Mehta, and Risteski, 2020, Representational aspects of depth and conditioning in normalizing flows, Under Submission.

[3] Dinh, Sohl-Dickstein, and S. Bengio, 2016, Density estimation using Real NVP, ICLR 2016

[4] Kingma and Dhariwal, 2018, GLOW: Generative flow with 1×1 convolutions, NeurIPS 2018

[5] Huang, Dinh, and Courville, 2020, Augmented Normalizing Flows: Bridging the Gap Between Generative Flows and Latent Variable Models

[6] Teshima, Ishikawa, Tojo, Oono, Ikeda, and Sugiyama, 2020, Coupling-based Invertible Neural Networks Are Universal Diffeomorphism Approximators, NeurIPS 2020

Read More

Carnegie Mellon University at NeurIPS 2020

Carnegie Mellon University at NeurIPS 2020

Carnegie Mellon University is proud to present 88 papers at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), which will be held virtually this week. Our faculty and researchers are also giving invited talks at 7 workshops and are involved in organizing 14 workshops at the conference.

Here is a quick overview of the areas our researchers are working on:

We are also proud to collaborate with many other researchers in academia and industry:


Reinforcement Learning

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
Gen Li (Tsinghua University) · Yuting Wei (Carnegie Mellon University) · Yuejie Chi (CMU) · Yuantao Gu (Tsinghua University) · Yuxin Chen (Princeton University)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #82

Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
Ruosong Wang (Carnegie Mellon University) · Russ Salakhutdinov (Carnegie Mellon University) · Lin Yang (UCLA)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #167

Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
Fei Feng (University of California, Los Angeles) · Ruosong Wang (Carnegie Mellon University) · Wotao Yin (Alibaba US, DAMO Academy) · Simon Du (Institute for Advanced Study) · Lin Yang (UCLA)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #169

Object Goal Navigation using Goal-Oriented Semantic Exploration [code] [video]Devendra Singh Chaplot (Carnegie Mellon University) · Dhiraj Prakashchand Gandhi (Carnegie Mellon University) · Abhinav Gupta (Facebook AI Research/CMU) · Russ Salakhutdinov (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #358

Sparse Graphical Memory for Robust Planning [code]Scott Emmons (UC Berkeley) · Ajay Jain (UC Berkeley) · Misha Laskin (UC Berkeley) · Thanard Kurutach (University of California Berkeley) · Pieter Abbeel (UC Berkeley & · Deepak Pathak (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #419

Task-Agnostic Online Reinforcement Learning with an Infinite Mixture of Gaussian Processes
Mengdi Xu (Carnegie Mellon University) · Wenhao Ding (Carnegie Mellon University) · Jiacheng Zhu (Carnegie Mellon University) · ZUXIN LIU (Carnegie Mellon University) · Baiming Chen (Tsinghua University) · Ding Zhao (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #420

On Reward-Free Reinforcement Learning with Linear Function Approximation
Ruosong Wang (Carnegie Mellon University) · Simon Du (Institute for Advanced Study) · Lin Yang (UCLA) · Russ Salakhutdinov (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #499

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement [code]Ben Eysenbach (Carnegie Mellon University) · Xinyang Geng (UC Berkeley) · Sergey Levine (UC Berkeley) · Russ Salakhutdinov (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #594

Planning with General Objective Functions: Going Beyond Total Rewards
Ruosong Wang (Carnegie Mellon University) · Peilin Zhong (Columbia University) · Simon Du (Institute for Advanced Study) · Russ Salakhutdinov (Carnegie Mellon University) · Lin Yang (UCLA)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #600

Preference-based Reinforcement Learning with Finite-Time Guarantees
Yichong Xu (Carnegie Mellon University) · Ruosong Wang (Carnegie Mellon University) · Lin Yang (UCLA) · Aarti Singh (CMU) · Artur Dubrawski (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #601

Is Long Horizon RL More Difficult Than Short Horizon RL?
Ruosong Wang (Carnegie Mellon University) · Simon Du (Institute for Advanced Study) · Lin Yang (UCLA) · Sham Kakade (University of Washington & Microsoft Research)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #602

Neural Dynamic Policies for End-to-End Sensorimotor Learning
Shikhar Bahl (Carnegie Mellon University) · Mustafa Mukadam (Facebook AI Research) · Abhinav Gupta (Facebook AI Research/CMU) · Deepak Pathak (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1371

Weakly-Supervised Reinforcement Learning for Controllable Behavior
Lisa Lee (CMU / Google Brain / Stanford) · Ben Eysenbach (Carnegie Mellon University) · Russ Salakhutdinov (Carnegie Mellon University) · Shixiang (Shane) Gu (Google Brain) · Chelsea Finn (Stanford)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1832

Estimation & Inference

Robust Density Estimation under Besov IPM Losses
Ananya Uppal (Carnegie Mellon University) · Shashank Singh (Google) · Barnabas Poczos (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #429

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement [code]Ben Eysenbach (Carnegie Mellon University) · Xinyang Geng (UC Berkeley) · Sergey Levine (UC Berkeley) · Russ Salakhutdinov (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #594

Domain Adaptation as a Problem of Inference on Graphical Models
Kun Zhang (CMU) · Mingming Gong (University of Melbourne) · Petar Stojanov (Carnegie Mellon Univerisity) · Biwei Huang (Carnegie Mellon University) · Qingsong Liu (Unisound Intelligence Co., Ltd.) · Clark Glymour (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #698

Efficient semidefinite-programming-based inference for binary and multi-class MRFs [code]Chirag Pabbaraju (Carnegie Mellon University) · Po-Wei Wang (CMU) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #851

Randomized tests for high-dimensional regression: A more efficient and powerful solution
Yue Li (Carnegie Mellon University) · Ilmun Kim (CMU) · Yuting Wei (Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #999

Distribution-free binary classification: prediction sets, confidence intervals and calibration
Chirag Gupta (Carnegie Mellon University) · Aleksandr Podkopaev (Carnegie Mellon University) · Aaditya Ramdas (CMU)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1537

Deep Learning

Neural Methods for Point-wise Dependency Estimation [code]Yao-Hung Hubert Tsai (Carnegie Mellon University) · Han Zhao (Carnegie Mellon University) · Makoto Yamada (Kyoto University/RIKEN AIP) · Louis-Philippe Morency (Carnegie Mellon University) · Russ Salakhutdinov (Carnegie Mellon University)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #15

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [code]Zihang Dai (Carnegie Mellon University) · Guokun Lai (Carnegie Mellon University) · Yiming Yang (CMU) · Quoc V Le (Google)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #64

Big Bird: Transformers for Longer Sequences [unofficial video]Manzil Zaheer (Google) · Guru Guruganesh (Google Research) · Kumar Avinava Dubey (Carnegie Mellon University) · Joshua Ainslie (Google) · Chris Alberti (Google) · Santiago Ontanon (Google LLC) · Philip Pham (Google) · Anirudh Ravula (Google) · Qifan Wang (Google Research) · Li Yang (Google) · Amr Ahmed (Google Research)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #65

Deep Transformers with Latent Depth [code]Xian Li (Facebook) · Asa Cooper Stickland (University of Edinburgh) · Yuqing Tang (Facebook AI) · Xiang Kong (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #287

Multiscale Deep Equilibrium Models [code]Shaojie Bai (Carnegie Mellon University) · Vladlen Koltun (Intel Labs) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #320

Monotone operator equilibrium networks [code]Ezra Winston (Carnegie Mellon University) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #323

Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs [code]Jiong Zhu (University of Michigan) · Yujun Yan (University of Michigan) · Lingxiao Zhao (Carnegie Mellon University) · Mark Heimann (University of Michigan) · Leman Akoglu (CMU) · Danai Koutra (U Michigan)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #374

On Completeness-aware Concept-Based Explanations in Deep Neural Networks
Chih-Kuan Yeh (Carnegie Mellon University) · Been Kim (Google) · Sercan Arik (Google) · Chun-Liang Li (Google) · Tomas Pfister (Google) · Pradeep Ravikumar (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #640

A Causal View on Robustness of Neural Networks
Cheng Zhang (Microsoft Research, Cambridge, UK) · Kun Zhang (CMU) · Yingzhen Li (Microsoft Research Cambridge)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #805

Improving GAN Training with Probability Ratio Clipping and Sample Reweighting
Yue Wu (Carnegie Mellon University) · Pan Zhou (National University of Singapore) · Andrew Wilson (New York University) · Eric Xing (Petuum Inc. / Carnegie Mellon University) · Zhiting Hu (Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #945

AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning
Hao Zhang (Carnegie Mellon University, Petuum Inc.) · Yuan Li (Duke University) · Zhijie Deng (Tsinghua University) · Xiaodan Liang (Sun Yat-sen University) · Lawrence Carin (Duke University) · Eric Xing (Petuum Inc. / Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1037

Deep Archimedean Copulas
Chun Kai Ling (Carnegie Mellon University) · Fei Fang (Carnegie Mellon University) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1754

A Study on Encodings for Neural Architecture Search [code]Colin White (Abacus.AI) · Willie Neiswanger (Carnegie Mellon University) · Sam Nolen (RealityEngines.AI) · Yash Savani (RealityEngines.AI)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1777

Is normalization indispensable for training deep neural network?
Jie Shao (Fudan University) · Kai Hu (Carnegie Mellon University) · Changhu Wang (ByteDance.Inc) · Xiangyang Xue (Fudan University) · Bhiksha Raj (Carnegie Mellon University)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1887

Algorithms & Optimization

Latent Dynamic Factor Analysis of High-Dimensional Neural Recordings [code]Heejong Bong (Carnegie Mellon University) · Zongge Liu (Carnegie Mellon University) · Zhao Ren (University of Pittsburgh) · Matthew Smith (Carnegie Mellon University) · Valerie Ventura (Carnegie Mellon University) · Kass E Robert (CMU)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #32

Neutralizing Self-Selection Bias in Sampling for Sortition
Bailey Flanigan (Carnegie Mellon University) · Paul Goelz (Carnegie Mellon University) · Anupam Gupta (Carnegie Mellon University) · Ariel Procaccia (Harvard University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #710

Efficient semidefinite-programming-based inference for binary and multi-class MRFs [code]Chirag Pabbaraju (Carnegie Mellon University) · Po-Wei Wang (CMU) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #851

Linear Dynamical Systems as a Core Computational Primitive
Shiva Kaul (Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1083

Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms
Xiangyi Chen (University of Minnesota) · Tiancong Chen (University of Minnesota) · Haoran Sun (University of Minnesota) · Steven Wu (Carnegie Mellon University) · Mingyi Hong (University of Minnesota)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1144

WOR and p’s: Sketches for lp-Sampling Without Replacement
Edith Cohen (Google) · Rasmus Pagh (University of Copenhagen) · David Woodruff (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1435

Confidence sequences for sampling without replacement
Ian Waudby-Smith (Carnegie Mellon University) · Aaditya Ramdas (CMU)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1445

PLLay: Efficient Topological Layer based on Persistent Landscapes
Kwangho Kim (Carnegie Mellon University) · Jisu Kim (Inria Saclay) · Manzil Zaheer (Google) · Joon Kim (Carnegie Mellon University) · Frederic Chazal (INRIA) · Larry Wasserman (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1582

Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization
Jianyu Wang (Carnegie Mellon University) · Qinghua Liu (Princeton University) · Hao Liang (Carnegie Mellon University) · Gauri Joshi (Carnegie Mellon University) · H. Vincent Poor (Princeton University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1636

Transferable Graph Optimizers for ML Compilers
Yanqi Zhou (Google Brain) · Sudip Roy (Google) · Amirali Abdolrashidi (UC Riverside) · Daniel Wong (Carnegie Mellon University) · Peter Ma (Google) · Qiumin Xu (Google) · Hanxiao Liu (Google Brain) · Phitchaya Phothilimtha (Google Brain) · Shen Wang (Google Inc) · Anna Goldie (Google Brain / Stanford) · Azalia Mirhoseini (Google Brain) · James Laudon (Google)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1781

Community detection using fast low-cardinality semidefinite programming
Po-Wei Wang (CMU) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1803

Learning Theory

Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
Gen Li (Tsinghua University) · Yuting Wei (Carnegie Mellon University) · Yuejie Chi (CMU) · Yuantao Gu (Tsinghua University) · Yuxin Chen (Princeton University)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #160

Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension
Ruosong Wang (Carnegie Mellon University) · Russ Salakhutdinov (Carnegie Mellon University) · Lin Yang (UCLA)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #167

Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
Fei Feng (University of California, Los Angeles) · Ruosong Wang (Carnegie Mellon University) · Wotao Yin (Alibaba US, DAMO Academy) · Simon Du (Institute for Advanced Study) · Lin Yang (UCLA)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #169

Agnostic Q-learning with Function Approximation in Deterministic Systems: Near-Optimal Bounds on Approximation Error and Sample Complexity
Simon Du (Institute for Advanced Study) · Jason Lee (Princeton University) · Gaurav Mahajan (University of California, San Diego) · Ruosong Wang (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #226

Generalized Boosting
Arun Suggala (Carnegie Mellon University) · Bingbin Liu (Carnegie Mellon University) · Pradeep Ravikumar (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #364

PAC-Bayes Learning Bounds for Sample-Dependent Priors
Pranjal Awasthi (Google/Rutgers University) · Satyen Kale (Google) · Stefani Karp (Google/CMU) · Mehryar Mohri (Google Research & Courant Institute of Mathematical Sciences)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #436

Revisiting the Sample Complexity of Sparse Spectrum Approximation of Gaussian Processes
Minh Hoang (Carnegie Mellon University) · Nghia Hoang (Amazon) · Hai Pham (Carnegie Mellon University) · David Woodruff (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #666

Follow the Perturbed Leader: Optimism and Fast Parallel Algorithms for Smooth Minimax Games
Arun Suggala (Carnegie Mellon University) · Praneeth Netrapalli (Microsoft Research)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1021

On Learning Ising Models under Huber’s Contamination Model
Adarsh Prasad (Carnegie Mellon University) · Vishwak Srinivasan (Carnegie Mellon University) · Sivaraman Balakrishnan (Carnegie Mellon University) · Pradeep Ravikumar (Carnegie Mellon University)
Wed Dec 09 09:00 PM — 11:00 PM (PST) @ Poster Session 4 #1186

Axioms for Learning from Pairwise Comparisons
Ritesh Noothigattu (Carnegie Mellon University) · Dominik Peters (Carnegie Mellon University) · Ariel Procaccia (Harvard University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1447

A Unified View of Label Shift Estimation
Saurabh Garg (CMU) · Yifan Wu (Carnegie Mellon University) · Sivaraman Balakrishnan (CMU) · Zachary Lipton (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1535

Weak Supervision

Unsupervised Data Augmentation for Consistency Training [code]Qizhe Xie (CMU, Google Brain) · Zihang Dai (Carnegie Mellon University) · Eduard Hovy (CMU) · Thang Luong (Google Brain) · Quoc V Le (Google)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #21

Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning
Fei Feng (University of California, Los Angeles) · Ruosong Wang (Carnegie Mellon University) · Wotao Yin (Alibaba US, DAMO Academy) · Simon Du (Institute for Advanced Study) · Lin Yang (UCLA)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #169

Model-based Policy Optimization with Unsupervised Model Adaptation
Jian Shen (Shanghai Jiao Tong University) · Han Zhao (Carnegie Mellon University) · Weinan Zhang (Shanghai Jiao Tong University) · Yong Yu (Shanghai Jiao Tong Unviersity)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #547

Modeling Task Effects on Meaning Representation in the Brain via Zero-Shot MEG Prediction
Mariya Toneva (Carnegie Mellon University) · Otilia Stretcu (Carnegie Mellon University) · Barnabas Poczos (Carnegie Mellon University) · Leila Wehbe (Carnegie Mellon University) · Tom Mitchell (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1687

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
Senthil Purushwalkam Shiva Prakash (Carnegie Mellon University) · Abhinav Gupta (Facebook AI Research/CMU)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1696

Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection
Zeyi Huang (carnegie mellon university) · Yang Zou (Carnegie Mellon University) · B. V. K. Vijaya Kumar (CMU, USA) · Dong Huang (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1704

Weakly-Supervised Reinforcement Learning for Controllable Behavior
Lisa Lee (CMU / Google Brain / Stanford) · Ben Eysenbach (Carnegie Mellon University) · Russ Salakhutdinov (Carnegie Mellon University) · Shixiang (Shane) Gu (Google Brain) · Chelsea Finn (Stanford)
Thu Dec 10 09:00 PM — 11:00 PM (PST) @ Poster Session 6 #1832

Computational Linguistics

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [code]Zihang Dai (Carnegie Mellon University) · Guokun Lai (Carnegie Mellon University) · Yiming Yang (CMU) · Quoc V Le (Google)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #64

Big Bird: Transformers for Longer Sequences [unofficial video]Manzil Zaheer (Google) · Guru Guruganesh (Google Research) · Kumar Avinava Dubey (Carnegie Mellon University) · Joshua Ainslie (Google) · Chris Alberti (Google) · Santiago Ontanon (Google LLC) · Philip Pham (Google) · Anirudh Ravula (Google) · Qifan Wang (Google Research) · Li Yang (Google) · Amr Ahmed (Google Research)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #65

Learning Sparse Prototypes for Text Generation
Junxian He (Carnegie Mellon University) · Taylor Berg-Kirkpatrick (University of California San Diego) · Graham Neubig (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #184

Deep Transformers with Latent Depth [code]Xian Li (Facebook) · Asa Cooper Stickland (University of Edinburgh) · Yuqing Tang (Facebook AI) · Xiang Kong (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #287

Computer Vision

Swapping Autoencoder for Deep Image Manipulation [website] [unofficial code] [video]Taesung Park (UC Berkeley) · Jun-Yan Zhu (Adobe, CMU) · Oliver Wang (Adobe Research) · Jingwan Lu (Adobe Research) · Eli Shechtman (Adobe Research, US) · Alexei Efros (UC Berkeley) · Richard Zhang (Adobe)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #105

Residual Force Control for Agile Human Behavior Imitation and Extended Motion Synthesis [website] [code] [video]Ye Yuan (Carnegie Mellon University) · Kris Kitani (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #354

See, Hear, Explore: Curiosity via Audio-Visual Association [code] [video]Victoria Dean (Carnegie Mellon University) · Shubham Tulsiani (Facebook AI Research) · Abhinav Gupta (Facebook AI Research/CMU)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #355

SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images [code]Chen-Hsuan Lin (Carnegie Mellon University) · Chaoyang Wang (Carnegie Mellon University) · Simon Lucey (CMU)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #480

Measuring Robustness to Natural Distribution Shifts in Image Classification [code]Rohan Taori (Stanford University) · Achal Dave (Carnegie Mellon University) · Vaishaal Shankar (UC Berkeley) · Nicholas Carlini (Google) · Benjamin Recht (UC Berkeley) · Ludwig Schmidt (UC Berkeley)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #679

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation [code]Guoliang Kang (Carnegie Mellon University) · Yunchao Wei (UTS) · Yi Yang (UTS) · Yueting Zhuang (Zhejiang University) · Alexander Hauptmann (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #693

Group Contextual Encoding for 3D Point Clouds [code]Xu Liu (The University of Tokyo) · Chengtao Li (MIT) · Jian Wang (Carnegie Mellon University) · Jingbo Wang (Peking University) · Boxin Shi (Peking University) · Xiaodong He (JD AI research)
Wed Dec 09 09:00 PM — 11:00 PM (PST) @ Poster Session 4 #1151

Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection
Zeyi Huang (carnegie mellon university) · Yang Zou (Carnegie Mellon University) · B. V. K. Vijaya Kumar (CMU, USA) · Dong Huang (Carnegie Mellon University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1704

Graphical Models

Domain Adaptation as a Problem of Inference on Graphical Models
Kun Zhang (CMU) · Mingming Gong (University of Melbourne) · Petar Stojanov (Carnegie Mellon Univerisity) · Biwei Huang (Carnegie Mellon University) · Qingsong Liu (Unisound Intelligence Co., Ltd.) · Clark Glymour (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #698

Generalized Independent Noise Condition for Estimating Latent Variable Causal Graphs
Feng Xie (Peking University) · Ruichu Cai (Guangdong University of Technology) · Biwei Huang (Carnegie Mellon University) · Clark Glymour (Carnegie Mellon University) · Zhifeng Hao (Guangdong University of Technology) · Kun Zhang (CMU)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #887

On the Role of Sparsity and DAG Constraints for Learning Linear DAGs
Ignavier Ng (University of Toronto) · AmirEmad Ghassami (Johns Hopkins University) · Kun Zhang (CMU)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1665

Transfer Learning

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation [code]Guoliang Kang (Carnegie Mellon University) · Yunchao Wei (UTS) · Yi Yang (UTS) · Yueting Zhuang (Zhejiang University) · Alexander Hauptmann (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #693

Domain Adaptation as a Problem of Inference on Graphical Models [code]Kun Zhang (CMU) · Mingming Gong (University of Melbourne) · Petar Stojanov (Carnegie Mellon Univerisity) · Biwei Huang (Carnegie Mellon University) · Qingsong Liu (Unisound Intelligence Co., Ltd.) · Clark Glymour (Carnegie Mellon University)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #698

Look-ahead Meta Learning for Continual Learning [code]Gunshi Gupta (University of montreal) · Karmesh Yadav (Carnegie Mellon) · Liam Paull (Université de Montréal)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #767

Mitigating Forgetting in Online Continual Learning via Instance-Aware Parameterization
Hung-Jen Chen (National Tsing Hua University) · An-Chieh Cheng (National Tsing Hua University) · Da-Cheng Juan (Google) · Wei Wei (CMU) · Min Sun (Appier, Inc.)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #770

Domain Adaptation with Conditional Distribution Matching and Generalized Label Shift
Remi Tachet des Combes (Microsoft Research Montreal) · Han Zhao (Carnegie Mellon University) · Yu-Xiang Wang (UC Santa Barbara) · Geoffrey Gordon (MSR Montréal & CMU)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1008

Privacy & Robustness

Denoised Smoothing: A Provable Defense for Pretrained Classifiers [code]Hadi Salman (Microsoft Research AI) · Mingjie Sun (Carnegie Mellon University) · Greg Yang (Microsoft Research) · Ashish Kapoor (Microsoft) · J. Zico Kolter (Carnegie Mellon University / Bosch Center for AI)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #302

Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates
Wenhao Luo (Carnegie Mellon University) · Wen Sun (Cornell University) · Ashish Kapoor (Microsoft)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #312

A Closer Look at Accuracy vs. Robustness [code]Yao-Yuan Yang (UCSD) · Cyrus Rashtchian (UCSD) · Hongyang Zhang (TTIC) · Russ Salakhutdinov (Carnegie Mellon University) · Kamalika Chaudhuri (UCSD)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #667

Measuring Robustness to Natural Distribution Shifts in Image Classification [code]Rohan Taori (Stanford University) · Achal Dave (Carnegie Mellon University) · Vaishaal Shankar (UC Berkeley) · Nicholas Carlini (Google) · Benjamin Recht (UC Berkeley) · Ludwig Schmidt (UC Berkeley)
Tue Dec 08 09:00 PM — 11:00 PM (PST) @ Poster Session 2 #679

Smoothed Geometry for Robust Attribution
Zifan Wang (Carnegie Mellon University) · Haofan Wang (Carnegie Mellon University) · Shakul Ramkumar (Carnegie Mellon University) · Piotr Mardziel (Carnegie Mellon University) · Matt Fredrikson (CMU) · Anupam Datta (Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #936

Trade-offs and Guarantees of Adversarial Representation Learning for Information Obfuscation
Han Zhao (Carnegie Mellon University) · Jianfeng Chi (University of Virginia) · Yuan Tian (University of Virginia) · Geoffrey Gordon (MSR Montréal & CMU)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1066

Understanding Gradient Clipping in Private SGD: A Geometric Perspective
Xiangyi Chen (University of Minnesota) · Steven Wu (Carnegie Mellon University) · Mingyi Hong (University of Minnesota)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1081

Fairness & Interpretability

Fair Hierarchical Clustering
Sara Ahmadian (Google Research) · Alessandro Epasto (Google) · Marina Knittel (University of Maryland, College Park) · Ravi Kumar (Google) · Mohammad Mahdian (Google Research) · Benjamin Moseley (Carnegie Mellon University) · Philip Pham (Google) · Sergei Vassilvitskii (Google) · Yuyan Wang (Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #859

Metric-Free Individual Fairness in Online Learning
Yahav Bechavod (Hebrew University of Jerusalem) · Christopher Jung (University of Pennsylvania) · Steven Wu (Carnegie Mellon University)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #861

How do fair decisions fare in long-term qualification?
Xueru Zhang (University of Michigan) · Ruibo Tu (KTH Royal Institute of Technology) · Yang Liu (UC Santa Cruz) · mingyan liu (university of Michigan, Ann Arbor) · Hedvig Kjellstrom (KTH Royal Institute of Technology) · Kun Zhang (CMU) · Cheng Zhang (Microsoft Research, Cambridge, UK)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #869

Regularizing Black-box Models for Improved Interpretability
Gregory Plumb (Carnegie Mellon University) · Maruan Al-Shedivat (Carnegie Mellon University) · Ángel Alexander Cabrera (Carnegie Mellon University) · Adam Perer (Carnegie Mellon University) · Eric Xing (Petuum Inc. / Carnegie Mellon University) · Ameet Talwalkar (CMU)
Wed Dec 09 09:00 AM — 11:00 AM (PST) @ Poster Session 3 #1078

Explainable Voting
Dominik Peters (Carnegie Mellon University) · Ariel Procaccia (Harvard University) · Alexandros Psomas (Purdue University) · Zixin Zhou (Peking University)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1560

Counterfactual Predictions under Runtime Confounding
Amanda Coston (Carnegie Mellon University) · Edward Kennedy (Carnegie Mellon University) · Alexandra Chouldechova (CMU)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1622

Multi-agent Systems

Improving Policy-Constrained Kidney Exchange via Pre-Screening
Duncan McElfresh (University of Maryland) · Michael Curry (University of Maryland) · Tuomas Sandholm (CMU, Strategic Machine, Strategy Robot, Optimized Markets) · John Dickerson (University of Maryland)
Mon Dec 07 09:00 PM — 11:00 PM (PST) @ Poster Session 0 #126

Mitigating Manipulation in Peer Review via Randomized Reviewer Assignments
Steven Jecmen (Carnegie Mellon University) · Hanrui Zhang (Duke University) · Ryan Liu (Carnegie Mellon University) · Nihar Shah (CMU) · Vincent Conitzer (Duke University) · Fei Fang (Carnegie Mellon University)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #267

Polynomial-Time Computation of Optimal Correlated Equilibria in Two-Player Extensive-Form Games with Public Chance Moves and Beyond
Gabriele Farina (Carnegie Mellon University) · Tuomas Sandholm (CMU, Strategic Machine, Strategy Robot, Optimized Markets)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #341

No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium
Andrea Celli (Politecnico di Milano) · Alberto Marchesi (Politecnico di Milano) · Gabriele Farina (Carnegie Mellon University) · Nicola Gatti (Politecnico di Milano)
Tue Dec 08 09:00 AM — 11:00 AM (PST) @ Poster Session 1 #535

EvolveGraph: Multi-Agent Trajectory Prediction with Dynamic Relational Reasoning
Jiachen Li (University of California, Berkeley) · Fan Yang (Carnegie Mellon University) · Masayoshi Tomizuka (University of California, Berkeley) · Chiho Choi (Honda Research Institute US)
Wed Dec 09 09:00 PM — 11:00 PM (PST) @ Poster Session 4 #1236

Evaluating and Rewarding Teamwork Using Cooperative Game Abstractions
Tom Yan (Carnegie Mellon University) · Christian Kroer (Columbia University) · Alexander Peysakhovich (Facebook)
Wed Dec 09 09:00 PM — 11:00 PM (PST) @ Poster Session 4 #1272

Small Nash Equilibrium Certificates in Very Large Games
Brian Zhang (Carnegie Mellon University) · Tuomas Sandholm (CMU, Strategic Machine, Strategy Robot, Optimized Markets)
Thu Dec 10 09:00 AM — 11:00 AM (PST) @ Poster Session 5 #1465


Invited Speakers

Differentiable Computer Vision, Graphics, and Physics in Machine Learning
Abhinav Gupta
Fri Dec 11 05:00 AM — 12:30 PM (PST)

Advances and Opportunities: Machine Learning for Education
Carolyn Rose, Ken Koedinger
Fri Dec 11 05:30 AM — 02:10 PM (PST)

Human in the Loop Dialogue Systems
Maxine Eskenazi, Alexander Rudnicky
Fri Dec 11 06:10 AM — 05:20 PM (PST)

Causal Discovery and Causality-Inspired Machine Learning
Clark Glymour
Fri Dec 11 06:50 AM — 04:50 PM (PST)

Self-Supervised Learning — Theory and Practice
Katerina Fragkiadaki, Abhinav Gupta, Ruslan Salakhutdinov
Sat Dec 12 08:50 AM — 06:40 PM (PST)

Algorithmic Fairness through the Lens of Causality and Interpretability
Hoda Heidari
Sat Dec 12 01:00 AM — 12:00 PM (PST)

International Workshop on Scalability, Privacy, and Security in Federated Learning (SpicyFL 2020)
Ruslan Salakhutdinov, Virginia Smith
Sat Dec 12


Differentiable Computer Vision, Graphics, and Physics in Machine Learning
Krishna Murthy Jatavallabhula · Kelsey Allen · Victoria Dean · Johanna Hansen · Shuran Song · Florian Shkurti · Liam Paull · Derek Nowrouzezahrai · Josh Tenenbaum
Fri Dec 11 05:00 AM — 12:30 PM (PST)

Self-Supervised Learning for Speech and Audio Processing
Abdel-rahman Mohamed · Hung-yi Lee · Shinji Watanabe · Shang-Wen Li · Tara Sainath · Karen Livescu
Fri Dec 11 06:50 AM — 04:25 PM (PST)

Causal Discovery and Causality-Inspired Machine Learning
Biwei Huang · Sara Magliacane · Kun Zhang · Danielle Belgrave · Elias Bareinboim · Daniel Malinsky · Thomas Richardson · Christopher Meek · Peter Spirtes · Bernhard Schölkopf
Fri Dec 11 06:50 AM — 04:50 PM (PST)

Machine Learning and the Physical Sciences
Anima Anandkumar · Kyle Cranmer · Shirley Ho · Mr. Prabhat · Lenka Zdeborová · Atilim Gunes Baydin · Juan Carrasquilla · Adji Bousso Dieng · Karthik Kashinath · Gilles Louppe · Brian Nord · Michela Paganini · Savannah Thais
Fri Dec 11 07:00 AM — 03:15 PM (PST)

First Workshop on Quantum Tensor Networks in Machine Learning
Xiao-Yang Liu · Qibin Zhao · Jacob Biamonte · Cesar Caiafa · Paul Pu Liang · Nadav Cohen · Stefan Leichenauer
Fri Dec 11 08:00 AM — 07:00 PM (PST)

ML Retrospectives, Surveys & meta-Analyses (ML- RSA)
Chhavi Yadav · Prabhu Pradhan · Abhishek Gupta · Ryan Lowe · Peter Henderson · Jessica Forde · Mayoore Jaiswal · Jesse Dodge
Fri Dec 11 08:30 AM — 09:00 PM (PST)

BabyMind: How Babies Learn and How Machines Can Imitate
Byoung-Tak Zhang · Gary Marcus · Angelo Cangelosi · Pia Knoeferle · Klaus Obermayer · David Vernon · Chen Yu
Fri Dec 11 08:40 AM — 05:30 PM (PST)

Machine Learning for Autonomous Driving
Rowan McAllister · Xinshuo Weng · Xinshuo Weng · Daniel Omeiza · Nick Rhinehart · Fisher Yu · German Ros · Vladlen Koltun
Fri Dec 11 08:55 AM — 05:00 PM (PST)

Workshop on Dataset Curation and Security
Nathalie Baracaldo Angel · Yonatan Bisk · Avrim Blum · Michael Curry · John Dickerson · Micah Goldblum · Tom Goldstein · Bo Li · Avi Schwarzschild
Fri Dec 11

Tackling Climate Change with ML
David Dao · Evan Sherwin · Priya Donti · Yumna Yusuf · Lauren Kuntz · Lynn Kaack · David Rolnick · Catherine Nakalembe · Claire Monteleoni · Yoshua Bengio
Fri Dec 11

HAMLETS (Human And Machine in-the-Loop Evaluation and Learning Strategies)
Divyansh Kaushik · Bhargavi Paranjape · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela
Sat Dec 12 08:15 AM — 08:00 PM (PST)

Self-Supervised Learning — Theory and Practice
Pengtao Xie · Shanghang Zhang · Pulkit Agrawal · Ishan Misra · Cynthia Rudin · Abdel-rahman Mohamed · Wenzhen Yuan · Barret Zoph · Laurens van der Maaten · Eric Xing
Sat Dec 12 08:50 AM — 06:40 PM (PST)

International Workshop on Scalability, Privacy, and Security in Federated Learning (SpicyFL 2020)
Xiaolin Andy Li · Dejing Dou · Ameet Talwalkar · Hongyu Li · Jianzong Wang · Yanzhi Wang
Sat Dec 12

Machine Learning for Engineering Modeling, Simulation and Design
Alex Beatson · Priya Donti · Amira Abdel-Rahman · Stephan Hoyer · Rose Yu · J. Zico Kolter · Ryan Adams
Sat Dec 12

Read More

Experiments with the ICML 2020 Peer-Review Process

Experiments with the ICML 2020 Peer-Review Process

This post is cross-listed on 

The International Conference on Machine Learning (ICML) is a flagship machine learning conference that in 2020 received 4,990 submissions and managed a pool of 3,931 reviewers and area chairs. Given that the stakes in the review process are high — the careers of researchers are often significantly affected by the publications in top venues — we decided to scrutinize several components of the peer-review process in a series of experiments. Specifically, in conjunction with the ICML 2020 conference, we performed three experiments that target: resubmission policies, management of reviewer discussions, and reviewer recruiting. In this post, we summarize the results of these studies.

Resubmission Bias

Motivation. Several leading ML and AI conferences have recently started requiring authors to declare previous submission history of their papers. In part, such measures are taken to reduce the load on reviewers by discouraging resubmissions without substantial changes. However, this requirement poses a risk of bias in reviewers’ evaluations.

Research question. Do reviewers get biased when they know that the paper they are reviewing was previously rejected from a similar venue?

Procedure. We organized an auxiliary conference review process with 134 junior reviewers from 5 top US schools and 19 papers from various areas of ML. We assigned participants 1 paper each and asked them to review the paper as if it was submitted to ICML. Unbeknown to participants, we allocated them to a test or control condition uniformly at random:

  • Control. Participants review the papers as usual.
  • Test. Before reading the paper, participants are told that the paper they review is a resubmission.

Hypothesis. We expect that if the bias is present, reviewers in the test condition should be harsher than in the control. 

Key findings. Reviewers give almost one point lower score (95% Confidence Interval: [0.24, 1.30]) on a 10-point Likert item for the overall evaluation of a paper when they are told that a paper is a resubmission. In terms of narrower review criteria, reviewers tend to underrate “Paper Quality” the most.

Implications. Conference organizers need to evaluate a trade-off between envisaged benefits such as the hypothetical reduction in the number of submissions and the potential unfairness introduced to the process by the resubmission bias. One option to reduce the bias is to postpone the moment in which the resubmission signal is revealed until after the initial reviews are submitted. This finding must also be accounted for when deciding whether the reviews of rejected papers should be publicly available on systems like and others. 


Herding Effects in Discussions

Motivation. Past research on human decision making shows that group discussion is susceptible to various biases related to social influence. For instance, it is documented that the decision of a group may be biased towards the opinion of the group member who proposes the solution first. We call this effect herding and note that, in peer review, herding (if present) may result in undesirable artifacts in decisions as different area chairs use different strategies to select the discussion initiator.

Research question. Conditioned on a set of reviewers who actively participate in a discussion of a paper, does the final decision of the paper depend on the order in which reviewers join the discussion?

Procedure. We performed a randomized controlled trial on herding in ICML 2020 discussions that involved about 1,500 papers and 2,000 reviewers. In peer review, the discussion takes place after the reviewers submit their initial reviews, so we know prior opinions of reviewers about the papers. With this information, we split a subset of ICML papers into two groups uniformly at random and applied different discussion-management strategies to them: 

  • Positive Group. First ask the most positive reviewer to start the discussion, then later ask the most negative reviewer to contribute to the discussion.
  • Negative Group. First ask the most negative reviewer to start the discussion, then later ask the most positive reviewer to contribute to the discussion.

Hypothesis. The only difference between the strategies is the order in which reviewers are supposed to join the discussion. Hence, if the herding is absent, the strategies will not impact submissions from the two groups disproportionately. However, if the herding is present, we expect that the difference in the order will introduce a difference in the acceptance rates across the two groups of papers.

Key findings. The analysis of outcomes of approximately 1,500 papers does not reveal a statistically significant difference in acceptance rates between the two groups of papers. Hence, we find no evidence of herding in the discussion phase of peer review.

Implications. Regarding the concern of herding which is found to occur in other applications involving people, discussion in peer review does not seem to be susceptible to this effect and hence no specific measures to counteract herding in peer-review discussions are needed.


Novice Reviewer Recruiting

Motivation.  A surge in the number of submissions received by leading ML and  AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers. Leading conferences have been addressing the issue by relaxing the seniority bar for reviewers and inviting very junior researchers with limited or no publication history, but there is mixed evidence regarding the impact of such interventions on the quality of reviews. 

Research question. Can very junior reviewers be recruited and guided such that they enlarge the reviewer pool of leading ML and AI conferences without compromising the quality of the process?

Procedure. We implemented a twofold approach towards managing novice reviewers:

  • Selection. We evaluated reviews written in the aforementioned auxiliary conference review process involving 134 junior reviewers, and invited 52 of these reviewers who produced the strongest reviews to join the reviewer pool of ICML 2020. Most of these 52 “experimental” reviewers come from the population not considered by the conventional way of reviewer recruiting used in ICML 2020.
  • Mentoring. In the actual conference, we provided these experimental reviewers with a senior researcher as a point of contact who offered additional mentoring.

Hypothesis. If our approach allows to bring strong reviewers to the pool, we expect experimental reviewers to perform at least as good as reviewers from the main pool on various metrics, including the quality of reviews as rated by area chairs.

Key findings. A combination of the selection and mentoring mechanisms results in reviews of at least comparable and on some metrics even higher-rated quality as compared to the conventional pool of reviews: 30% of reviews written by the experimental reviewers exceeded the expectations of area chairs (compared to only 14% for the main pool).

Implications. The experiment received positive feedback from participants who appreciated the opportunity to become a reviewer in ICML 2020 and from authors of papers used in the auxiliary review process who received a set of useful reviews without submitting to a real conference. Hence, we believe that a promising direction is to replicate the experiment at a larger scale and evaluate the benefits of each component of our approach.



All in all, the experiments we conducted in ICML 2020 reveal some useful and actionable insights about the peer-review process. We hope that some of these ideas will help to design a better peer-review pipeline in future conferences.

We thank ICML area chairs, reviewers, and authors for their tremendous efforts. We would also like to thank the Microsoft Conference Management Toolkit (CMT) team for their continuous support and implementation of features necessary to run these experiments, the authors of papers contributed to the auxiliary review process for their responsiveness, and participants of the resubmission bias experiment for their enthusiasm. Finally, we thank Ed Kennedy and Devendra Chaplot for their help with designing and executing the experiments.

The post is based on joint works with Nihar B. Shah, Aarti Singh, Hal Daumé III, and Charvi Rastogi.

Read More

On Learning Language-Invariant Representations for Universal Machine Translation

On Learning Language-Invariant Representations for Universal Machine Translation

Figure 1: An encoder-decoder generative model of translation pairs, which helps to circumvent the limitation discussed before. There is a global distribution (mathcal{D}) over the representation space (mathcal{Z}), from which sentences of language (L_i) are generated via decoder (D_i). Similarly, sentences could also be encoded via (E_i) to (mathcal{Z}).

Despite the recent improvements in neural machine translation (NMT), training a large NMT model with hundreds of millions of parameters usually requires a collection of parallel corpora at a large scale, on the order of millions or even billions of aligned sentences for supervised training (Arivazhagan et al.). While it might be possible to automatically crawl the web to collect parallel sentences for high-resource language pairs, such as German-English and French-English, it is often infeasible or expensive to manually translate large amounts of sentences for low-resource language pairs, such as Nepali-English, Sinhala-English, etc. To this end, the goal of the so-called multilingual universal machine translation, a.k.a., universal machine translation (UMT), is to learn to translate between any pair of languages using a single system, given pairs of translated documents for some of these languages. The hope is that by learning a shared “semantic space” between multiple source and target languages, the model can leverage language-invariant structure from high-resource translation pairs to transfer to the translation between low-resource language pairs, or even enable zero-shot translation.

Indeed, training such a single massively multilingual model has gained impressive empirical results, especially in the case of low-resource language pairs (see Fig. 2). However, such success also comes with a cost. From Fig. 2 we observe that the translation quality over high-resource language pairs by using such a single UMT system is worse than the corresponding bilingual baselines.

Figure 2: Translation quality by using a single massively multilingual model against bilingual baselines that are trained for each one of the 103 language pairs. While the translation performances over low resource languages increase, the performances over high resource languages decrease. Our work provides a theoretical explanation for this empirical phenomenon. Figure credit: Exploring Massively Multilingual, Massive Neural Machine Translation.

Is this empirical phenomenon by coincidence? If not, why does it happen? Furthermore, what kind of structural assumptions about languages could help us get over this detrimental effect? In this blog post, based on our recent ICML paper, we take the first step towards understanding universal machine translation by providing answers to the above questions. The key takeaways of this blog post could be summarized as follows:

  • In a completely assumption-free setup, based on a common shared representation, no matter what decoder is used to translate the target languages, it is impossible to avoid making a large translation error on at least one pair of the translation pairs.
  • Under a natural generative model assumption for the data, after seeing aligned sentences for a linear number of language pairs (instead of quadratic!), we can learn encoder/decoders that perform well on any unseen language pair, i.e., zero-shot translation is possible.

An Impossibility Theorem on UMT via Language-Invariant Representations

Suppose we have an unlimited amount of parallel sentences for each pair of languages, with unbounded computational resources. Could we train a single model that performs well on all pairs of translation tasks based on a common representation space? Put it in other words, is there any information-theoretic limit of such systems for the task of UMT? In this paragraph we will show that there is an inherent tradeoff between the translation quality and the degree of representation invariance w.r.t. languages: the better the language invariance, the higher the cost on at least one of the translation pairs. At a high-level, this result holds due to the general data-processing principle: if a representation is invariant to multiple source languages, then any decoder based on this representation will have to generate the same language model on the target language. But on the other hand, the parallel corpora we use to train such a system could have drastically different sentence distributions on the target language, thus leading to a discrepancy (error) between the generated sentence distribution and the ground-truth sentence distribution over the target language.

To keep our discussions simple and transparent, let’s start with a basic Two-to-One setup where there are only two source languages (L_0) and (L_1) and one target language (L). Furthermore, for each source language (L_i, iin{0, 1}), let’s assume that there is a perfect translator (f_{L_ito L}^*) that takes a sentence (or string, sequence) from (L_i) and outputs the corresponding translation in (L). Under this setup, it is easy to see that there exists a perfect translator (f_L^*) in this Two-to-One task: $$f_L^*(x) = sum_{iin{0, 1}}mathbb{I}(xin L_i)cdot f_{L_ito L}^*(x)$$ In words: upon receiving a sentence (x), (f_L^*) simply checks which source language (x) comes from and then call the corresponding ground-truth translator.

To make the idea of language-invariant representations formal, let (g: Sigma^*to mathcal{Z}) be an encoder that takes a sentence (string) from alphabet (Sigma) to a representation in a vector space (mathcal{Z}). We call (g) an (epsilon)-universal language mapping if the distributions of sentence representations from different languages (L_0) and (L_1) are (epsilon)-close to each other. In words, (d(g_sharpmathcal{D}_0, g_sharpmathcal{D}_1)leq epsilon) for some divergence measure (d), where (g_sharpmathcal{D}_i) is the induced distribution of sentence (from (L_i)) representations in the shared space (mathcal{Z}). Subsequently, a multilingual system will train a decoder (h) that takes a sentence representation (z) and outputs the corresponding target translation in language (L). The hope here is that (z) encodes the language-invariant semantic information about the input sentence (either from (L_0) or from (L_1)) based on which to translate to the target language (L).

So far so good, but could we recover the perfect translator (f_L^*) by learning a common, shared representation (Z), i.e., (epsilon) is small? Unfortunately, the answer here is negative if we don’t have any assumption on the parallel corpora we use to train our encoder (g) and decoder (h):

Theorem (informal): Let (g:Sigma^*tomathcal{Z}) be an (epsilon)-universal language mapping. Then for any decoder (h:mathcal{Z}to Sigma_L^*), the following lower bound holds: $$text{Err}_{mathcal{D}_0}^{L_0to L}(hcirc g) + text{Err}_{mathcal{D}_1}^{L_1to L}(hcirc g)geq d(mathcal{D}_0(L), mathcal{D}_1(L))- epsilon.$$

Here the error term (text{Err}_{mathcal{D}_i}^{L_ito L}(hcirc g)) measures the (0-1) translation performance given by the encoder-decoder pair (hcirc g) from (L_i) to (L) over distribution (mathcal{D}_i). The first term (d(mathcal{D}_0(L), mathcal{D}_1(L))) in the lower bound measures the difference of distributions over sentences from the target language in the two parallel corpora, i.e., (L_0-L) and (L_1 – L). For example, in many practical scenarios, it may happen that the parallel corpus of high-resource language pair, e.g., German-English, contains sentences over a diverse domain whereas as a comparison, the parallel corpus of low-resource language pair, e.g., Sinhala-English, only contains target translations from a specific domain, e.g., sports, news, product reviews, etc. In this case, despite the fact that the target is the same language (L), the corresponding sentence distributions from English are quite different between different corpora, leading to a large lower bound. As a result, our theorem, which could be interpreted as a kind of uncertainty principle in UMT, says that no matter what kind of decoder we are going to use, it has to incur a large error on at least one of the translation pairs. It is also worth pointing out that our lower bound is algorithm-independent and it holds even with unbounded computation and data. As a final note, realize that for fixed distributions (mathcal{D}_i, iin{0, 1}), the smaller the (epsilon) (hence the better the language-invariant representations), the larger the lower bound, demonstrating an inherent tradeoff between language-invariance and translation performance in general.

Proof Sketch: Here we provide a proof-by-picture (Fig. 3) in the special case of perfectly language-invariant representations, i.e., (epsilon = 0), to highlight the main idea in our proof of the above impossibility theorem. Please refer to our paper for more detailed proof as well as an extension of the above impossibility theorem in the more general many-to-many translation setting.

Figure 3: Proof by picture: Language-invariant representation (g) induces the same feature distribution over (mathcal{Z}), which leads to the same output distribution over the target language (Sigma_L^*). However, the parallel corpora of the two translation tasks in general have different marginal distributions over the target language, hence a triangle inequality over the output distributions gives the desired lower bound.

How can we Bypass this Limitation?

One way is to allow the decoder (h) to have access to the input sentences (besides the language-invariant representations) during the decoding process — e.g. via an attention mechanism on the input level. Technically, such information flow from input sentences during decoding would break the Markov structure of “input-representation-output” in Fig. 3, which is an essential ingredient in the proof of our theorem. Intuitively, in this case both language-invariant (hence language-independent) and language-dependent information would be used.

Another way would be to assume extra structure on the distributions of our corpora (mathcal{D}_{i}), i.e., by assuming some natural generative process capturing the distribution of the parallel corpora that are used for training. Since languages share a lot of semantic and syntactic characteristics, this would make a lot of sense — and intuitively, this is what universal translation approaches are banking on. In the next paragraph, we will do exactly this — we will show that under a suitable generative model, not only will there be a language-invariant representation, but it will be learnable using corpora from a very small (linear) number of pairs of language.

A Generative Model for UMT: A Linear Number of Translation Pairs Suffices!

In this section we will discuss a generative model, under which not only will there be a language-invariant representation, but it will be learnable using corpora from a very small (linear) number of pairs of language. Note that there are a quadratic number of translation pairs in our universe, hence our result shows that under this generative model zero-shot translation is actually possible.

To start with, what kind of generative model is suitable for the task of UMT? Ideally, we would like to have a feature space where vectors correspond to the semantic encoding of sentences from different languages. One could also understand it as a sort of “meaning” space. Then, language-dependent decoders would take these semantic vectors and decode them as the observable sentences. Figure 1 illustrates the generative process of our model, where we assume there is a common distribution (mathcal{D}) over the feature space (mathcal{Z}), from which parallel sentences are sampled and generated.

For ease of presentation, let’s first assume that each encoder-decoder pair ((E_i, D_i)) consists of deterministic mappings (see our paper on extensions with randomized encoders/decoders). The first question to ask is: how does this generative model assumption circumvent our previous lower bound in the last paragraph? We can easily observe that under the encoder-decoder generative assumption in Figure 1, the first term in our lower bound, (d(mathcal{D}_0(L), mathcal{D}_1(L))), gracefully reduces to 0, hence even if we try to learn perfectly language-invariant representations ((epsilon = 0)), there will be no loss of translation accuracy using universal language mapping. Perhaps what’s more interesting is that, under proper assumptions on the structure of (mathcal{F}), the class of encoders and decoders we learn from, by using the traditional empirical risk minimization (ERM) framework to learn the language-dependent encoders and decoders on a small number of language pairs, we could expect the learned encoders/decoders to well generalize on unseen language pairs as well! Informally,

Theorem (informal): Let (H) be a connected graph where each node (L_i) corresponds to a language and each edge ((L_i, L_j)) means that the learner has been trained on language pair (L_i) and (L_j), with empirical translation error (epsilon_{i,j}) and corpus of size (Omega(1 / epsilon_{i,j}^2 cdot log C(mathcal{F}))). Then with high probability, for any pair of language (L) and (L’) that are connected by a path (L = L_0, L_1, ldots, L_m = L’) in (H), its population level translation error is upper bounded by (O(sum_{k=0}^{m-1}epsilon_{k,k+1})).

In the theorem above, (C(mathcal{F})) is some complexity measure of the class (mathcal{F}). If we slightly simplify the theorem above by defining (epsilon := max_{(L_i, L_j)in H}epsilon_{i,j}) and realizing that the path length (m) is upper bounded by the diameter of the graph (H), (text{diam}(H)), we immediately obtain the following intuitive result:

For any pair of languages (L, L’) (the parallel corpus between (L) and (L’) may not necessarily appear in our training corpora), the translation error between (L) and (L’) is upper bounded by (O(text{diam}(H) cdot epsilon)).

The above corollary says that graphs (H) that do not have long paths are preferable. For example, (H) could be a star graph, where a central (high-resource) language acts as a pivot node. The proof of the theorem above essentially boils down to two steps: first, we use an epsilon-net argument to show that the learned encoders/decoders generalize on a pair of language that appears in our training corpora, and then by using the connectivity of the graph (H), we apply a chain of triangle-like inequalities to bound the error along the path connecting any pair of languages.

Some Concluding Thoughts

The prospect of building a single system for universal machine translation is appealing. Compared with building a quadratic number of bilingual translators, such a single system is easier to train, build, deploy, and maintain. More importantly, this could potentially allow the system to transfer some common knowledge in translation from high-resource languages to low-resource ones. However, such promise often comes with a price, which calls for proper assumptions on the generative process of the parallel corpora used for training. Our paper takes a first step towards better understanding the tradeoff in this regard and proposes a simple setup that allows for zero-shot translation. On the other hand, there are still some gaps between theory and practice. For example, it would be interesting to see whether the BLEU score, a metric used in the empirical evaluation of translation quality, bears a similar kind of lower bound. Also, could we further extend our generative modeling of sentences so that there are more hierarchical structures in the semantic space (mathcal{Z})? Empirically, it would be interesting to implement the above generative model on synthetic data to see the actual performance of zero-shot translation under the model assumption. These challenging problems (and more) will require collaborative efforts from a wide range of research communities and we hope our initial efforts could inspire more efforts in bridging the gap.


  1. Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges, Arivazhagan et al.,
  2. Investigating Multilingual NMT Representations at Scale, Kudugunta et al., EMNLP 2019,
  3. On Learning Language-Invariant Representations for Universal Machine Translation, Zhao et al., ICML 2020,
  4. The Source-Target Domain Mismatch Problem in Machine Translation, Shen et al.,
  5. How multilingual is Multilingual BERT? Pires et al., ACL 2019,

DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.

Read More

FACT Diagnostic: How to Better Understand Trade-offs Involving Group Fairness

FACT Diagnostic: How to Better Understand Trade-offs Involving Group Fairness

Figure 1. The FACT diagnostic is a general framework that allows easier and flexible analyses of trade-offs between group fairness and predictive performance (type-1 trade-off), or among different types of group fairness definitions (type-2 trade-off).

As machine learning continues to be more widely used for applications with a societal impact like mortgage lending and predictive policing, model developers face increased regulatory scrutiny to verify and understand model fairness. To provide quantitative tests of model fairness, practitioners further need to choose between multiple definitions of fairness that exist in the machine learning literature. One prevalent class of these definitions is group fairness, which measures how a group of individuals with certain protected attributes (like gender or race) is impacted differently from other groups. This general notion is widely studied under the name of disparate impact in the legal context, and one specific instance of this notion has been accepted by the US government as a guideline towards a fair employment process in 1978.

From a technical point of view, however, several definitions of group fairness have been shown to conflict with one another usually with a necessary cost in loss of accuracy. Throughout this post, we will refer to this inherent trade-off between accuracy and fairness as type-1 trade-off, and the trade-off among different notions of group fairness as type-2 trade-off. Such considerations complicate the practical development and assessment of machine learning models designed to satisfy group fairness, as the conditions under which these trade-offs necessarily occur can be too abstract to understand and time-consuming to verify. As a result, it is difficult in general for model developers to explore these trade-offs efficiently. Although previous works have studied these trade-offs in an ad hoc and definition-specific manner, there remains a pressing need for a more general and unified perspective.

Figure 2. Understanding type-1 (left) and type-2 (right) trade-offs in group fairness can speed up the model development process by reducing the time and effort spent on trying to obtain specifications that are not feasible.

To put these issues into context, consider an engineer training a model to satisfy both fairness and performance specifications. Shown in Figure 2 (left), typically the engineer needs to resort to several iterations of training and evaluating models with different levels of performance and fairness (yellow circles). But with a knowledge of the trade-off boundary representing type-1 trade-off (blue solid line), the engineer can easily understand the frontier of achievable accuracy and fairness levels and quickly rule out specifications that are not feasible, all before training or evaluating any models. This reduces the time and effort spent on trying to obtain a model with unrealistic configurations. Furthermore, it is important for not only the engineer but also the regulators to fully grasp type-2 trade-offs with a list of compatible/incompatible group fairness notions (Figure 2 right), in order to provide reasonable guidelines.

In our ICML 2020 paper, we present the FACT (FAirness-Confusion Tensor) diagnostic as a tool for addressing the above desiderata for better understanding trade-offs involving group fairness. The diagnostic hinges on the observation that multiple group fairness definitions can be represented in a unified fashion with the FACT, which is the traditional confusion matrix for each group with different attribute values stacked together (Figure 3). All group fairness definitions take the form of equating conditional probabilities for different protected groups, and these conditional probabilities can be expressed succinctly using the elements of the FACT. 

Figure 3. Fairness-confusion tensor (FACT), denoted throughout the post as ( mathbf{z} ), is simply the confusion matrix (true-positives, false-positives, false-negatives, true-negatives) for each group with protected attribute values (A) stacked together. For a single binary protected attribute (e.g. (A in {0, 1} )), which is our main focus throughout this post, the resulting tensor can be flattened into an 8-dimensional vector as shown. This allows the simultaneous treatment of multiple protected attributes easily and provides a unified representation for all group fairness notions.

For instance, consider a binary classification task, with ( hat{Y} ) being the classifier prediction and (A) being the binary protected attribute. Demographic parity (DP), which is one of the most widely used notions of group fairness, is defined as equating the positive prediction rate for both groups with (A = 1) and (A = 0). In terms of conditional probability, this is (P(hat{Y}=1 | A = 1) = P(hat{Y} =1  | A = 0) ), which can be formatted as a linear system of the FACT: $$mathbf{M} mathbf{z} = 0, text{ where } mathbf{M} = frac{1}{N_0 + N_1} begin{pmatrix} N_0 & 0 & N_0 & 0 & -N_1 & 0 & -N_1 & 0 end{pmatrix}$$ with (N_a) being the sum of all elements of the slice of the FACT for group with (A = a). Other notions of group fairness can be similarly expressed either in linear or quadratic format with respect to the FACT.

With this tool for characterizing different group fairness notions, we can formulate type-1 trade-off in a unified model-agnostic fashion via linear programs over the FACT. This formulation extends similarly to type-2 trade-offs and model-specific scenarios with some tweaks, yielding an even more comprehensive framework for understanding a wide range of trade-offs involving group fairness.

Optimization over the FACT

We define a linear program over the possible FACTs called Least-squares Accuracy-Fairness Optimality Problem (LAFOP):

$$min_mathbf{z} mathbf{c}^T mathbf{z} quad  text{ such that } quad mathbf{M} mathbf{z} leq epsilon$$

Essentially the optimization problem searches for a valid FACT that satisfies a specified set of fairness conditions linearly expressed using the fairness matrix (mathbf{M}) while optimizing for the classification error rate in the objective. 

Solving this optimization problem for different values of (epsilon) yields different objective function values, which we denote as (delta). We are then interested in the resulting ((epsilon, delta))-solutions of LAFOP, which intuitively represent FACTs that deviate from perfect fairness and perfect accuracy by (epsilon) and (delta) respectively. These ((epsilon, delta)) value pairs naturally translate to the trade-off boundary for type-1 trade-off called the FACT Pareto frontier, just like the blue solid line in our earlier example: changes in (delta) as we vary (epsilon) will indicate the change in the best achievable classification error rate by the model (i.e. bigger (delta) means a bigger drop in accuracy). Note that by definition, this frontier is model-agnostic, enabling the engineer to apply it before training any models. We discuss more in the paper how LAFOP is also amenable to proving general incompatibility theorems for type-2 trade-offs.

While LAFOP is designed to be model agnostic, we can modify it to be model specific in case there is a trained model whose limitations in achieving fairness via post-processing need to be assessed. This leads to a model-specific (MS) variation of LAFOP called MS-LAFOP, which places additional model-dependent constraints on the solution space of the FACTs. Because now the problem is grounded on a specific model, ((epsilon, delta))-solutions of the MS-LAFOP yield a more realistic FACT Pareto frontier. The solutions of the MS-LAFOP naturally provide a way to post-process that model for better fairness guarantees, as we discuss in the paper

Demonstration on the UCI Adult Dataset

Figure 4. The FACT Pareto frontiers show the trade-off between accuracy and fairness, and several algorithms in the frontier are also plotted for comparison.

Using the UCI Adult dataset with gender as the protected attribute, we demonstrate the FACT diagnostic’s usefulness. Figure 4 shows both model-agnostic (MA) and model-specific (MS) FACT Pareto frontiers by plotting ((epsilon, delta))-solutions, under the equalized odds (EOd) fairness. Essentially the frontier allows us to gauge the type-1 trade-off, i.e., how accuracy inevitably drops for increasing levels of fairness. One thing to note when is that the MA FACT Pareto frontier is model-agnostic, and therefore does not take into account the Bayes error of the problem, which is an irreducible amount of error in the problem due to inherent statistical fluctuations in the data preventing a perfectly accurate classifier. This means that the (delta) of 0 (equivalent to accuracy of 1) for the MA FACT Pareto frontier should be interpreted as the Bayes error, not as the value 0 itself. In other words, when viewing the frontier, relative change of the accuracy is more important than the actual values on the y-axis for the model-agnostic case. Accordingly, the frontier tells us that only for the fairness gaps below 0.01 will the accuracy of any models actually start to drop. With results from some fair classification algorithms plotted in the frontier, we can also observe that FGP provides a better trade-off scheme compared to the other two methods presented. Unlike the MA FACT Pareto frontier, its model-specific counterpart has a benefit of providing tighter bounds, as it depends on the pre-trained model used as a reference point.

Figure 5. It is also possible to plot trade-offs that involve a set of fairness conditions together. Fairness gap for a set of fairness definitions is the sum of individual gaps for each definition in the set. For the details of each definition, refer to Table 1 of our paper

Modifying the constraints on LAFOP to encode multiple fairness definitions leads to the MA FACT Pareto frontier for scenarios when those fairness conditions are imposed simultaneously. This is shown in Figure 5, where we observe different sets of group fairness imposed lead to different behaviors of the frontier. Notably, the two halted lines in red and black that do not reach smaller fairness gaps verify the well-established type-2 trade-off result among the given group fairness definitions. 

What’s Next?

The FACT diagnostic aids an intuitive understanding of the trade-offs involved in group fairness by merging multiple definitions into a single framework. Using the FACT as a tool to characterize group fairness definitions, solving LAFOP defined over the FACTs directly shows the degree of trade-offs present in the problem, prior to any training of the model. In this post we have mostly focused on LAFOP and model-agnostic cases, but as we discuss further in the paper, the FACT diagnostic more broadly encompasses different optimization problems while at the same time demonstrating versatility of improving models via post-processing.

If you are interested, check out the paper and code for more details. This is joint work with Jiahao Chen (JPMorgan AI Research) and Ameet Talwalkar (CMU).

Here are some relevant references.

DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.

Read More

F19 – Topics in data analysis

F19 – Topics in data analysis

This series of blog posts is based on the Fall 2019 10-718 Data Analysis class at Carnegie Mellon University, taught by Leila Wehbe, with the assistance of Jacob Tyo, Aria Wang and Fabricio Flores. The blog posts were written by the students and edited by the instructors and the ML@CMU blog team.

What is data analysis? A simple definition is: the application of machine learning and statistical methods to real world data to solve a problem. While this statement is simple, data analysis eventually requires expertise from a vast number of disciplines such as the real world domain in question (e.g. healthcare, specific scientific field, finance, etc.) and machine learning and statistics, but could also require knowledge from other fields as diverse as computing or policy or law. The complexity of data science leads to a plethora of possible pitfalls, with no clear instructions on how to avoid them. It is very difficult to construct a specific set of such instructions because every application domain has very specific setups, goals and constraints. 

We focus here on these issues from the perspective of a machine learning expert and attempt to provide some general guidelines to avoid pitfalls. In some cases where it’s difficult to provide guidelines, we present a set of notable mistakes to avoid. Unlike usual machine learning classes or tutorials that focus on introducing methods and algorithms, we focus on the higher level of motivating the use of these algorithms and testing the generalizability of their conclusions. We focus on the connection between machine learning and its practice.

In this series of educational blog posts, we highlight components of data analysis by focusing on 7 topics. Each topic is based on key papers, book chapters or blog posts that we have discussed in class. For each topic, we highlight pitfalls to watch out for and propose solutions when possible, some inspired by the literature and others by class discussion. We invite the readers to share their comments with us to help us improve the posts.

1 – The Importance of Domain Knowledge

Why is domain knowledge important in data science? This blog post shows the value of domain knowledge in data analysis from multiple perspectives. It includes some simple case studies to demonstrate how domain knowledge can help us with every stage of the data analysis workflow, focuses on several examples to give an in-depth view of the use of domain knowledge in specific tasks and includes an interesting discussion about the relationship between domain knowledge and machine learning algorithms.

2 – Data Exploration

Although sometimes practitioners tend to spend more time on model architecture design and parameters tuning, the importance of data exploration should not be ignored. If data breaks the assumption of the model or contains errors, it will not be possible to get desired results even with the best of models. This blog post introduces a protocol for data exploration along with several methods that may be useful in this process, including statistical and visualization methods, as well as examples of traps in data exploration and of how data exploration helps reduce bias in the dataset.

3 – Baselines

A baseline guides our selection of more complex models and provides insights into the task at hand. Nonetheless, such a useful tool is not easy to handle, and many researchers tend to compare their novel models against weak baselines which poses a problem in the current research sphere as it leads to optimistic, but false results. This blog provides a definition of different types of baselines, case studies of examples in which they are not correctly used, a discussion on such issues and questions that are still open-ended.

4 – The Overfitting Iceberg

Overfitting, as a conventional and important topic of machine learning, has been well-studied with tons of solid fundamental theories and empirical evidence. However, as breakthroughs in deep learning are rapidly changing science and society in recent years, practitioners have observed many phenomena that seem to contradict classical learning theory. This blog aims to understand the nuances and subtleties behind this apparent contradiction by introducing a proposed mechanism for their emergence; it also summarizes some state-of-the-art strategies to deal with overfitting in the modern DL practice.

5 – Reproducibility

It is now widely agreed that reproducibility is a key part of any scientific process and that it should be considered a regular practice to make our research reproducible. Despite this widely accepted notion, many fields including machine learning are experiencing a reproducibility crisis. This blog explains the different definitions of reproducibility, relates the reproducibility crisis and discusses its implications for scientific research and its more general impacts on society.

6 – Interpretability

The objectives that machine learning models optimize for do not always reflect the actual desiderata of the task at hand. Interpretability in models allow us to evaluate their decisions and obtain information that the objective alone cannot confer. Interpretability takes many forms and can be difficult to define; this blog explores general frameworks and sets of definitions in which model interpretability can be evaluated and compared and analyzes several well-known examples of interpretability methods in the context of this framework.

7 – Causal Inference

The rules of causality play a role in almost everything we do and it is reasonable to assume that considering causality in a world model will be a critical component of intelligent systems in the future. However, the formalisms, mechanisms, and techniques of causal inference remain a niche subject few explore. In this blog we formally consider the statement “association does not equal causation”, review some of the basics of causal inference, discuss causal relationship discovery, and describe a few examples of the benefits of utilizing causality in AI research.

Read More

High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks

High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks

Fig. 1: The central hypothesis: within a dataset with finite samples, there are correlations between the high-frequency component and the “semantic” component of the images. As a result, the model will perceive both the high-frequency component as well as the “semantic” ones, leading to generalization behavior counterintuitive to humans.

It’s all about data

There are many works aiming to explain the generalization behavior of neural networks using heavy mathematical machinery, but we will do something different here: with a simple and intuitive twist of data, we will show that many generalization mysteries (like adversarial vulnerability, BatchNorm’s efficacy, and the “generalization paradox”) might be results of our overconfidence in processing data through naked eyes. Or simply: 

The models may have not outsmarted us, but the data has.

Let’s start with an interesting observation (Fig. 2): we trained a ResNet-18 with the Cifar10 dataset, picked a test sample, and plotted the model’s prediction confidence for this sample. Then we mapped the sample into the frequency domain through Fourier transform, and cut the frequency representation into its high-frequency component (HFC) and low-frequency component (LFC). We reconstructed the image through these two components and fed the reconstructed image into the model:

  • HFC-reconstructed images look distinctly different from the original sample but predicted to be the same label.
  • LFC-reconstructed images look similar to the original sample but the model classifies them differently.
Fig. 2: The striking misalignment between human and models: HFC-reconstructed images look distinctly different from the original sample, but predicted to be the same label; LFC-reconstructed images look similar to the original sample but the model classifies them differently.

Although this phenomenon can only be observed with a subset of samples (~600 images), it’s striking enough to raise an alarm.

Why does a model behave like this?

We believe the underlying reason is the coincidental correlation between HFC and the “semantics” depicted within a dataset (Fig. 1). With a finite number of samples from the same distribution, chances are that the human-imperceptible HFC is correlated with how a human annotates the image; thus, when a model is optimized to reduce the training loss, it can pick up either the “semantics” or HFC to reduce the loss, leading to high prediction accuracy even though the model may not truly “understand” the data. 

Please note: we are not claiming that the model itself has a tendency to capture HFC. Instead, our main argument is that a generic model does not have the incentive to learn LFC only; thus, it may end up learning a mixture of LFC and HFC.

Also, one may wonder whether the fact that models can capture HFC is promising or worrisome. One side may argue that this enables the development of models that can surpass humans on the test data (likely only when the test data is from the same distribution as the training data), while the other side may argue that the resulting models, despite performing better on test data from the same distribution, may underperform after deployed on similar data from other distributions (i.e., HFC may be dataset-specific). This post does not intend to resolve this argument, but only to offer the observations we made.


We can leverage the main observation to help explain multiple previously elusive empirical phenomena. In this post, we will highlight two discussions from our paper.

One of the roots of adversarial vulnerability

Conceptually similar to the argument made in a preceding paper, we show that the predictive signal from HFC is one of the roots of adversarial vulnerability. However, in contrast to this prior work, we offer a concrete proposal regarding what the adversarial features might be: signals from HFC.

To investigate the relationship, we trained an adversarially robust model with Madry’s adversarial training method and studied the convolutional kernels of a robust model and a vanilla model. We notice that the convolutional kernels of a robust model tend to be smoother (“smooth” in the sense that differences between adjacent values are small), as shown in Fig. 3. Relevant mathematical tools suggest that smooth convolutional kernels only consider a negligible amount of HFC in data, thus linking HFC to adversarial vulnerability.

Fig. 3. Left: visualization of convolutional kernels of a vanilla CNN; Right: visualization of an adversarially robust CNN from adversarial training.

With this knowledge, a more enticing question is whether we can directly smooth a vanilla model’s convolutional kernel to improve its adversarial robustness. To answer this question, we tested multiple methods to smooth convolutional kernels:

  • To heuristically adjust the weights of trained kernels to improve the smoothness.
  • To filter out the high-frequency information of trained kernels (not reported in our paper).
  • To design regularization schemes limiting the differences of adjacent values of kernels during training (not reported in our paper).

Unfortunately, only marginal improvements are observed. Thus, we can conclude that adversarially robust models tend to have smooth convolutional kernels, but the inverse is not necessarily true. In other words, HFC is one of the issues of adversarial vulnerability, but not all of the issues.

However, one solution inspired by this observation can indeed defend against most adversarial attacks at a remarkable rate:

  • To filter out HFC of input images before feeding the data into models.

The method can improve the robustness of a model, but the method is effectively masking the gradient of the model, thus not solving the adversarial attack & defense problem the research community focuses on.

The Efficacy of BatchNormalization

Another interesting observation concerns the efficacy of BatchNorm. BatchNorm is one of the most effective training heuristics in modern deep learning research, especially in computer vision applications. However, why BatchNorm can help training so significantly is not yet well understood. Interestingly, our experiments offer another perspective on why BatchNorm often helps.

Fig. 4. How test accuracy changes along with the increment of epochs during training. Each panel depicts the performances of different heuristics. Each color represents the performance from a different radius to cut LFC and HFC. Solid lines represent performances for LFC and dashed lines show those for HFC. The higher the curves of dashed lines, the more HFC a method exploits.

In Fig. 4, along with the increment of training epochs, we report the accuracies of training data and various copies of test data, where r refers to the radius we used to cut LFC and HFC, and solid/dashed line denotes the performances from LFC/HFC, respectively. Thus, the higher the curves get in dashed lines, the more HFC a model takes advantage of.

Surprisingly, the model trained with BatchNorm exploits a significant amount of HFC, as the dashed curves of the 4th panel are remarkably higher than those of the other panels. This observation suggests that one of the reasons why BatchNorm helps is that it encourages the usage of HFC. As we argued previously, there are multiple predictive signals (LFC and HFC) in the data. It is expected that the more signals a model uses, the higher test accuracy a model can get, consistent with the fact that BatchNorm is widely known as a method to improve testing accuracy.

Intuitively, we conjecture that the performance boost is due to the fact that HFC usually only involves pixels with very small magnitude (as the reconstructed images look mostly black to humans). BatchNorm conveniently rescales these signals through normalization, thus leading to improved accuracy.

One may naturally wonder what our observation implies regarding the usage of BatchNorm: we suggest that we might need to reevaluate the value of BatchNorm, especially for models to meet the expectations of being robust in multiple datasets. Our observation also conveniently aligns with another observation suggesting that BatchNorm may encourage adversarial vulnerability.

In our paper, we also have discussions relating to the paradox widely known as “rethinking generalization”, formal results on the tradeoff between accuracy and robustness, and experiments suggesting that these interesting phenomena may appear in other vision tasks such as object detection.


For more information, please refer to our paper, where we mainly draw three conclusions:

  • Since HFC may be dataset-specific, SOTA (state-of-the-art) may not be as important as the community thought; the alignment between human and the models may be more important.
  • We may need a new testing regime for computer vision; for example, a simple way is to always test the models over LFC-reconstructed images in addition to the original testing images.
  • Explicit inductive bias design to align the models and the human may play an important role.

Relevant resources:

DISCLAIMER: All opinions expressed in this post are those of the author and do not represent the views of CMU.

Read More

Maintaining the Illusion of Reality:  Transfer in RL by Keeping Agents in the DARC

Maintaining the Illusion of Reality: Transfer in RL by Keeping Agents in the DARC

Reinforcement learning (RL) is often touted as a promising approach for costly and risk-sensitive applications, yet practicing and learning in those domains directly is expensive. It costs time (e.g., OpenAI’s Dota2 project used 10,000 years of experience), it costs money (e.g., “inexpensive” robotic arms used in research typically cost $10,000 to $30,000), and it could even be dangerous to humans. How can an intelligent agent learn to solve tasks in environments in which it cannot practice?

For many tasks, such as assistive robotics and self-driving cars, we may have access to a different practice area, which we will call the source domain. While the source domain has different dynamics than the target domain, experience in the source domain is much cheaper to collect. For example, a computer simulation of the real world can run much faster than real time, collecting (say) a year of experience in an hour; it is much cheaper to simulate 1,000 robot manipulators in parallel than to maintain 1,000 robot manipulators. The source domain need not be a simulator, but rather could be any practice facility, such as a farm of robot arms, a playpen for learning to walk, or a controlled testing facility for self-driving vehicles. Our aim is to learn a policy in the source domain that will achieve high reward in a different target domain, using a limited amount of experience from the target domain.

This problem (domain adaptation for RL) is challenging because the source and target domains might have different physics and appearances. These differences mean that strategies that work well in the practice facility often don’t transfer to the target domain. For example, a good approach to driving a car around a dry racetrack (the source domain) may entail aggressive acceleration and cutting corners. If the target domain is an icy, public road, this approach may cause the car to skid off the road or hit oncoming traffic. While prior work has thoroughly studied the domain adaptation of appearance in RL [Bousmalis 2017, Ganin 2016, Higgins 2017], it ignores the domain adaptation of the physics.


Our approach stems from the idea that the agent’s experience in the source domain should look similar to its experience in the target domain. We can achieve this goal by compensating for the difference in dynamics by modifying the reward function. The figure below above shows an illustrative example, where the real world (left) contains an obstacle that is unmodeled in the simulator (center).  Our method automatically learns to assign a high penalty when the agent takes actions in the simulator that would not be possible in the real world. For this example, our method puts a high penalty (red region) in the middle of the practice facility, as transitions through that region will not be possible in the real world. In effect, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain.

How can a robot solve tasks in a target domain (left) that is different from the source domain where it practices (center)? Our method modifies the reward function to force the agent to learn behaviors that will be feasible in the target domain.

Formally, we will search for a policy whose behavior in the source domain both receives high reward and has high likelihood under the target domain dynamics. It turns out that this objective can be written as a combination of the standard MaxEnt RL objective [Ziebart 2010] together with a reward correction, $$max_pi mathbb{E}_pibigg[sum_t underbrace{r(s_t, a_t) + mathcal{H}_pi[a_t mid s_t]}_{text{MaxEnt RL}} + underbrace{Delta r(s_t, a_t, s_{t+1})}_{text{reward correction}} bigg],$$ where the reward correction ( Delta r) measures the discrepancy between the dynamics in the source domain (p_source) and the target domain (p_target): $$Delta r(s_t, a_t, s_{t+1}) = log p_text{target}(s_{t+1} mid s_t, a_t) – log p_text{source}(s_{t+1} mid s_t, a_t).$$

The idea behind our method, DARC, is to avoid taking actions that would reveal that you’re not in the real world. (left) As an example, In the Truman Show, the main character discovers that he has been living in a simulated world when he sails his boat into the wall of the set. (right) When we train a robot with our method, it avoids stepping off the stage, as doing so would indicate whether the robot is in the “real world” or the simulator. In short, our method avoids breaking the fourth wall.

The combined objective has a number of intuitive interpretations:

  • Viewing the next state shouldn’t give you any additional information about whether you are in the source or target domain.
  • You should maximize reward, subject to the constraint that physics seen by your robot in the source domain looks like the physics of the target domain.
  • Our method aims to acquire a policy that remains ignorant of whether it is interacting with the simulator or the real world.

While (Delta r) is defined in terms of dynamics, it turns out that we can estimate (Delta r) without learning a dynamics model. Applying Bayes’ Rule, we can rewrite (Delta r) as $$begin{aligned} Delta r(s_t, a_t, s_{t+1}) =& color{orange}{log p(text{target} mid s_t, a_t, s_{t+1})} – color{blue}{log p(text{target} mid s_t, a_t)} \ & – color{orange}{log p(text{source} mid s_t, a_t, s_{t+1})} + color{blue}{log p(text{source} mid s_t, a_t)}. end{aligned}$$ The orange terms can be estimated by learning a classifier that distinguishes source vs target given ((s, a, s’)); the blue terms can be estimated using a classifier that distinguishes source vs target given ((s, a)). We therefore call our method Domain Adaptation with Rewards from Classifiers, or DARC. While we do require a limited amount of experience from the target domain to learn these classifiers, our experiments show that learning a classifier (which has a binary output) is easier than learning a dynamics model (which must predict a high-dimensional output). DARC is simple to implement on top of any MaxEnt RL algorithm (e.g., on-policy, off-policy, and model-based).

Connection with Observation Adaptation: Now that we’ve introduced the DARC algorithm, we pause briefly to highlight the relationship between domain adaptation of dynamics versus observations. Consider tasks where the state (s_t triangleq (z_t, o_t)) is a combination of the system latent state (z_t) (e.g., the poses of all objects in a scene) and an observation (o_t) (e.g., a camera observation). We will define (q(o_t mid z_t)) and (p(o_t mid z_t)) as the observation models for the source and target domains (e.g., as defined by the rendering engine or the camera model). In this special case, we can write (Delta r) as the sum of two terms, which correspond to observation adaptation and dynamics adaptation: $$begin{aligned} Delta r(s_t, a_t, s_{t+1}) =& underbrace{log p_{text{target}}(o_t mid z_t) – log p_{text{source}}(o_t mid z_t)}_{text{Observation Adaptation}} \ & + underbrace{log p_{text{target}}(z_{t+1} mid z_t, a_t) – log p_{text{source}}(z_{t+1} mid z_t, a_t)}_{text{Dynamics Adaptation}} . end{aligned}$$ While prior methods excel at addressing observation adaptation [Bousmalis 2017, Gamrian 2019, Fernando 2013, Hoffman 2016, Wulfmeier 2017], only our method captures both the adaptation and dynamics terms.


We apply DARC to tasks where the robot is either broken in the target domain and where new obstacles are added in the target domain. Note that naïvely ignoring the shift in dynamics (green dashed line) performs quite poorly, and learning an explicit dynamics model (pink line) learns more slowly than our method.

We applied DARC to a number of simulated robotics tasks. On some tasks, we crippled one of the robot’s joints in the target domain, so the robot has to use practice on the fully-functional robot (source domain) to learn a policy that will work well on the broken robot. In another task, the target domain includes a new obstacle, so the robot has to use practice in a source domain (without the obstacle) to learn a policy that will avoid the obstacle. We show the results of this experiment in the figure above, plotting the reward on the target domain as a function of the number of transitions in the target domain. On all tasks, the policy learned in the simulator does not transfer to the target domain. On three of the four tasks our method matches (or even surpasses) the asymptotic performance of doing RL on the target domain, despite never doing RL on experience from the target domain, and despite observing 5 – 10x less experience from the target domain. However, as we scale to more complex tasks (“broken ant” and “half cheetah obstacle”), we observe that all baselines perform poorly.

Our method accounts for domain shift in the termination condition. In this experiment, the source domain includes safeguards that “catch” the robot and end the episode if the robot starts to fall; the target domain does not contain such safeguards. As shown on the right, our method learns to avoid taking actions that would trigger the safeguards.

Our final experiment showed how DARC can be used for safety. In many applications, robots have freedom to practice in a safe and cushioned practice facility, where they are preemptively stopped when they try to take unsafe actions. However, the real world may not contain such safeguards, so we want our robots to avoid relying on these safeguards. To study this setting, we used a simulated human robot. In the source domain, the episode terminates if the agent starts to fall; the target domain does not include this safeguard. As shown in the plot above, our method learns to remain standing for nearly the entire episode, while baselines fall almost immediately. While DARC was not designed as a method for safe RL [Tamar 2013, Achaim 2017, Berkenkamp 2017, Eysenbach 2017], this experiment suggests that safety may emerge automatically from DARC, without any manual reward function design.


The main idea of this project is that agents should avoid taking actions that would reveal whether they are in the source domain or the target domain. Even when practicing in a simulator, the agent should attempt to maintain the illusion that it is in the real world. The main limitation of our method is that the source dynamics must be sufficiently stochastic, an assumption that can usually be satisfied by adding noise to the dynamics, or ensembling a collection of sources.

For more information, check out the paper or code!

Acknowledgements: Thanks to Swapnil Asawa, Shreyas Chaudhari, Sergey Levine, Misha Khodak, Paul Michel, and Ruslan Salakhutdinov for feedback on this post.

Read More

In defense of weight-sharing for neural architecture search: an optimization perspective

In defense of weight-sharing for neural architecture search: an optimization perspective

Figure 1: Animation of how DARTS and other weight-sharing methods replace the discrete assignment of one of four operations (oin O) to an edge (e) with a (theta)-weighted combination of their outputs. At each edge (e) in the network, the value at input node (e_{in}) is passed to each operation in (O={texttt{1:Conv 3×3,  2:Conv 5×5, 3:Pool 3×3, 4:Skip Connect}}); the value at output node (e_{out}) will then be the sum of the operation outputs weighted by parameters (theta_{e,o}in[0,1]) that satisfy (sum_{oin O}theta_{e,o}=1).

This post is cross-listed on the Determined AI blog.

Neural architecture search (NAS) — selecting which neural model to use for your learning problem — is a promising but computationally expensive direction for automating and democratizing machine learning. The weight-sharing method, whose initial success at dramatically accelerating NAS surprised many in the field, has come under scrutiny due to its poor performance as a surrogate for full model-training (a miscorrelation problem known as rank disorder) and inconsistent results on recent benchmarks. In this post, we give a quick overview of weight-sharing and argue in favor of its continued use for NAS. To do so, we will consider a simple optimization formulation that reveals the following key takeaways:

  1. The fact that weight-sharing works should not really be too surprising to a community that embraces non-convex optimization of over-parameterized models.
  2. Rank disorder is not a concern for most weight-sharing methods, since we care about obtaining high-quality architectures rather than a ranking of them.
  3. The sometimes-poor performance of weight-sharing is a result of optimization issues that can be fixed while still using weight-sharing. We propose such a fix — a geometry-aware exponentiated algorithm (GAEA) — that is applicable to many popular NAS methods and achieves state-of-the-art results across several settings.

A brief history of NAS with weight-sharing

NAS is usually formulated as a bilevel optimization problem in which we are searching over some architecture domain (A) for the architecture (ain A) that achieves the lowest validation loss (ell_V(w_a,a)) after training weights (w_a) that minimize the training loss (ell_T(cdot,a)):

min_{ain A}~~ell_V(w_a,a)qquadtextrm{s.t.}qquad w_ainargmin_{winmathbb R^d}~~ell_T(w,a)

First-generation NAS methods were astronomically expensive due to the combinatorially large search space, requiring the training of thousands of neural networks to completion. Then, in their 2018 ENAS (for Efficient NAS) paper, Pham et al. introduced the idea of weight-sharing, in which only one shared set of model parameters (winmathbb R^d) is trained for all architectures. The validation losses (ell_V(w,a)) of different architectures (ain A) computed using these shared weights are then used as estimates of their validation losses (ell_V(w_a,a)) using standalone weights (w_a) (i.e. weights trained individually for each architecture by solving the inner optimization problem above). Because only one set of parameters has to be trained, weight-sharing led to a massive speedup over earlier methods, reducing search time on CIFAR-10 from 2,000-20,000 GPU-hours to just 16. Of great surprise to many was that the validation accuracies (ell_V(w,a)) computed using shared weights (w) were sufficiently good surrogates for (ell_V(w_a,a)), meaning ENAS was able to find good models cheaply. We will see that this correlation is actually a sufficient but not necessary condition for weight-sharing to do well and that weight-sharing’s overall success is less surprising than initially thought.

Following the ENAS breakthrough, several simpler methods such as DARTS and GDAS were proposed in which the categorical architecture decisions (e.g. for each network edge (ein E), which of some fixed set of operations (O) to use at (e)) are relaxed into continuous parameters (thetainTheta) (e.g. so (Theta) is a product of (|E|) simplices of size (|O|)). As animated in Figure 1, these architecture parameters govern the architecture used for updating the shared weights using the gradient of the training loss; for example, in DARTS, (theta_{e,o}) determines the weighting in the weighted sum of operations output by edge (e) in the network. Together, this parameterization leads to a continuous relaxation of the earlier bilevel problem:

min_{thetainTheta}~~ell_V(w_theta,theta)qquadtextrm{s.t.}qquad w_thetainargmin_{winmathbb R^d}~~ell_T(w,theta)

Since (Theta) is a constrained subset of (mathbb R^{|E||O|}), DARTS and GDAS avoid simplex projections by instead re-parameterizing to “logit” parameters (alphainmathbb R^{|E||O|}), with (theta=textrm{Softmax}(alpha)) defined as

theta_{e,o}=frac{exp(alpha_{e,o})}{sum_{o’in O}exp(alpha_{e,o’})}

The relaxed optimization problem can then be approximated via the following alternating gradient update scheme (here (eta>0) is a stepsize):

1.quad&texttt{initialize }w^{(0)}inmathbb R^d,alpha^{(0)}inmathbb R^{|E||O|}\
2.quad&texttt{for iteration }i=0,dots,n-1texttt{ do:}\
3.quad&quad w^{(i+1)}gets w^{(i)}-etanabla_well_T(w^{(i)},alpha^{(i)})\
4.quad&quad alpha^{(i+1)}getsalpha^{(i)}-etanabla_alphaell_V(w^{(i+1)},alpha^{(i)})\
5.quad &texttt{return }theta=textrm{Softmax}(alpha^{(n)})

Note that at the end of search, we need to recover a discrete architecture (ain A) from the architecture weights (theta); in DARTS this is done in a pruning step that simply chooses the highest-weighted operations. This post-hoc pruning is a source of error that our method, GAEA, ameliorates, as we discuss later.

A further simplification, and perhaps the most striking example of using shared weights as surrogates for standalone weights, is the Random Search with Weight-Sharing (RS-WS) method, in which the shared parameters are optimized by taking gradient steps using architectures sampled uniformly at random from the search space. Despite not even updating architecture parameters, RS-WS achieved competitive and, in some cases, state-of-the-art results.

Should we be using weight-sharing?

More recently, weight-sharing has come under increased scrutiny: does sharing weights between models really accelerate NAS? Can it preclude the recovery of optimal architectures? In particular, several papers have observed the issue of rank disorder, which is when the shared-weight performance of architectures does not correlate well with their stand-alone performance; this issue is illustrated in the figure below. Rank disorder is a problem for methods such as RS-WS that rely on the shared-weights performance to rank architectures for evaluation, as it will cause them to ignore networks that achieve high accuracy when their parameters are trained without sharing. 

Figure 2: Illustration of rank-disorder issue from Yu et al., 2020. It occurs when the performance ordering of architectures evaluated using shared-weights (right) does not match the true architecture performance when individual weights are trained from scratch (left).

This skepticism has been reinforced by recent cases where well-known weight-sharing methods have performed poorly; in particular, DARTS was found to overfit to the upper-level validation set in a recent robustness evaluation, and both GDAS and DARTS were outperformed by standard hyperparameter optimization methods on NAS-Bench-201 given a similar time budget. Here DARTS had especially poor performance, converging to architectures consisting of only identity maps (skip-connections).

Given the questions raised about weight-sharing and recent poor results, is it time to rethink its use? In the next section we answer in the negative by showing that (a) correlation of the performance of the weight-sharing “supernet” with that of fully trained models is a sufficient but not necessary condition for weight-sharing to work, i.e. we need not be afraid of rank disorder, and (b) obtaining high-quality architectural solutions using weight-sharing mostly comes down to regularization and optimization, two well-understood aspects of machine learning.

Vanilla weight-sharing is just empirical risk minimization

Weight-sharing is made possible by the fact that architecture parameters, unlike hyperparameters such as regularization and stepsize, directly affect the loss function, in that changing from one architecture (ain A) to a different architecture (a’in A) causes a change in the loss from (ell(w,a)) to (ell(w,a’)), as in the latter case a different function is being used for prediction. On the other hand, changing the stepsize setting would not change the loss unless the weights were also changed by training (w) using the new stepsize; this would mean the weights were no longer shared by different hyperparameter settings.

In fact, we can use the fact that architecture parameters can be subsumed as parameters of a supernet to pose NAS with weight-sharing as empirical risk minimization (ERM) over an extended class of functions encoded by both weights (winmathbb R^d) and architecture parameters (thetainTheta):

$$min_{winmathbb R^d,thetainTheta}ell_T(w,theta)$$

Most (but not all) empirical work on NAS uses the bilevel formulation described earlier rather than solving this single-level ERM problem. However, we believe ERM should be the baseline object of study in NAS because it is better understood statistically and algorithmically; the more common bilevel optimization can be viewed as a (possibly critical) method of regularizing by splitting the data.

The single-level formulation makes clear that, rather than rank disorder, NAS can fail for either of the usual reasons ERM fails: unsuccessful optimization or poor generalization. For example, these respective failures can largely explain the issues faced by DARTS on NAS-Bench-201 and NAS-Bench-1Shot1 that were discussed above. Of course, it is not surprising that supernets might face optimization and generalization issues, since they are non-convex and over-parameterized models; however, NAS practitioners are usually very comfortable training regular deep nets, which are also non-convex and have almost as many parameters. One major difference is that, in the latter case, we have had many years of intensive effort designing regularization schemes and optimization algorithms; if we were to dedicate a similar amount of NAS research to these two issues, then perhaps we would be no more afraid of weight-sharing than we are of training standard deep nets.

One caveat to this discussion is that NAS has the additional challenge of recovering discrete architectures from continuously-relaxed architecture weights. The optimization scheme we propose next ameliorates this issue by promoting sparse architectural parameters in the first place.

Fixing differentiable NAS with weight-sharing: a geometry-aware approach

Our discussion above reduces the problem of designing NAS methods to that of designing good regularization and optimization schemes for the supernet. There has been a good amount of recent work on better regularization schemes for the supernet, including partial channel connections, penalizing the Hessian of the validation loss, and the bilevel formulation itself; we thus focus instead on improving the optimization scheme, which we do with our geometry-aware exponentiated algorithm (GAEA).

As usual, we want an optimization method that can converge to a better solution faster. In the case of weight-sharing NAS, a high-quality solution will not only have good generalization but also induce sparse architecture weights (thetainTheta), which recall is related to the optimized parameters (alphainmathbb R^{|E||O|}) via a softmax. We expect sparse architecture parameters to be better because, as discussed earlier, the architecture parameters are rounded in a post-processing step to derive the final discrete architecture.

Figure 3: We want the final architecture weights (theta) to be sparse so that, when rounded, the resulting discrete architecture (ain A) is close to the supernet encoded by (theta). Otherwise the difference between the validation loss of the discrete architecture can be very different from that of the supernet. Since the elements of (Theta) lie in a product of simplices, sparsity means that, in each simplex, a single entry is 1 and the rest are 0.

In order to achieve this, we use exponentiated gradient descent to directly optimize over the elements (thetainTheta) instead of over unconstrained values (alphainmathbb R^{|E||O|}). The update scheme replaces subtracting the gradient w.r.t. (alpha) (line 4 in the pseudo-code) with element-wise multiplication by the negative exponentiated gradient w.r.t. (theta) (4.a), followed by projections to the simplices comprising (Theta) (4.b):

4.bquad&qquadtexttt{for }ein E,oin Otexttt{ do: }theta_{e,o}^{(i+1)}getsfrac{tildetheta_{e,o}}{sum_{o’in O}tildetheta_{e,o’}}

Note that each iteration is roughly as expensive as in SGD.

For convex problems, exponentiated gradient is known to be well-suited for the simplex geometry, with iteration complexity depending only logarithmically on the size (k) of the simplex, rather than the (mathcal O(k)) dependence of gradient descent. Under the mirror descent view of this result for linear prediction (video link), the improvement stems from the implicit regularizer encouraging larger updates when far away from a sparse target solution. For our non-convex problem, we obtain a similar guarantee by extending recent mirror descent results of Zhang & He to show that alternating the exponentiated update to the architecture parameters with SGD updates to the shared-weights yields an (varepsilon)-stationary-point in (mathcal O(log k/varepsilon^2)) iterations. We also show experimentally in Figure 4 that this approach encourages sparser solutions than DARTS and PC-DARTS.

Figure 4: Sparsity of architecture weights obtained by GAEA compared to the method it modifies on two different search spaces. Sparsity is measured using average entropy across architecture decision simplices.

Our GAEA approach, which is applicable to any method using the softmax formulation described earlier (this includes DARTS, GDAS, PC-DARTS, and others), can be summarized in two simple steps:

  1. Replace architecture weights (alpha) passed to the softmax with weights (theta) lying directly on the architecture decision simplices.
  2. Use the exponentiated gradient scheme (4.a & 4.b) to update these architecture weights (theta).

Empirical Evaluation of GAEA

Figure 5: Evaluation of GAEA on NAS-Bench-201. Standard hyperparameter optimization methods are in blue, while weight-sharing schemes are in purple. The optimal in the search space is in black. GAEA is the first weight-sharing scheme to outperform standard hyperparameter optimization on this search space and the only one to get within a standard deviation of the optimum on two of the three datasets, CIFAR-10 and CIFAR-100.

So does the sparsity and faster convergence rates of GAEA result in better performance empirically?  To test this, we simply apply the two steps above to modify existing state-of-the-art NAS methods. First, we evaluate GAEA applied to DARTS on the NAS-Bench-201 search space by Dong et al.  Of the methods evaluated by Dong et al., non-weight-sharing methods outperformed weight-sharing methods on all three datasets.  However, GAEA DARTS applied to the single-level ERM objective achieves the best accuracy across all three datasets, reaching near oracle-optimal performance on two of them. Notably, it fixes the catastrophically poor performance of DARTS on this space and is the only weight-sharing method that beats standard hyperparameter optimization.

Figure 6: Evaluation of GAEA on the DARTS search space. Weight-sharing methods are in purple, while non-weight-sharing methods are in blue. Note that the non-weight-sharing methods require more than 10,000 times as many GPU-hours as GAEA for search.

We also evaluated GAEA applied to PC-DARTS on the original DARTS CNN search space.  With improved regularization for the weight-sharing optimization problem, PC-DARTS was able to recently match the performance of computationally expensive non-weight-sharing methods on CIFAR-10 and ImageNet.  We are able to further boost the performance of PC-DARTS with GAEA and achieve state-of-the-art performance on this search space, demonstrating the importance of efficiently optimizing in the right geometry.

Interested in learning more?

To learn more about our results, weight-sharing, and NAS you can:

  • See our paper, joint with Nina Balcan and Ameet Talwalkar, for more details.
  • Download our code for a full implementation of GAEA.
  • Read about the original weight-sharing method, ENAS, and about the baseline weight-sharing method, RS-WS.
  • Read a survey of the field (Elsken, Metzen, & Hutter, JMLR 2019).

DISCLAIMER: All opinions expressed in this posts are those of the authors and do not represent the views of CMU.

Read More