Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data

Kenneth O. Stanley and Jeff Clune served as co-senior authors of this article and its corresponding paper.

At Uber, many of the hard problems we work on can benefit from machine learning, such as improving safety, improving ETAs,

The post Generative Teaching Networks: Accelerating Neural Architecture Search by Learning to Generate Synthetic Training Data appeared first on Uber Engineering Blog.

PyTorch adds new tools and libraries, welcomes Preferred Networks to its community

PyTorch continues to be used for the latest state-of-the-art research on display at the NeurIPS conference next week, making up nearly 70% of papers that cite a framework. In addition, we’re excited to welcome Preferred Networks, the maintainers of the Chainer framework, to the PyTorch community. Their teams are moving fully over to PyTorch for developing their ML capabilities and services.

This growth underpins PyTorch’s focus on building for the needs of the research community, and increasingly, supporting the full workflow from research to production deployment. To further support researchers and developers, we’re launching a number of new tools and libraries for large scale computer vision and elastic fault tolerant training. Learn more on GitHub and at our NeurIPS booth.

Preferred Networks joins the PyTorch community

Preferred Networks, Inc. (PFN) announced plans to move its deep learning framework from Chainer to PyTorch. As part of this change, PFN will collaborate with the PyTorch community and contributors, including people from Facebook, Microsoft, CMU, and NYU, to participate in the development of PyTorch.

PFN developed Chainer, a deep learning framework that introduced the concept of define-by-run (also referred to as eager execution), to support and speed up its deep learning development. Chainer has been used at PFN since 2015 to rapidly solve real-world problems with the latest, cutting-edge technology. Chainer was also one of the inspirations for PyTorch’s initial design, as outlined in the PyTorch NeurIPS paper.

PFN has driven innovative work with CuPy, ImageNet in 15 minutes, Optuna, and other projects that have pushed the boundaries of design and engineering. As part of the PyTorch community, PFN brings with them creative engineering capabilities and experience to help take the framework forward. In addition, PFN’s migration to PyTorch will allow it to efficiently incorporate the latest research results to accelerate its R&D activities, given PyTorch’s broad adoption with researchers, and to collaborate with the community to add support for PyTorch on MN-Core, a deep learning processor currently in development.

We are excited to welcome PFN to the PyTorch community, and to jointly work towards the common goal of furthering advances in deep learning technology. Learn more about PFN’s migration to PyTorch here.

Tools for elastic training and large scale computer vision

PyTorch Elastic (Experimental)

Large scale model training is becoming commonplace with architectures like BERT and the growth of model parameter counts into the billions or even tens of billions. To achieve convergence at this scale in a reasonable amount of time, the use of distributed training is needed.

The current PyTorch Distributed Data Parallel (DDP) module enables data parallel training where each process trains the same model but on different shards of data. It enables bulk synchronous, multi-host, multi-GPU/CPU execution of ML training. However, DDP has several shortcomings: jobs cannot start without acquiring all the requested nodes; jobs cannot continue after a node fails due to an error or transient issue; jobs cannot incorporate a node that joins later; and progress cannot be made in the presence of a slow or stuck node.
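
To make the bulk synchronous, data parallel pattern concrete, here is a minimal pure-Python sketch. It is not the DDP API: the toy model, the round-robin sharding, and the helper names (`grad_mse`, `shard_data`, `all_reduce_mean`, `data_parallel_step`) are all assumptions made for illustration. Each simulated worker computes a gradient on its own shard, and the averaged gradient is applied identically everywhere.

```python
# Toy model: y_hat = w * x, trained with mean squared error.
def grad_mse(w, shard):
    """Gradient of mean((w*x - y)^2) over one shard of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def shard_data(data, world_size):
    """Round-robin sharding: worker `rank` gets every world_size-th example."""
    return [data[rank::world_size] for rank in range(world_size)]

def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across workers."""
    return sum(values) / len(values)

def data_parallel_step(w, data, world_size, lr=0.01):
    shards = shard_data(data, world_size)
    local_grads = [grad_mse(w, s) for s in shards]  # one gradient per worker
    g = all_reduce_mean(local_grads)                # every worker waits here
    return w - lr * g                               # same update on all replicas

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # made-up data, true w = 3
w_parallel = data_parallel_step(0.0, data, world_size=4)
w_single = data_parallel_step(0.0, data, world_size=1)
```

With equally sized shards, the averaged gradient matches the full-batch gradient, so `w_parallel` and `w_single` agree. The synchronization point in `all_reduce_mean` is also where a missing or slow worker would stall the whole job.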

The focus of PyTorch Elastic, which uses Elastic Distributed Data Parallelism, is to address these issues and build a generic framework and set of APIs for PyTorch to enable reliable and elastic execution of these data parallel training workloads. It will provide better programmability, higher resilience to failures of all kinds, and more efficient, larger-scale training than pure DDP.

Elasticity, in this case, means both: 1) the ability for a job to continue after node failure (by running with fewer nodes and/or by incorporating a new host and transferring state to it); and 2) the ability to add/remove nodes dynamically due to resource availability changes or bottlenecks.
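
As a rough illustration of both meanings of elasticity, the sketch below recomputes the data assignment whenever the set of live nodes changes. This is a hypothetical toy, not the PyTorch Elastic API; the function names and node labels are invented, and real systems also have to transfer model and optimizer state to new hosts.

```python
def assign_shards(num_samples, workers):
    """Map each live worker to the sample indices it should process."""
    workers = sorted(workers)
    return {
        name: list(range(rank, num_samples, len(workers)))
        for rank, name in enumerate(workers)
    }

def on_membership_change(num_samples, workers, failed=(), joined=()):
    """Recompute shards after nodes fail or join, so the job can continue."""
    live = (set(workers) - set(failed)) | set(joined)
    if not live:
        raise RuntimeError("no workers left to continue the job")
    return assign_shards(num_samples, live)

# Hypothetical node names for illustration.
plan = assign_shards(10, ["node-a", "node-b", "node-c"])
plan_after_failure = on_membership_change(10, plan.keys(), failed=["node-b"])
plan_after_join = on_membership_change(10, plan_after_failure.keys(), joined=["node-d"])
```

After each change, every sample is still owned by exactly one live worker, which is the property that lets training proceed with fewer or more nodes.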

While this feature is still experimental, you can try it out on AWS EC2, with the instructions here. Additionally, the PyTorch distributed team is working closely with teams across AWS to support PyTorch Elastic training within services such as Amazon Sagemaker and Elastic Kubernetes Service (EKS). Look for additional updates in the near future.

New Classification Framework

Image and video classification are at the core of content understanding. To that end, you can now leverage a new end-to-end framework for large-scale training of state-of-the-art image and video classification models. It allows researchers to quickly prototype and iterate on large distributed training jobs at the scale of billions of images. Advantages include:

  • Ease of use – This framework features a modular, flexible design that allows anyone to train machine learning models on top of PyTorch using very simple abstractions. The system also has out-of-the-box integration with AWS on PyTorch Elastic, facilitating research at scale and making it simple to move between research and production.
  • High performance – Researchers can use the framework to train models such as ResNet50 on ImageNet in as little as 15 minutes.

You can learn more at the NeurIPS Expo workshop on Multi-Modal research to production or get started with the PyTorch Elastic ImageNet example here.

Come see us at NeurIPS

The PyTorch team will be hosting workshops at NeurIPS during the industry expo on 12/8. Join the sessions below to learn more, and visit the team at the PyTorch booth on the show floor and during the Poster Session. At the booth, we’ll be walking through an interactive demo of PyTorch running fast neural style transfer on a Cloud TPU – here’s a sneak peek.

We’re also publishing a paper that details the principles that drove the implementation of PyTorch and how they’re reflected in its architecture.

Multi-modal Research to Production – This workshop will dive into a number of modalities, such as computer vision (large-scale image classification and instance segmentation) and translation and speech (seq-to-seq Transformers), through the lens of taking cutting-edge research to production. Lastly, we will also walk through how to use the latest APIs in PyTorch to take models developed in eager mode into graph mode via TorchScript and quantize them for production deployment at scale on servers or mobile devices. Libraries used include:

  • Classification Framework – a newly open sourced PyTorch framework developed by Facebook AI for research on large-scale image and video classification. It allows researchers to quickly prototype and iterate on large distributed training jobs. Models built on the framework can be seamlessly deployed to production.
  • Detectron2 – the recently released object detection library built by the Facebook AI Research computer vision team. We will describe the improvements over the previous version, including: 1) support for the latest models and new tasks; 2) increased flexibility, to enable new computer vision research; 3) better maintainability and scalability, to support production use cases.
  • Fairseq – a general-purpose sequence-to-sequence library that can be used in many applications, including (unsupervised) translation, summarization, dialog, and speech recognition.

Responsible and Reproducible AI – This workshop on Responsible and Reproducible AI will dive into important areas that are shaping the future of how we interpret, reproduce research, and build AI with privacy in mind. We will cover major challenges, walk through solutions, and finish each talk with a hands-on tutorial.

  • Reproducibility: As the number of research papers submitted to arXiv and conferences skyrockets, scaling reproducibility becomes difficult. We must address the following challenges: aid extensibility by standardizing code bases, democratize paper implementation by writing hardware-agnostic code, and facilitate results validation by documenting the “tricks” authors use to make their complex systems function. To offer solutions, we will dive into tools like PyTorch Hub and PyTorch Lightning, which are used by some of the top researchers in the world to reproduce the state of the art.
  • Interpretability: With the increase in model complexity and the resulting lack of transparency, model interpretability methods have become increasingly important. Model understanding is both an active area of research and an area of focus for practical applications across industries using machine learning. To get hands-on, we will use the recently released Captum library, which provides state-of-the-art algorithms that give researchers and developers an easy way to understand the importance of neurons and layers and the predictions made by our models.
  • Private AI: Practical applications of ML via cloud-based or machine-learning-as-a-service platforms pose a range of security and privacy challenges. There are a number of technical approaches being studied including: homomorphic encryption, secure multi-party computation, trusted execution environments, on-device computation, and differential privacy. To provide an immersive understanding of how some of these technologies are applied, we will use the CrypTen project which provides a community based research platform to take the field of Private AI forward.

We’d like to thank the entire PyTorch team and the community for all their contributions to this work.


Team PyTorch

OpenMined and PyTorch partner to launch fellowship funding for privacy-preserving ML community

Many applications of machine learning (ML) pose a range of security and privacy challenges. In particular, users may not be willing or allowed to share their data, which prevents them from taking full advantage of ML platforms like PyTorch. To take the field of privacy-preserving ML (PPML) forward, OpenMined and PyTorch are announcing plans to jointly develop a combined platform to accelerate PPML research as well as new funding for fellowships.

There are many techniques attempting to solve the problem of privacy in ML, each at various levels of maturity. These include (1) homomorphic encryption, (2) secure multi-party computation, (3) trusted execution environments, (4) on-device computation, (5) federated learning with secure aggregation, and (6) differential privacy. Additionally, a number of open source projects implementing these techniques were created with the goal of enabling research at the intersection of privacy, security, and ML. Among them, PySyft and CrypTen have taken an “ML-first” approach by presenting an API that is familiar to the ML community, while masking the complexities of privacy and security protocols. We are excited to announce that these two projects are now collaborating closely to build a mature PPML ecosystem around PyTorch.

Additionally, to bolster this ecosystem and take the field of privacy-preserving ML forward, we are calling for contributions and supporting research efforts on this combined platform. We are providing funding to support the OpenMined community and the researchers who contribute, build proofs of concept, and want to be on the cutting edge of how privacy-preserving technology is applied. We will provide funding through the RAAIS Foundation, a non-profit organization with a mission to advance education and research in artificial intelligence for the common good. We encourage interested parties to apply to one or more of the fellowships listed below.

Tools Powering the Future of Privacy-Preserving ML

The next generation of privacy-preserving open source tools enables ML researchers to easily experiment with ML models using secure computing techniques without needing to be cryptography experts. By integrating with PyTorch, PySyft and CrypTen offer familiar environments for ML developers to research and apply these techniques as part of their work.

PySyft is a Python library for secure and private ML developed by the OpenMined community. It is a flexible, easy-to-use library that makes secure computation techniques like multi-party computation (MPC) and privacy-preserving techniques like differential privacy accessible to the ML community. It prioritizes ease of use and focuses on integrating these techniques into end-user use cases like federated learning with mobile phones and other edge devices, encrypted ML as a service, and privacy-preserving data science.
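
For readers new to differential privacy, the following is a minimal sketch of the classic Laplace mechanism for a counting query. It does not use PySyft; the helper names and the example records are assumptions made for illustration only.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1 / epsilon is enough.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 67, 38, 45]  # made-up records
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy; the released value is centered on the true count of 4 but is rarely exactly 4.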

CrypTen is a framework built on PyTorch that enables private and secure ML for the PyTorch community. It is the first step along the journey towards a privacy-preserving mode in PyTorch that will make secure computing techniques accessible beyond cryptography researchers. It currently implements secure multiparty computation with the goal of offering other secure computing backends in the near future. Other benefits to ML researchers include:

  • It is ML first and presents secure computing techniques via a CrypTensor object that looks and feels exactly like a PyTorch Tensor. This allows the user to use automatic differentiation and neural network modules akin to those in PyTorch.
  • The framework focuses on scalability and performance and is built with real-world challenges in mind.
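
To give a feel for what secure multiparty computation does underneath an encrypted tensor, here is a minimal sketch of additive secret sharing in plain Python. It is not CrypTen code; the modulus, the function names, and the example values are assumptions chosen for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical modulus; all arithmetic below is mod MODULUS

def share(value, num_parties):
    """Split an integer into random additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def reconstruct(shares):
    """Only the sum of all shares reveals the secret."""
    return sum(shares) % MODULUS

def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally; no communication is needed."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]

x_shares = share(52_000, num_parties=3)
y_shares = share(61_000, num_parties=3)
total = reconstruct(add_shared(x_shares, y_shares))
```

Here `total` equals 113000, yet no single party ever saw either input, because any individual share is a uniformly random number. Multiplication and non-linear functions need extra protocol steps that this sketch leaves out.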

The focus areas for CrypTen and PySyft are naturally aligned and complement each other. The former focuses on building support for various secure and privacy preserving techniques on PyTorch through an encrypted tensor abstraction, while the latter focuses on end user use cases like deployment on edge devices and a user friendly data science platform.

Working together will enable PySyft to use CrypTen as a backend for encrypted tensors. This can lead to an increase in performance for PySyft and the adoption of CrypTen as a runtime by PySyft’s userbase. In addition to this, PyTorch is also adding cryptography friendly features such as support for cryptographically secure random number generation. Over the long run, this allows each library to focus exclusively on its core competencies while enjoying the benefits of the synergistic relationship.

New Funding for OpenMined Contributors

We are especially excited to announce that the PyTorch team has invested $250,000 to support OpenMined in furthering the development and proliferation of privacy-preserving ML. This gift will be facilitated via the RAAIS Foundation and will be available immediately to support paid fellowship grants for the OpenMined community.

How to get involved

Thanks to the support from the PyTorch team, OpenMined is able to offer three different opportunities for you to participate in the project’s development. Each of these fellowships furthers our shared mission to lower the barrier-to-entry for privacy-preserving ML and to create a more privacy-preserving world.

Core PySyft CrypTen Integration Fellowships

During these fellowships, we will integrate CrypTen as a supported backend for encrypted computation in PySyft. This will allow for the high-performance, secure multi-party computation capabilities of CrypTen to be used alongside other important tools in PySyft such as differential privacy and federated learning. For more information on the roadmap and how to apply for a paid fellowship, check out the project’s call for contributors.

Federated Learning on Mobile, Web, and IoT Devices

During these fellowships, we will be extending PyTorch with the ability to perform federated learning across mobile, web, and IoT devices. To this end, a PyTorch front-end will be able to coordinate across federated learning backends that run in JavaScript, Kotlin, Swift, and Python. Furthermore, we will also extend PySyft with the ability to coordinate these backends using peer-to-peer connections, providing low latency and the ability to run secure aggregation as a part of the protocol. For more information on the roadmap and how to apply for a paid fellowship, check out the project’s call for contributors.
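
As background, one common way to combine updates in federated learning is federated averaging, sketched below in plain Python. This is an illustration under stated assumptions, not the planned PySyft design: the one-parameter model, the function names, and the per-device data are all made up, and secure aggregation is omitted.

```python
def local_update(w, data, lr=0.1):
    """One gradient step of least squares y ~ w * x on a device's private data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of examples."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total

def federated_round(global_w, clients):
    # Raw data never leaves a device; only the updated weight is sent back.
    updates = [local_update(global_w, data) for data in clients]
    return federated_average(updates, [len(d) for d in clients])

clients = [
    [(1.0, 2.0), (2.0, 4.0)],              # device 1 (made-up data, true w = 2)
    [(3.0, 6.0)],                          # device 2
    [(0.5, 1.0), (1.5, 3.0), (1.0, 2.0)],  # device 3
]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
```

After a few dozen rounds `w` approaches 2, the slope shared by all three devices, even though the coordinator only ever handled model weights.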

Development Challenges

Over the coming months, we will issue regular open competitions for increasing the performance and security of the PySyft and PyGrid codebases. For performance-related challenges, contestants will compete (for a cash prize) to make a specific PySyft demo (such as federated learning) as fast as possible. For security-related challenges, contestants will compete to hack into a PyGrid server. The first to demonstrate their ability will win the cash bounty! For more information on the challenges and to sign up to receive emails when each challenge is opened, sign up here.

To apply, select one of the above projects and identify a role that matches your strengths!


Andrew, Laurens, Joe, and Shubho

Deep Double Descent

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.

Many classes of modern deep learning models, including CNNs, ResNets, and transformers, exhibit the previously-observed double descent phenomenon when not using early stopping or regularization. The peak occurs predictably at a “critical regime,” where the models are barely able to fit the training set. As we increase the number of parameters in a neural network, the test error initially decreases, increases, and, just as the model is able to fit the train set, undergoes a second descent.

Neither the classical statisticians’ conventional wisdom that too-large models are worse nor the modern ML paradigm that bigger models are better holds up. We find that double descent also occurs over train epochs. Surprisingly, we show these phenomena can lead to a regime where more data hurts, and training a deep network on a larger train set actually performs worse.

Model-wise double descent

1. There is a regime where bigger models are worse.

The model-wise double descent phenomenon can lead to a regime where training on more data hurts. In the chart above, the peak in test error occurs around the interpolation threshold, when the models are just barely large enough to fit the train set.

In all cases we’ve observed, changes which affect the interpolation threshold (such as changing the optimization algorithm, the number of train samples, or the amount of label noise) also affect the location of the test error peak correspondingly. The double descent phenomenon is most prominent in settings with added label noise; without it, the peak is smaller and easy to miss. Adding label noise amplifies this general behavior and allows us to investigate it easily.
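
For anyone who wants to reproduce the label-noise setting, here is a minimal sketch of how noise could be injected into a training set. The exact noise model is an assumption on our part: with probability `noise_rate`, a label is replaced by a class drawn uniformly at random (which may happen to be the original class). The function name is hypothetical.

```python
import random

def add_label_noise(labels, num_classes, noise_rate, seed=0):
    """With probability noise_rate, replace a label with a uniformly random class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.randrange(num_classes))  # may equal the true label
        else:
            noisy.append(y)
    return noisy

clean = [i % 10 for i in range(10_000)]  # made-up 10-class labels
noisy = add_label_noise(clean, num_classes=10, noise_rate=0.2)
changed = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
```

Under this noise model, roughly `noise_rate * (1 - 1/num_classes)` of the labels actually change, so about 18% here. Noise is applied to the train set only; the test set stays clean.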

Sample-wise non-monotonicity

2. There is a regime where more samples hurt.

The above chart shows transformers trained on a language-translation task with no added label noise. As expected, increasing the number of samples shifts the curve downwards towards lower test error. However, since more samples require larger models to fit, increasing the number of samples also shifts the interpolation threshold (and peak in test error) to the right.

For intermediate model sizes (red arrows), these two effects combine, and we see that training on 4.5x more samples actually hurts test performance.

Epoch-wise double descent

3. There is a regime where training longer reverses overfitting.

The charts above show test and train error as a function of both model size and number of optimization steps. For a given number of optimization steps (fixed y-coordinate), test and train error exhibit model-size double descent. For a given model size (fixed x-coordinate), as training proceeds, test and train error decrease, increase, and decrease again; we call this phenomenon epoch-wise double descent.

In general, the peak of test error appears systematically when models are just barely able to fit the train set.

Our intuition is that, for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or misspecified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set and there exist such good models. Moreover, the implicit bias of stochastic gradient descent (SGD) leads it to such good models, for reasons we don’t yet understand.

We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.


Thanks to Mikhail Belkin and Chris Olah for helpful discussions and feedback throughout this work. An expanded version of this post can also be found on Boaz Barak’s blog, Windows on Theory.


Text Feature Selection for Causal Inference

Making Causal Inferences with Text

Identifying the linguistic features that cause people to act a certain way after reading a text, regardless of confounding variables, is something people do all the time without even realizing it. For example,

  • Consider university course catalogues. Students peruse these each semester before signing up. What’s the magic 200-word blurb that resonates with students enough for them to sign up? What kind of writing style recommendations could you give to any professor, regarding any subject?
  • Consider crowdfunding campaigns [1]. We want to know which writing styles pull in the most money, but the effect of language is confounded by the subject of the campaign – a campaign for someone’s medical bills will be written differently than a campaign for building wells. We want to find writing styles that could help any campaign.
  • Consider comments on reddit, where each post has a popularity score. Say that we’re interested in finding what writing styles will help posts become popular. Some authors list their genders on reddit, and a user’s gender may also affect popularity through tone, style, or topic choices [2]. How do you decide what kind of language to recommend to any person, regardless of their gender?

Across three papers, we develop adversarial learning-based approaches for these kinds of tasks as well as a theory of causal inference to formalize the relationship between text and causality. Our method involves:

  1. Training a model which predicts outcomes from text. We control for confounds with adversarial learning [3], [4] or residualization [5].

  2. Interpreting the models’ learned parameters to identify the most important words and phrases for the outcome, regardless of confounders.

Compared to other feature selection methods, ours picks features that are more predictive of the outcome and less affected by confounding variables across four domains: e-commerce product descriptions (predictive of sales, regardless of brand), search advertisements (predictive of click-through rate, regardless of landing page), university course descriptions (predictive of enrollment, regardless of subject), and financial complaints (predictive of a short response time, regardless of topic).

Formalizing Textual Causality

Our goal is to find features of text(s) T which are predictive of some desired target variable(s) Y but unrelated to confounding variable(s) C (i.e. the blue bit in the figure below). This is equivalent to picking a lexicon L such that when words in T belonging to L are selected, the resulting set L(T) can explain Y but not C.

In the paper, we formalize this intuitive goal into maximizing an informativeness coefficient I(L),

which measures the explanatory power of the lexicon L(T) beyond the information already contained in the confounders C. The red tells us how much variation in Y is explainable by both L(T) and C. The blue fixes C, letting us focus on L(T)’s unique effects. In our paper, we show that under some conditions this coefficient is equivalent to the strength of T’s causal effects on Y! [6]
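
For reference, one expression that matches this description is sketched below. Treat it as our reading of the prose rather than a quotation: the inner conditional expectation plays the role of the “red” term (variation in Y explainable by both L(T) and C), and the outer conditional variance plays the role of the “blue” term (fixing C).

```latex
I(L) \;=\; \mathbb{E}\Big[\, \operatorname{Var}\big[\, \mathbb{E}[\, Y \mid L(T),\, C \,] \;\big|\; C \,\big] \Big]
```

Read from the inside out: predict Y from the lexicon features and the confounds, measure how much that prediction still varies once C is held fixed, and average over values of C.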

In practice I(L) can be estimated by this sequence of steps:

  1. Training a classifier A that predicts Y from L(T) and C.
  2. Training a classifier B that predicts Y from C.
  3. Measuring error(B) – error(A).
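
The estimation steps above can be sketched in a few lines of plain Python. This is a toy under loud assumptions: the “classifiers” are simple lookup tables that predict the mean outcome for each input, error is in-sample mean squared error, and the ads, confounds, and outcomes are made up. None of the names come from the papers.

```python
from collections import defaultdict

def fit_group_means(keys, ys):
    """'Train' a lookup-table predictor: predict the mean outcome seen for each key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, y in zip(keys, ys):
        sums[k] += y
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

def mse(model, keys, ys):
    return sum((model[k] - y) ** 2 for k, y in zip(keys, ys)) / len(ys)

def informativeness(texts, confounds, ys, lexicon):
    """Estimate I(L) as error(B) - error(A), using in-sample squared error."""
    selected = [frozenset(set(t.split()) & lexicon) for t in texts]  # L(T)
    keys_a = list(zip(selected, confounds))  # classifier A sees L(T) and C
    keys_b = list(confounds)                 # classifier B sees only C
    model_a = fit_group_means(keys_a, ys)
    model_b = fit_group_means(keys_b, ys)
    return mse(model_b, keys_b, ys) - mse(model_a, keys_a, ys)

# Made-up ads: text, confound (product category), outcome (click-through rate).
texts = ["official site shoes", "cheap shoes", "official site jobs", "cheap jobs"]
confounds = ["apparel", "apparel", "jobs", "jobs"]
ys = [0.9, 0.5, 0.6, 0.2]

score_style = informativeness(texts, confounds, ys, {"official"})
score_topic = informativeness(texts, confounds, ys, {"shoes", "jobs"})
```

In this toy data the lexicon {"official"} explains outcome variation that the category cannot, so `score_style` is positive, while {"shoes", "jobs"} merely restates the confound and `score_topic` is zero. A real estimate would use held-out data and proper classifiers.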

We continue by introducing two methods for coming up with the best lexicon L(T).

Method 1: Adversarial Learning

First, we encode T into a vector e via an attentional bi-LSTM. We then feed e into a series of feedforward neural networks which are trained to predict each target and confounding variable using a cross-entropy loss function. As gradients back-propagate from the confound prediction heads to the encoder, we pass them through a gradient reversal layer. In other words, if the cumulative loss of the target variables is L_t and that of the confounds is L_c, then the loss which is implicitly used to train the encoder is L_e = L_t – L_c. The encoder is encouraged to learn representations of the text which are unrelated to the confounds.
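
To see why reversing the confound gradient amounts to training the encoder on L_t – L_c, consider the hand-derived scalar sketch below. It is a deliberately tiny stand-in, not the model from the papers: the encoder is a single weight, both heads are single weights, and squared error replaces cross-entropy so the derivatives can be checked by hand. All names and numbers are assumptions.

```python
def forward(w, a, b, x):
    e = w * x                 # toy "encoder": e = w * x
    return e, a * e, b * e    # encoding, target prediction, confound prediction

def losses(w, a, b, x, y, c):
    _, y_hat, c_hat = forward(w, a, b, x)
    return (y_hat - y) ** 2, (c_hat - c) ** 2  # (L_t, L_c)

def gradients(w, a, b, x, y, c):
    e, y_hat, c_hat = forward(w, a, b, x)
    grad_a = 2 * (y_hat - y) * e       # target head minimizes L_t
    grad_b = 2 * (c_hat - c) * e       # confound head minimizes L_c
    dLt_de = 2 * (y_hat - y) * a
    dLc_de = 2 * (c_hat - c) * b
    grad_e = dLt_de + (-1.0) * dLc_de  # gradient reversal on the confound path
    grad_w = grad_e * x                # chain rule through e = w * x
    return grad_w, grad_a, grad_b

# Made-up values for illustration.
w, a, b, x, y, c = 0.5, 1.2, -0.7, 2.0, 1.0, 0.3
grad_w, grad_a, grad_b = gradients(w, a, b, x, y, c)
```

The confound head still learns to predict C as well as it can, while the encoder receives the negated signal and is pushed toward encodings from which C is harder to predict. Differentiating L_t – L_c with respect to `w` directly gives the same value as `grad_w`.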

To get the “importance” of each feature, we simply look at the attention scores of the model, since ngrams the model focused on while making Y-predictions in a C-invariant way are themselves predictive of Y but not C!

Method 2: Deep Residualization

Recall that we can estimate I(L) by measuring the amount by which L can further improve predictions of Y compared to predictions of Y made from just C. Our Deep Residualization algorithm is directly motivated by this. It first predicts Y from C as well as possible, and then seeks to fine-tune those predictions using a bag-of-words representation of the text T. The parameters are then updated using the loss from both prediction steps. This two-stage prediction process implicitly controls for C because T is being used to explain the part of Y’s variance that the confounds can’t explain.
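
The two-stage idea can be illustrated with a much simpler, non-neural stand-in. The sketch below is classical residualization rather than the jointly trained Deep Residualization model: stage one predicts Y by the mean outcome for each confound value, and stage two scores each word by how the leftover residual differs between texts with and without it. Function names and data are made up.

```python
from collections import defaultdict

def residualize(confounds, ys):
    """Stage 1: predict Y from C (per-confound mean) and return the residuals."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(confounds, ys):
        sums[c] += y
        counts[c] += 1
    return [y - sums[c] / counts[c] for c, y in zip(confounds, ys)]

def word_scores(texts, residuals):
    """Stage 2: mean residual of texts containing a word minus that of the rest."""
    bags = [set(t.split()) for t in texts]  # bag-of-words per text
    scores = {}
    for word in set().union(*bags):
        with_word = [r for bag, r in zip(bags, residuals) if word in bag]
        without = [r for bag, r in zip(bags, residuals) if word not in bag]
        if with_word and without:
            scores[word] = sum(with_word) / len(with_word) - sum(without) / len(without)
    return scores

# Made-up ads: text, confound (product category), outcome (click-through rate).
texts = ["official site shoes", "cheap shoes", "official site jobs", "cheap jobs"]
confounds = ["apparel", "apparel", "jobs", "jobs"]
ys = [0.9, 0.5, 0.6, 0.2]

scores = word_scores(texts, residualize(confounds, ys))
```

Because the category’s contribution is subtracted first, “shoes” scores zero while “official” scores about 0.4: only the part of Y that C cannot explain is left for the text to account for.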

Then to get the “importance” of each feature, we trace all possible paths between the feature and output, multiply weights along these paths, then sum across paths.
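
The path-tracing step can be written out directly. The sketch below assumes a small feedforward network with a single output and ignores nonlinearities, which is our simplification; the weights and function names are made up. It enumerates every path from a feature to the output, multiplies the weights along it, and sums.

```python
from itertools import product

def path_importance(weight_matrices):
    """For each input feature, sum the weight products over all paths to the output."""
    n_in = len(weight_matrices[0])
    layer_sizes = [len(W[0]) for W in weight_matrices]  # units after each layer
    scores = []
    for i in range(n_in):
        total = 0.0
        for path in product(*(range(n) for n in layer_sizes)):
            prev, weight = i, 1.0
            for W, node in zip(weight_matrices, path):
                weight *= W[prev][node]  # weight on the edge prev -> node
                prev = node
            total += weight
        scores.append(total)
    return scores

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Made-up weights: 3 word features -> 2 hidden units -> 1 output.
W1 = [[1.0, -0.5], [0.5, 0.0], [0.0, 3.0]]
W2 = [[2.0], [1.0]]

importance = path_importance([W1, W2])
```

For a purely linear stack, summing over paths is the same as multiplying the weight matrices together, so `importance` matches the single column of `matmul(W1, W2)`; that product is the cheaper way to compute it when there are many hidden units.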

Social Science Applications

Armed with our theoretical framework and algorithms, we can now pick words and phrases that are strongly associated with arbitrary outcomes, regardless of confounding information. In our papers, we do this for four domains:

  • Product descriptions for chocolate and health products on the Japanese e-commerce website Rakuten. We want to find language that explains sales, but not brand or price.
  • Written complaints to the Consumer Financial Protection Bureau (CFPB). We want to find language that predicts short response time, regardless of the financial product the complaint is about.
  • Search advertisements for real estate, job listings, and apparel on the website We want to find language that predicts a high click-through rate (CTR), regardless of the landing page the ad points to.
  • Course descriptions and enrollment figures for 6 years of undergraduate offerings at Stanford University. We want to find language that boosts enrollment, regardless of subject and requirements.

As we can see, in each setting one or both of our proposed methods outperform a number of existing feature selection algorithms: Residualized Regressions (RR), Regression with Confound features (RC), Mixed-effects Regression (MR), Mutual information (MI), and Log-Odds Ratio (OR).

Furthermore, we can interpret features these algorithms are selecting to learn about the linguistic dynamics of the associated domains!

  • Appeals to politeness and seasonality appear to help make for successful Japanese product descriptions – an interesting intersection of language and culture.
  • Concrete details (“multiple”, “xx/xx/xxxx”) and already having taken some steps (“submitted”, “ago”) appear important for writing a complaint that will get handled quickly.
  • Appeals to authority (“®”, “Official site”) and personalization (“your”, “personalized”) are helpful for search advertising creatives.
  • Student choice (“or”) and dynamic activities (“eating”, “doing”, “guest”, “project”) make for successful course descriptions.


This work presented two methods for identifying text features which best explain an outcome, controlling for confounding variables we are not interested in. These methods are generally applicable to a variety of data science and social science applications. In the future, we hope to strengthen the methods’ theoretical guarantees in a causal inference framework.

The algorithms in this blog post have been open-sourced! Install via pip:

pip3 install causal-selection

This post was based on the following papers:

  1. Deconfounded Lexicon Induction for Interpretable Social Science

  2. Interpretable Neural Architectures for Attributing an Ad’s Performance to its Writing Style

  3. Predicting Sales from the Language of Product Descriptions
