SAIL at ICLR 2020: Accepted Papers and Videos

The International Conference on Learning Representations (ICLR) 2020 is being hosted virtually from April 26th – May 1st. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation

paper

Suraj Nair, Chelsea Finn | contact: surajn@stanford.edu
keywords: visual planning; reinforcement learning; robotics

Active World Model Learning with Progress Curiosity

paper

Kuno Kim, Megumi Sano, Julian De Freitas, Nick Haber, Dan Yamins | contact: khkim@cs.stanford.edu
keywords: curiosity, reinforcement learning, cognitive science

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

paper | blog post

Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré | contact: trid@stanford.edu
keywords: structured matrices, efficient ml, algorithms, butterfly matrices, arithmetic circuits

Weakly Supervised Disentanglement with Guarantees

paper

Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon, Ben Poole | contact: ruishu@stanford.edu
keywords: disentanglement, generative models, weak supervision, representation learning, theory

Depth-Width Trade-offs for ReLU Networks via Sharkovsky’s Theorem

paper

Vaggos Chatziafratis, Sai Ganesh Nagarajan, Ioannis Panageas, Xiao Wang | contact: vaggos@cs.stanford.edu
keywords: dynamical systems, benefits of depth, expressivity

Watch, Try, Learn: Meta-Learning from Demonstrations and Reward

paper

Allan Zhou, Eric Jang, Daniel Kappler, Alex Herzog, Mohi Khansari, Paul Wohlhart, Yunfei Bai, Mrinal Kalakrishnan, Sergey Levine, Chelsea Finn | contact: ayz@stanford.edu
keywords: imitation learning, meta-learning, reinforcement learning

Assessing robustness to noise: low-cost head CT triage

paper

Sarah Hooper, Jared Dunnmon, Matthew Lungren, Sanjiv Sam Gambhir, Christopher Ré, Adam Wang, Bhavik Patel | contact: smhooper@stanford.edu
keywords: ai for affordable healthcare workshop, medical imaging, sinogram, ct, image noise

Learning transport cost from subset correspondence

paper

Ruishan Liu, Akshay Balsubramani, James Zou | contact: ruishan@stanford.edu
keywords: optimal transport, data alignment, metric learning

Generalization through Memorization: Nearest Neighbor Language Models

paper

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis | contact: urvashik@stanford.edu
keywords: language models, k-nearest neighbors

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

paper

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, Percy Liang | contact: ssagawa@cs.stanford.edu
keywords: distributionally robust optimization, deep learning, robustness, generalization, regularization

Phase Transitions for the Information Bottleneck in Representation Learning

paper

Tailin Wu, Ian Fischer | contact: tailin@cs.stanford.edu
keywords: information theory, representation learning, phase transition

Improving Neural Language Generation with Spectrum Control

paper

Lingxiao Wang, Jing Huang, Kevin Huang, Ziniu Hu, Guangtao Wang, Quanquan Gu | contact: jhuang18@stanford.edu
keywords: neural language generation, pre-trained language model, spectrum control

Understanding and Improving Information Transfer in Multi-Task Learning

paper | blog post

Sen Wu, Hongyang Zhang, Christopher Ré | contact: senwu@cs.stanford.edu
keywords: multi-task learning

Strategies for Pre-training Graph Neural Networks

paper | blog post

Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, Jure Leskovec | contact: weihuahu@cs.stanford.edu
keywords: pre-training, transfer learning, graph neural networks

Query2box: Reasoning over Knowledge Graphs in Vector Space using Box Embeddings

paper

Hongyu Ren, Weihua Hu, Jure Leskovec | contact: hyren@cs.stanford.edu
keywords: knowledge graph embeddings

Learning Self-Correctable Policies and Value Functions from Demonstrations with Negative Sampling

paper

Yuping Luo, Huazhe Xu, Tengyu Ma | contact: roosephu@gmail.com
keywords: imitation learning, model-based imitation learning, model-based rl, behavior cloning, covariate shift

Improved Sample Complexities for Deep Neural Networks and Robust Classification via an All-Layer Margin

paper

Colin Wei, Tengyu Ma | contact: colinwei@stanford.edu
keywords: deep learning theory, generalization bounds, adversarially robust generalization, data-dependent generalization bounds

Selection via Proxy: Efficient Data Selection for Deep Learning

paper | blog post | code

Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia | contact: cody@cs.stanford.edu
keywords: active learning, data selection, deep learning


We look forward to seeing you at ICLR!

Read More

Automating Data Augmentation: Practice, Theory and New Direction

Data augmentation is a de facto technique used in nearly every state-of-the-art machine learning model in applications such as image and text classification. Heuristic data augmentation schemes are often tuned manually by human experts with extensive domain knowledge, and may result in suboptimal augmentation policies. In this blog post, we provide a broad overview of recent efforts in this exciting research area, which have resulted in new algorithms for automating the search for transformation functions, new theoretical insights that improve our understanding of augmentation techniques commonly used in practice, and a new framework for exploiting data augmentation to patch a flawed model and improve performance on crucial subpopulations of data.

Why Data Augmentation?

Modern machine learning models, such as deep neural networks, may have billions of parameters and require massive labeled training datasets—which are often not available. The technique of artificially expanding labeled training datasets—known as data augmentation—has quickly become critical for combating this data scarcity problem. Today, data augmentation is used as a secret sauce in nearly every state-of-the-art model for image classification, and is becoming increasingly common in other modalities such as natural language understanding as well. The goal of this blog post is to provide an overview of recent efforts in this exciting research area.

Figure 1. Heuristic data augmentations apply a deterministic sequence of transformation functions tuned by human experts. The augmented data will be used for training downstream models.

Heuristic data augmentation schemes often rely on the composition of a set of simple transformation functions (TFs) such as rotations and flips (see Figure 1). When chosen carefully, data augmentation schemes tuned by human experts can improve model performance. However, such heuristic strategies in practice can cause large variances in end model performance, and may not produce augmentations needed for state-of-the-art models.

The Open Challenges in Data Augmentation

The limitations of conventional data augmentation approaches reveal huge opportunities for research advances. Below we summarize a few challenges that motivate some of the works in the area of data augmentation.

  • From manual to automated search algorithms: As opposed to performing suboptimal manual search, how can we design learnable algorithms to find augmentation strategies that can outperform human-designed heuristics?

  • From practical to theoretical understanding: Despite the rapid progress of creating various augmentation approaches pragmatically, understanding their benefits remains a mystery because of a lack of analytic tools. How can we theoretically understand various data augmentations used in practice?

  • From coarse-grained to fine-grained model quality assurance: While most existing data augmentation approaches focus on improving the overall performance of a model, it is often imperative to have a finer-grained perspective on critical subpopulations of data. When a model exhibits inconsistent predictions on important subgroups of data, how can we exploit data augmentations to mitigate the performance gap in a prescribed way?

In this blog, we will describe ideas and recent research works leading the way to overcome these challenges above.

Practical Methods of Learnable Data Augmentations

Learnable data augmentation is promising, in that it allows us to search for more powerful parameterizations and compositions of transformations. Perhaps the biggest difficulty with automating data augmentation is how to search over the space of transformations. This can be prohibitive due to the large number of transformation functions and associated parameters in the search space. How can we design learnable algorithms that explore the space of transformation functions efficiently and effectively, and find augmentation strategies that can outperform human-designed heuristics? In response to the challenge, we highlight a few recent methods below.

TANDA: Transformation Adversarial Networks for Data Augmentations

To address this problem, TANDA (Ratner et al. 2017) proposes a framework for learning augmentations, which models data augmentation as sequences of Transformation Functions (TFs) provided by users. For example, these might include “rotate 5 degrees” or “shift by 2 pixels”. At its core, this framework consists of two components: (1) learning a TF sequence generator that results in useful augmented data points, and (2) using the sequence generator to augment training sets for a downstream model. In particular, the TF sequence generator is trained to produce realistic images by having to fool a discriminator network, following the GAN framework (Goodfellow et al. 2014). The underlying assumption here is that the transformations either lead to realistic images or to indistinguishable garbage images that are off the manifold. As shown in Figure 2, the objective for the generator is to produce sequences of TFs such that the augmented data point can fool the discriminator, whereas the objective for the discriminator is to produce values close to 1 for data points in the original training set and values close to 0 for augmented data points.

Figure 2. Automating data augmentation with TANDA (Ratner et al. 2017). A TF sequence generator is trained adversarially to produce augmented images that are realistic compared to training data.
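To make the adversarial setup concrete, below is a minimal sketch of a TANDA-style training step, assuming a policy-gradient update for the (discrete) TF sequence generator. The transformation functions, network architectures, and variable names are illustrative placeholders, not the authors’ implementation.

    # Hypothetical sketch of a TANDA-style adversarial training step (not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # User-provided transformation functions; identity stands in for e.g. small rotations/shifts.
    TFS = [lambda x: x, lambda x: x]

    class TFSequenceGenerator(nn.Module):
        """Stochastic policy that samples a sequence of TF indices."""
        def __init__(self, num_tfs, seq_len=3):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(seq_len, num_tfs))
        def forward(self, batch_size):
            dist = torch.distributions.Categorical(logits=self.logits)
            seqs = dist.sample((batch_size,))       # (batch, seq_len) TF indices
            return seqs, dist.log_prob(seqs).sum(-1)

    def apply_tf_sequence(x, seq):
        for idx in seq.tolist():
            x = TFS[idx](x)
        return x

    discriminator = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1))
    generator = TFSequenceGenerator(num_tfs=len(TFS))
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

    real = torch.randn(16, 1, 28, 28)               # stand-in for a batch of training images
    seqs, logp = generator(real.size(0))
    augmented = torch.stack([apply_tf_sequence(x, s) for x, s in zip(real, seqs)])

    # Discriminator: values close to 1 for original data, close to 0 for augmented data.
    d_loss = F.binary_cross_entropy_with_logits(discriminator(real), torch.ones(16, 1)) + \
             F.binary_cross_entropy_with_logits(discriminator(augmented), torch.zeros(16, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: TF choices are discrete, so use a REINFORCE-style update where the
    # reward is how "real" the discriminator finds the augmented points.
    with torch.no_grad():
        reward = torch.sigmoid(discriminator(augmented)).squeeze(1)
    g_loss = -(logp * reward).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()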

AutoAugment and Further Improvement

Using a similar framework, AutoAugment (Cubuk et al. 2018), developed by Google, demonstrated state-of-the-art performance using learned augmentation policies. In this work, a TF sequence generator learns to directly optimize for validation accuracy of the end model. Several subsequent works, including RandAugment (Cubuk et al. 2019) and Adversarial AutoAugment (Zhang et al. 2019), have been proposed to reduce the computational cost of AutoAugment, establishing new state-of-the-art performance on image classification benchmarks.
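For a sense of how small the reduced search space can be, here is a rough sketch of a RandAugment-style policy with only two hyperparameters: the number of transformations applied and one global magnitude. The transformation functions are hypothetical placeholders, not the library’s actual operations.

    import random

    # Hypothetical transformation pool; each placeholder takes (image, magnitude in [0, 1]).
    def rotate(img, m): return img
    def translate_x(img, m): return img
    def adjust_contrast(img, m): return img

    TRANSFORMS = [rotate, translate_x, adjust_contrast]

    def rand_augment(img, n=2, magnitude=9, max_magnitude=30):
        """RandAugment-style policy: apply n uniformly sampled TFs at one shared magnitude."""
        m = magnitude / max_magnitude
        for tf in random.choices(TRANSFORMS, k=n):   # sample with replacement
            img = tf(img, m)
        return img

    # Usage: augmented = rand_augment(raw_image); n and magnitude are tuned on validation accuracy.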

Theoretical Understanding of Data Augmentations

Despite the rapid progress of practical data augmentation techniques, precisely understanding their benefits remains a mystery. Even for simpler models, it is not well-understood how training on augmented data affects the learning process, the parameters, and the decision surface. This is exacerbated by the fact that data augmentation is performed in diverse ways in modern machine learning pipelines, for different tasks and domains, thus precluding a general model of transformation. How can we theoretically characterize and understand the effect of various data augmentations used in practice? To address this challenge, our lab has studied data augmentation from a kernel perspective, as well as under a simplified linear setting.

Data Augmentation As a Kernel

Dao et al. 2019 developed a theoretical framework by modeling data augmentation as a Markov Chain, in which augmentation is performed via a random sequence of transformations, akin to how data augmentation is performed in practice. We show that the effect of applying the Markov Chain on the training dataset (combined with a k-nearest neighbor classifier) is akin to using a kernel classifier, where the kernel is a function of the base transformations.

Building on the connection between kernel theory and data augmentation, Dao et al. 2019 show that a kernel classifier on augmented data approximately decomposes into two components: (i) an averaged version of the transformed features, and (ii) a data-dependent variance regularization term. This suggests a more nuanced explanation of data augmentation—namely, that it improves generalization both by inducing invariance and by reducing model complexity. Dao et al. 2019 validate the quality of this approximation empirically, and draw connections to other generalization-improving techniques, including recent work on invariant learning (van der Wilk et al. 2018) and robust optimization (Namkoong & Duchi, 2017).
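The decomposition can be illustrated numerically: averaging a feature map over sampled transformations yields the invariance-inducing term (i), while the spread of those transformed features yields the variance-regularization term (ii). The feature map and transformation below are hypothetical stand-ins, not the kernels studied in the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def feature_map(x):
        """Hypothetical feature map, standing in for the kernel's feature space."""
        return np.concatenate([x, x ** 2])

    def random_transform(x):
        """Hypothetical label-preserving transformation, e.g. a small random perturbation."""
        return x + 0.01 * rng.normal(size=x.shape)

    def augmentation_terms(x, num_samples=100):
        """Approximate the two terms from the kernel view of augmentation:
        (i) the feature map averaged over transformed copies of x, and
        (ii) a data-dependent variance term that acts like a regularizer."""
        feats = np.stack([feature_map(random_transform(x)) for _ in range(num_samples)])
        averaged = feats.mean(axis=0)               # term (i): invariance by averaging
        variance_penalty = feats.var(axis=0).sum()  # term (ii): variance regularization
        return averaged, variance_penalty

    averaged_phi, penalty = augmentation_terms(rng.normal(size=8))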

Data Augmentation Under A Simplified Linear Setting

One limitation of the above works is that it is challenging to pin down the effect of applying a particular transformation on the resulting kernel. Furthermore, it is not yet clear how to apply data augmentation efficiently on kernel methods to get comparable performance to neural nets. In more recent work, we consider a simpler linear setting that is capable of modeling a wide range of linear transformations commonly used in image augmentation, as shown in Figure 3.

Theoretical Insights. We offer several theoretical insights by considering an over-parametrized linear model, where the training data lies in a low-dimensional subspace. We show that label-invariant transformations can add new information to the training data, and that the estimation error of the ridge estimator can be reduced by adding new points that are outside the span of the training data. In addition, we show that mixup (Zhang et al., 2017) has a regularization effect: it shrinks the weight of the training data relative to the L2 regularization term.

Figure 3. Illustration of common linear transformations applied in data augmentation.
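As a point of reference, mixup forms each augmented example as a convex combination of two training points and their labels, with the mixing weight drawn from a Beta distribution; a minimal sketch under those conventions:

    import numpy as np

    rng = np.random.default_rng(0)

    def mixup(x1, y1, x2, y2, alpha=0.2):
        """Mixup: convex combination of two examples and their one-hot labels."""
        lam = rng.beta(alpha, alpha)
        return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

    # Usage: mix two (input, one-hot label) pairs drawn at random from the training set.
    x_mix, y_mix = mixup(np.ones(4), np.array([1.0, 0.0]),
                         np.zeros(4), np.array([0.0, 1.0]))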

Theory-inspired New State-of-the-art. One insight from our theoretical investigation is that different (compositions of) transformations show very different end performance. Inspired by this observation, we’d like to exploit the fact that certain transformations perform better than others. We propose an uncertainty-based random sampling scheme which, among the transformed data points, picks those with the highest losses, i.e. those “providing the most information” (see Figure 4). Compared to RandAugment, our sampling scheme finds more useful transformations and achieves higher accuracy on three different CNN architectures, establishing new state-of-the-art performance on common benchmarks. For example, our method outperforms RandAugment by 0.59% on CIFAR-10 and 1.24% on CIFAR-100 using Wide-ResNet-28-10. Please check out our full paper here. Our code will be released soon for you to try out!

Figure 4. Uncertainty-based random sampling scheme for data augmentation. Each transformation function is randomly sampled from a set of pre-specified operations. We select among the transformed data points with highest loss for end model training.
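A rough sketch of the selection step described above, assuming a PyTorch classifier and a list of candidate transformation functions; all names here are illustrative, not the implementation described in the paper. For each example, we sample several transformed copies and keep the copy with the highest loss under the current model.

    import torch
    import torch.nn.functional as F

    def select_informative_augmentation(model, x, y, transforms, num_candidates=4):
        """Keep, per example, the transformed copy with the highest loss ('most informative')."""
        candidates, losses = [], []
        for _ in range(num_candidates):
            tf = transforms[torch.randint(len(transforms), (1,)).item()]
            x_aug = tf(x)
            with torch.no_grad():
                losses.append(F.cross_entropy(model(x_aug), y, reduction="none"))
            candidates.append(x_aug)
        losses = torch.stack(losses)               # (num_candidates, batch)
        best = losses.argmax(dim=0)                # index of the highest-loss copy per example
        stacked = torch.stack(candidates)          # (num_candidates, batch, ...)
        return stacked[best, torch.arange(x.size(0))]

    # Tiny demo with a dummy model and toy "transforms"; the selected batch replaces x in training.
    model = torch.nn.Linear(8, 3)
    transforms = [lambda t: t, lambda t: t + 0.1 * torch.randn_like(t)]
    x, y = torch.randn(16, 8), torch.randint(0, 3, (16,))
    x_selected = select_informative_augmentation(model, x, y, transforms)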

New Direction: Data Augmentations for Model Patching

Most machine learning research carried out today is still solving fixed tasks. However, in the real world, machine learning models in deployment can fail due to unanticipated changes in data distribution. This raises the concerning question of how we can move from model building to model maintenance in an adaptive manner. In our latest work, we propose model patching—the first framework that exploits data augmentation to mitigate the performance issues of a flawed model in deployment.

A Medical Use Case of Model Patching

To provide a concrete example, in skin cancer detection, researchers have shown that standard classifiers have drastically different performance on two subgroups of the cancerous class, due to the classifier’s association of colorful bandages with benign images (see Figure 5, left). This subgroup performance gap has also been studied in parallel research from our group (Oakden-Rayner et al., 2019), and arises from the classifier’s reliance on subgroup-specific features, e.g. colorful bandages.

Figure 5: A standard model trained on a skin cancer dataset exhibits a subgroup performance gap between images of malignant cancers with and without colored bandages. GradCAM illustrates that the vanilla model spuriously associates the colored spot with benign skin lesions. With model patching, the malignancy is predicted correctly for both subgroups.

In order to fix such flaws in a deployed model, domain experts have to resort to manual data cleaning to erase the differences between subgroups, e.g. removing markings on skin cancer images with Photoshop (Winkler et al. 2019), and then retrain the model on the modified data. This can be extremely laborious! Can we instead learn transformations that augment examples so as to balance the populations among subgroups in a prescribed way? This is exactly what we address with the new framework of model patching.

CLAMP: Class-conditional Learned Augmentations for Model Patching

The conceptual framework of model patching consists of two stages (as shown in Figure 6).

  • Learn inter-subgroup transformations between different subgroups. These transformations are class-preserving maps that allow semantically changing a datapoint’s subgroup identity (e.g. add or remove colorful bandages).
  • Retrain to patch the model with augmented data, encouraging the classifier to be robust to subgroup variations.

Figure 6: Model Patching framework with data augmentation. The highlighted box contains samples from a class with differing performance between subgroups A and B. Conditional generative models are trained to transform examples from one subgroup to another (A->B and B->A) respectively.

We propose CLAMP, an instantiation of our first end-to-end model patching framework. We combine a novel consistency regularizer with a robust training objective inspired by recent work on Group Distributionally Robust Optimization (GDRO; Sagawa et al. 2019). We extend GDRO to a class-conditional training objective that jointly optimizes for the worst-subgroup performance in each class. CLAMP is able to balance the performance of subgroups within each class, reducing the performance gap by up to 24x. On the ISIC skin cancer detection dataset, CLAMP improves robust accuracy by 11.7% compared to the robust training baseline. Through visualization, we also show in Figure 5 that CLAMP successfully removes the model’s reliance on the spurious feature (colorful bandages), shifting its attention to the skin lesion, the true feature of interest.
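A minimal sketch of a class-conditional worst-subgroup objective in the spirit described above (a simplified reading of the GDRO-style loss, not the exact CLAMP objective): within each class, average the loss over every subgroup and backpropagate through the worst one.

    import torch
    import torch.nn.functional as F

    def class_conditional_worst_subgroup_loss(logits, labels, subgroups):
        """For each class, take the mean loss of its worst-performing subgroup,
        then average those worst-case losses over classes."""
        per_example = F.cross_entropy(logits, labels, reduction="none")
        worst_per_class = []
        for c in labels.unique():
            in_class = labels == c
            subgroup_means = [per_example[in_class & (subgroups == g)].mean()
                              for g in subgroups[in_class].unique()]
            worst_per_class.append(torch.stack(subgroup_means).max())
        return torch.stack(worst_per_class).mean()

    # Usage: subgroups might encode "with colorful bandage" vs. "without" for each image.
    logits = torch.randn(32, 2, requires_grad=True)
    labels = torch.randint(0, 2, (32,))
    subgroups = torch.randint(0, 2, (32,))
    class_conditional_worst_subgroup_loss(logits, labels, subgroups).backward()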

Our results suggest that the model patching framework is a promising direction for automating the process of model maintenance. In fact, model patching is emerging as an area that could alleviate major problems in safety-critical systems, including healthcare (e.g. improving models to produce MRI scans free of artifacts) and autonomous driving (e.g. improving perception models that may perform poorly on irregular objects or road conditions). We envision that model patching can be widely useful for many other domain applications. If you are intrigued by the latest research on model patching, please follow our Hazy Research repository on GitHub, where the code will be released soon. If you have any feedback on our drafts and latest work, we’d love to hear from you!

Acknowledgments

Thanks to members of Hazy Research who provided feedback on the blog post. Special thanks to Sidd Karamcheti and Andrey Kurenkov from the SAIL blog team for the editorial help.

About the Author

Sharon Y. Li is a postdoctoral fellow at Stanford, working with Chris Ré. She is an incoming Assistant Professor in the department of Computer Sciences at University of Wisconsin-Madison. Her research focuses on developing machine learning models and systems that can reduce human supervision during training, and enhance reliability during deployment in the wild.

Read More

Six from MIT elected to American Academy of Arts and Sciences for 2020

Six MIT faculty members are among more than 250 leaders from academia, business, public affairs, the humanities, and the arts elected to the American Academy of Arts and Sciences, the academy announced Thursday.

One of the nation’s most prestigious honorary societies, the academy is also a leading center for independent policy research. Members contribute to academy publications, as well as studies of science and technology policy, energy and global security, social policy and American institutions, the humanities and culture, and education.

Those elected from MIT this year are:

  • Robert C. Armstrong, Chevron Professor in Chemical Engineering;
  • Dave L. Donaldson, professor of economics;
  • Catherine L. Drennan, professor of biology and chemistry;
  • Ronitt Rubinfeld, professor of electrical engineering and computer science;
  • Joshua B. Tenenbaum, professor of brain and cognitive sciences; and
  • Craig Steven Wilder, Barton L. Weller Professor of History.

“The members of the class of 2020 have excelled in laboratories and lecture halls, they have amazed on concert stages and in surgical suites, and they have led in board rooms and courtrooms,” said academy President David W. Oxtoby. “With today’s election announcement, these new members are united by a place in history and by an opportunity to shape the future through the academy’s work to advance the public good.”

Since its founding in 1780, the academy has elected leading thinkers from each generation, including George Washington and Benjamin Franklin in the 18th century, Maria Mitchell and Daniel Webster in the 19th century, and Toni Morrison and Albert Einstein in the 20th century. The current membership includes more than 250 Nobel and Pulitzer Prize winners.

Read More

Reporting tool aims to balance hospitals’ Covid-19 load

As cases of Covid-19 continue to climb in parts of the United States, the number of people seeking treatment is threatening to overwhelm many hospitals, forcing some facilities to ration their care and reserve ventilators, hospital beds, and other limited medical resources for the sickest patients. 

Having a handle on local hospitals’ capacity and resource availability could help balance the load of Covid-19 patients requiring hospitalization across a region, for instance allowing an EMT to send a patient to a facility where they are more likely to be treated quickly. But many states lack real-time data on their current capacity to treat Covid-19 patients. 

A group of researchers in MIT’s Computer Science and Intelligence Laboratory (CSAIL), working with the MIT spinoff Mobi Systems, are aiming to help level demand across the entire health care network by providing real-time updates of hospital resources, which they hope will help patients, EMTs, and physicians quickly decide which facility is best equipped to handle a new patient at any given time. 

The team has developed a web app which is now publicly accessible at: https://Covid19hospitalstatus.com. The interface allows users such as patients, nurses, and doctors to report a hospital’s current status in a number of metrics, from the average wait time (something that a patient may get a sense for as they spend time in a waiting room), to the number of ventilators and ICU beds, which doctors and nurses may be able to approximate.

EMTs can use the app as a map, zooming in by state, county, or city to quickly gauge hospital capacity, and decide which nearby hospitals have available beds where they can send a patient requiring hospitalization. The app can also generate a list of hospitals, prioritized by availability, time of travel, and most recently updated data. 

“We want to flatten the Covid curve by physical distancing over the course of months,” says MIT graduate Anna Jaffe ’07, CEO of Mobi Systems. “But there’s another curve to flatten, which is this real-time challenge of getting the right patient to the right hospital, in the right moment, to level the load on hospitals and health care workers.”

“Do something”

As the pandemic began to unfold around the world, Jaffe was intrigued by the results of a short hackathon that one Mobi member, Julius Pätzold, recently attended in Germany. The weekend challenge, sponsored by the German government, included a problem to match supply and demand, for instance in a hospital facing a surge in patient visits. 

His team mapped the German hospital infrastructure, including the status of individual hospitals’ capacity, then simulated dispatching patients to hospitals according to a hospital’s capacity, its relative location to a patient, and a patient’s medical needs. The real-time maps developed over this short time suggested such tools would have a positive impact on a patient’s quality of care, specifically in decreasing death rates.

“That intersected with my feeling that I think everyone wants to do something around Covid-19 in response to the current crisis, and not just be cooped up in our respective homes,” says Jaffe, whose company, Mobi Systems, develops tools for large-scale network optimization problems surrounding mobility and hospitality. 

Mobi originally grew out of CSAIL’s Model-based Embedded Robotic Systems group, led by MIT Professor Brian Williams, whose work involves developing autonomous planning tools to help individuals make complex, real-time decisions in the face of uncertainty and risk. 

Jaffe reached out to Williams to help develop a web-based reporting tool for hospitals, to similarly help patients and medical professionals make critical, real-time decisions of where best to send a patient, based on resource availability. 

“Our question was, how can the resources statewide or nationwide be used most effectively, in order to keep the most people healthy,” Williams says. “And for the individual, which hospital will meet their needs, and how do they get there. That’s the exercise we’re tackling here.”

Crowd power

The team’s app is heavily dependent on crowdsourced data, and the willingness of patients and medical professionals to report on various metrics, from a hospital’s current wait time to the approximate number of ICU beds and ventilators available. 

“The reporting options right now are very specific,” Jaffe says. “But what we really want to know is, can your hospital accept a patient right now?” 

A user can enter their role — patient, nurse, or physician — then report on, for instance, a hospital’s average wait time. With a sliding scale, they can rate their confidence in their report before submitting it. 

But what if those users are reporting false or inaccurate data, whether intentionally or not? 

Williams says in order to guard against such uncertainty, the team takes a probabilistic approach. For instance, the app assumes that one user’s reporting of a hospital’s status is one of low confidence, which is initially not weighed heavily in the overall estimation for that metric. They can then incorporate this one data point into all the other reports they’ve received for that metric. If most of those reports have also been rated with low confidence, but report the same result, that estimate, such as of wait time, is automatically weighed more heavily, and therefore rated at a higher confidence overall.  
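One simple way to realize this kind of weighting (a hypothetical sketch, not the actual Mobi/CSAIL implementation) is a confidence-weighted running estimate: each report nudges the estimate in proportion to its self-rated confidence, and the accumulated confidence serves as the overall trust in the metric.

    from dataclasses import dataclass

    @dataclass
    class MetricEstimate:
        """Running estimate of one hospital metric, e.g. wait time in minutes."""
        value: float = 0.0
        weight: float = 0.0   # accumulated confidence, an effective number of trusted reports

        def add_report(self, reported_value: float, confidence: float) -> None:
            """Fold in a report, weighted by the reporter's confidence (0 to 1)."""
            total = self.weight + confidence
            if total > 0:
                self.value = (self.value * self.weight + reported_value * confidence) / total
            self.weight = total

    # Several low-confidence reports that agree accumulate weight; a high-confidence source
    # (e.g. hospital-provided data) pulls the estimate strongly, "swamping out" weak reports.
    wait_time = MetricEstimate()
    wait_time.add_report(90, confidence=0.2)
    wait_time.add_report(85, confidence=0.3)
    wait_time.add_report(30, confidence=0.9)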

Additionally, he says if the app receives reports from more trusted sources — for instance, if hospitals make in-house, aggregated data available to the app — those sources would “swamp out” or take higher priority over low-confidence reports of the same metric. 

The team is testing the app with just such a trustworthy dataset, from the state of Pennsylvania, which for the last several years has had a system in place for hospitals to report resource availability that is updated at least twice a day. The team has used data from the last week to track Covid-19 visits across the state’s hospital system.

“In this data, you can see that not all hospitals are overrun — there are clear differences in availability,” says MIT graduate Peng Yu SM ’13, PhD ’17, chief technology officer at Mobi, highlighting the potential for distributing patients across a region’s hospitals, to balance resources across a hospital network. 

However, most states lack such aggregated, updated information. In most other states, for instance, EMTs either have a handful of default facilities where they typically send patients, or they have to call around to surrounding hospitals to check availability. 

“It’s really about word of mouth — who do you know, and who do you call up,” says Williams, whose nephew is an EMT who has worked in regions with varying decision-making practices. “We’re trying to aggregate that information, to make these recommendations much faster.”

The team is now reaching out to thousands of medical professionals to test-drive the reporting tool, in hopes of boosting the crowdsourcing component for the app, which is now available on any internet-enabled device. To address the pandemic, the team believes that data need to be made available at a faster rate than the virus’ spread. Their hope is that states will follow in Pennsylvania’s footsteps and, for instance, mandate that hospitals report resource data, and provide reporting tools such as the new app to doctors and EMTs. 

“This project is very much for the people, by the people, and will be kept open and free,” Williams says.  

“Unfortunately, it doesn’t feel like this is a flash pandemic,” Jaffe says. “Even in a recovery period, hospitals will have to resume normal care, concurrently with treating Covid-19 over time. Our app may help load balance in that way as well, so hospitals can more effectively predict how many floors they need to quarantine for Covid-19, so that the rest of the hospital can go back to things like having families around a mother giving birth. We aim to really understand how to bring things back to a more normal operational status, while still handling the crisis.”

Read More

Shedding light on complex power systems

Marija Ilic — a senior research scientist at the Laboratory for Information and Decision Systems, affiliate of the MIT Institute for Data, Systems, and Society, senior staff in MIT Lincoln Laboratory’s Energy Systems Group, and Carnegie Mellon University professor emerita — is a researcher on a mission: making electric energy systems future-ready.

Since the earliest days of streetcars and public utilities, electric power systems have had a fairly standard structure: for a given area, a few large generation plants produce and distribute electricity to customers. It is a one-directional structure, with the energy plants being the only source of power for many end users.

Today, however, electricity can be generated from many and varied sources — and move through the system in multiple directions. An electric power system may include stands of huge turbines capturing wild ocean winds, for instance. There might be solar farms of a hundred megawatts or more, or houses with solar panels on their roofs that some days make more electricity than occupants need, some days much less. And there are electric cars, their batteries hoarding stored energy overnight. Users may draw electricity from one source or another, or feed it back into the system, all at the same time. Add to that the trend toward open electricity markets, where end users like households can pick and choose the electricity services they buy depending on their needs. How should systems operators integrate all these while keeping the grid stable and ensuring power gets to where it is needed?

To explore this question, Ilic has developed a new way to model complex power systems.

Electric power systems, even traditional ones, are complex and heterogeneous to begin with. They cover wide geographical areas and have legal and political barriers to contend with, such as state borders and energy policies. In addition, all electric power systems have inherent physical limitations. For instance, power does not flow in a set path in an electric grid, but rather along all possible paths connecting supply to demand. To maintain grid stability and quality of service, then, the system must control for the impact of interconnections: a change in supply and demand at one point in a system changes supply and demand for the other points in the system. This means there is much more complexity to manage as new sources of energy (more interconnections) with sometimes unpredictable supply (such as wind or solar power) come into play. Ultimately, however, to maintain stability and quality of service, and to balance supply and demand within the system, it comes down to a relatively simple concept: the power consumed and the rate at which it is consumed (plus whatever is lost along the way) must always equal the power produced and the rate at which it is produced.

Using this simpler concept to manage the complexities and limitations of electric power systems, Ilic is taking a non-traditional approach: She models the systems using information about energy, power, and ramp rate (the rate at which power can increase over time) for each part of the system — distributing decision-making calculations into smaller operational chunks. Doing this streamlines the model but retains information about the system’s physical and temporal structure. “That’s the minimal information you need to exchange. It’s simple and technology-agnostic, but we don’t teach systems that way.”

She believes regulatory organizations such as the Federal Energy Regulatory Commission and North American Energy Reliability Corporation should have standard protocols for such information exchanges, just as internet protocols govern how data is exchanged on the internet. “If you were to [use a standard set of] specifications like: what is your capacity, how much does it vary over time, how much energy do you need and within what power range — the system operator could integrate different sources in a much simpler way than we are doing now.” 
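As a hypothetical illustration of what such a standardized exchange might carry, the specifications Ilic lists could be encoded in a small, technology-agnostic record that every resource, from a rooftop solar panel to a gas plant, reports to the system operator. The field names below are illustrative, not an actual FERC or NERC standard.

    from dataclasses import dataclass

    @dataclass
    class ResourceSpec:
        """Minimal information a resource exchanges with the system operator."""
        name: str
        capacity_mw: float            # maximum power the resource can supply
        variability_mw: float         # how much that capacity varies over time
        energy_need_mwh: float        # energy the resource needs to draw (0 for pure generators)
        power_range_mw: tuple         # (min, max) power at which it can operate
        ramp_rate_mw_per_min: float   # how fast its output can change

    # The same record describes very different technologies:
    rooftop_solar = ResourceSpec("rooftop solar", 0.005, 0.004, 0.0, (0.0, 0.005), 0.001)
    gas_plant = ResourceSpec("gas plant", 200.0, 5.0, 0.0, (50.0, 200.0), 10.0)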

Another important aspect of Ilic’s work is that her models lend themselves to controlling the system with a layer of sensor and communications technologies, using a framework she developed called the Dynamic Monitoring and Decision Systems (DyMonDS) framework. The data-enabled decision-making concept has been tested using real data from Portugal’s Azores Islands and has since been applied to real-world challenges. After many years of development, her new modeling approach fittingly supports DyMonDS design, including systematic use of many theoretical concepts used by the LIDS community in their research.

One such challenge included work on Puerto Rico’s power grid. Ilic was the technical lead on a Lincoln Laboratory project on designing future architectures and software to make Puerto Rico’s electric power grid more resilient without adding much more production capacity or cost. Typically, a power grid’s generation capacity is scheduled in a simple, brute-force way, based on weather forecasts and the hottest and coldest days of the year, that doesn’t respond sensitively to real-time needs. Making such a system more resilient would mean spending a lot more on generation and transmission and distribution capacity, whereas a more dynamic system that integrates distributed microgrids could tame the cost, Ilic says: “What we are trying to do is to have systematic frameworks for embedding intelligence into small microgrids serving communities, and having them interact with large-scale power grids. People are realizing that you can make many small microgrids to serve communities rather than relying only on large scale electrical power generation.”

Although this is one of Ilic’s most recent projects, her work on DyMonDS can be traced back four decades, to when she was a student at the University of Belgrade in the former country of Yugoslavia, which sent her to the United States to learn how to use computers to prevent blackouts.

She ended up at Washington University in St. Louis, Missouri, studying with applied mathematician John Zaborszky, a legend in the field who was originally chief engineer of Budapest’s municipal power system before moving to the United States. (“The legend goes that in the morning he would teach courses, and in the afternoon he would go and operate Hungarian power system protection by hand.”) Under Zaborszky, a systems and control expert, Ilic learned to think in abstract terms as well as in terms of physical power systems and technologies. She became fascinated by the question of how to model, simulate, monitor, and control power systems — and that’s where she’s been ever since. (Although, she admits as she uncoils to her full height from behind her desk, her first love was actually playing basketball.)

Ilic first arrived at MIT in 1987 to work with the late professor Fred Schweppe on connecting electricity technologies with electricity markets. She stayed on as a senior research scientist until 2002, when she moved to Carnegie Mellon University (CMU) to lead the multidisciplinary Electric Energy Systems Group there. In 2018, after her consulting work for Lincoln Lab ramped up, she retired from CMU to move back to the familiar environs of Cambridge, Massachusetts. CMU’s loss has been MIT’s gain: In fall 2019, Ilic taught a course in modeling, simulation, and control of electric energy systems, applying her work on streamlined models that use pared-down information.

Addressing the evolving needs of electric power systems has not been a “hot” topic, historically. Traditional power systems are often seen by the academic community as legacy technology with no fundamentally new developments. And yet when new software and systems are developed to help integrate distributed energy generation and storage, commercial systems operators regard them as untested and disruptive. “I’ve always been a bit on the sidelines from mainstream power and electrical engineering because I’m interested in some of these things,” she remarks.

However, Ilic’s work is becoming increasingly urgent. Much of today’s power system is physically very old and will need to be retired and replaced over the next decade. This presents an opportunity for innovation: the next generation of electric energy systems could be built to integrate renewable and distributed energy resources at scale — addressing the pressing challenge of climate change and making way for further progress.

“That’s why I’m still working, even though I should be retired.” She smiles. “It supports the evolution of the system to something better.”

Read More

Reducing the carbon footprint of artificial intelligence

Artificial intelligence has become a focus of certain ethical concerns, but it also has some major sustainability issues. 

Last June, researchers at the University of Massachusetts at Amherst released a startling report estimating that the amount of power required for training and searching a certain neural network architecture involves the emissions of roughly 626,000 pounds of carbon dioxide. That’s equivalent to nearly five times the lifetime emissions of the average U.S. car, including its manufacturing.

This issue gets even more severe in the model deployment phase, where deep neural networks need to be deployed on diverse hardware platforms, each with different properties and computational resources. 

MIT researchers have developed a new automated AI system for training and running certain neural networks. Results indicate that, by improving the computational efficiency of the system in some key ways, the system can cut down the pounds of carbon emissions involved — in some cases, down to low triple digits. 

The researchers’ system, which they call a once-for-all network, trains one large neural network comprising many pretrained subnetworks of different sizes that can be tailored to diverse hardware platforms without retraining. This dramatically reduces the energy usually required to train each specialized neural network for new platforms — which can include billions of internet of things (IoT) devices. Using the system to train a computer-vision model, they estimated that the process required roughly 1/1,300 the carbon emissions compared to today’s state-of-the-art neural architecture search approaches, while reducing the inference time by 1.5-2.6 times. 

“The aim is smaller, greener neural networks,” says Song Han, an assistant professor in the Department of Electrical Engineering and Computer Science. “Searching efficient neural network architectures has until now had a huge carbon footprint. But we reduced that footprint by orders of magnitude with these new methods.”

The work was carried out on Satori, an efficient computing cluster donated to MIT by IBM that is capable of performing 2 quadrillion calculations per second. The paper is being presented next week at the International Conference on Learning Representations. Joining Han on the paper are four undergraduate and graduate students from EECS, MIT-IBM Watson AI Lab, and Shanghai Jiao Tong University. 

Creating a “once-for-all” network

The researchers built the system on a recent AI advance called AutoML (for automatic machine learning), which eliminates manual network design. Neural networks automatically search massive design spaces for network architectures tailored, for instance, to specific hardware platforms. But there’s still a training efficiency issue: Each model has to be selected then trained from scratch for its platform architecture. 

“How do we train all those networks efficiently for such a broad spectrum of devices — from a $10 IoT device to a $600 smartphone? Given the diversity of IoT devices, the computation cost of neural architecture search will explode,” Han says.   

The researchers invented an AutoML system that trains only a single, large “once-for-all” (OFA) network that serves as a “mother” network, nesting an extremely high number of subnetworks that are sparsely activated from the mother network. OFA shares all its learned weights with all subnetworks — meaning they come essentially pretrained. Thus, each subnetwork can operate independently at inference time without retraining. 

The team trained an OFA convolutional neural network (CNN) — commonly used for image-processing tasks — with versatile architectural configurations, including different numbers of layers and “neurons,” diverse filter sizes, and diverse input image resolutions. Given a specific platform, the system uses the OFA as the search space to find the best subnetwork based on the accuracy and latency tradeoffs that correlate to the platform’s power and speed limits. For an IoT device, for instance, the system will find a smaller subnetwork. For smartphones, it will select larger subnetworks, but with different structures depending on individual battery lifetimes and computation resources. OFA decouples model training and architecture search, and spreads the one-time training cost across many inference hardware platforms and resource constraints. 

This relies on a “progressive shrinking” algorithm that efficiently trains the OFA network to support all of the subnetworks simultaneously. It starts with training the full network with the maximum size, then progressively shrinks the sizes of the network to include smaller subnetworks. Smaller subnetworks are trained with the help of large subnetworks to grow together. In the end, all of the subnetworks with different sizes are supported, allowing fast specialization based on the platform’s power and speed limits. It supports many hardware devices with zero training cost when adding a new device.
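A highly simplified sketch of the two ideas above, progressive shrinking during training and budget-constrained subnetwork selection afterwards. The helper functions and configuration space are hypothetical placeholders, not the published OFA code.

    import random

    # Hypothetical configuration space of the once-for-all (OFA) network.
    DEPTHS, WIDTHS = [2, 3, 4], [0.5, 0.75, 1.0]
    KERNELS, RESOLUTIONS = [3, 5, 7], [128, 160, 192, 224]

    def train_step(config, batch):
        """Placeholder: one weight-shared training step on the sampled subnetwork."""
        pass

    def progressive_shrinking(batches, stages=((4,), (4, 3), (4, 3, 2))):
        """Train the largest subnetworks first, then progressively allow smaller ones,
        so small subnetworks are trained with the help of the large ones (depth only here)."""
        for allowed_depths in stages:
            for batch in batches:
                config = {"depth": random.choice(allowed_depths),
                          "width": random.choice(WIDTHS),
                          "kernel": random.choice(KERNELS),
                          "resolution": random.choice(RESOLUTIONS)}
                train_step(config, batch)

    def estimated_latency(config):
        """Placeholder for a per-device latency lookup table or predictor."""
        return config["depth"] * config["width"] * config["kernel"]

    def estimated_accuracy(config):
        """Placeholder for an accuracy predictor fit on sampled subnetworks."""
        return config["depth"] + config["width"] + config["resolution"] / 224

    def search_subnetwork(latency_budget, num_samples=1000):
        """Pick the most accurate subnetwork that fits the device's budget, with no retraining."""
        best, best_acc = None, float("-inf")
        for _ in range(num_samples):
            config = {"depth": random.choice(DEPTHS),
                      "width": random.choice(WIDTHS),
                      "kernel": random.choice(KERNELS),
                      "resolution": random.choice(RESOLUTIONS)}
            acc = estimated_accuracy(config)
            if estimated_latency(config) <= latency_budget and acc > best_acc:
                best, best_acc = config, acc
        return best

    # A tight budget (an IoT device) yields a small subnetwork; a looser budget (a smartphone)
    # yields a larger one; both are drawn from the same trained OFA weights.
    small, large = search_subnetwork(latency_budget=5), search_subnetwork(latency_budget=20)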
 
In total, one OFA, the researchers found, can comprise more than 10 quintillion — that’s a 1 followed by 19 zeroes — architectural settings, covering probably all platforms ever needed. But training the OFA and searching it ends up being far more efficient than spending hours training each neural network per platform. Moreover, OFA does not compromise accuracy or inference efficiency. Instead, it provides state-of-the-art ImageNet accuracy on mobile devices. And, compared with state-of-the-art industry-leading CNN models, the researchers say OFA provides 1.5-2.6 times speedup, with superior accuracy. 
    
“That’s a breakthrough technology,” Han says. “If we want to run powerful AI on consumer devices, we have to figure out how to shrink AI down to size.”

“The model is really compact. I am very excited to see OFA can keep pushing the boundary of efficient deep learning on edge devices,” says Chuang Gan, a researcher at the MIT-IBM Watson AI Lab and co-author of the paper.

“If rapid progress in AI is to continue, we need to reduce its environmental impact,” says John Cohn, an IBM fellow and member of the MIT-IBM Watson AI Lab. “The upside of developing methods to make AI models smaller and more efficient is that the models may also perform better.”

Read More

Making Decision Trees Accurate Again: Explaining What Explainable AI Did Not


The interpretability of neural networks is becoming increasingly necessary, as deep learning is being adopted in settings where accurate and justifiable predictions are required. These applications range from finance to medical imaging. However, deep neural networks are notorious for a lack of justification. Explainable AI (XAI) attempts to bridge this divide between accuracy and interpretability, but as we explain below, XAI justifies decisions without interpreting the model directly.