Federated Analytics: Collaborative Data Science without Data Collection

Federated Analytics: Collaborative Data Science without Data Collection

Posted by Daniel Ramage, Research Scientist and Stefano Mazzocchi, Software Engineer, Google Research

Federated learning, introduced in 2017, enables developers to train machine learning (ML) models across many devices without centralized data collection, ensuring that only the user has a copy of their data, and is used to power experiences like suggesting next words and expressions in Gboard for Android and improving the quality of smart replies in Android Messages. Following the success of these applications, there is a growing interest in using federated technologies to answer more basic questions about decentralized data — like computing counts or rates — that often don’t involve ML at all. Analyzing user behavior through these techniques can lead to better products, but it is essential to ensure that the underlying data remains private and secure.

Today we’re introducing federated analytics, the practice of applying data science methods to the analysis of raw data that is stored locally on users’ devices. Like federated learning, it works by running local computations over each device’s data, and only making the aggregated results — and never any data from a particular device — available to product engineers. Unlike federated learning, however, federated analytics aims to support basic data science needs. This post describes the basic methodologies of federated analytics that were developed in the pursuit of federated learning, how we extended those insights into new domains, and how recent advances in federated technologies enable better accuracy and privacy for a growing range of data science needs.

Origin of Federated Analytics
The first exploration into federated analytics was in support of federated learning: how can engineers measure the quality of federated learning models against real-world data when that data is not available in a data center? The answer was to re-use the federated learning infrastructure but without the learning part. In federated learning, the model definition can include not only the loss function that is to be optimized, but also code to compute metrics that indicate the quality of the model’s predictions. We could use this code to directly evaluate model quality on phones’ data.

As an example, Gboard engineers measured the overall quality of next word prediction models against raw typing data held on users’ phones. Participating phones downloaded a candidate model, locally computed a metric of how well the model’s predictions matched the words that were actually typed, and then uploaded the metric without any adjustment to the model’s weights or any change to the Gboard typing experience. By averaging the metrics uploaded by many phones, engineers learned a population-level summary of model performance. The technique also easily extended to estimate basic statistics like dataset sizes.

Federated Analytics for Song Recognition Measurement
Beyond model evaluation, federated analytics is used to support the Now Playing feature on Google’s Pixel phones, a tool that shows you what song is playing in the room around you. Under the hood, Now Playing uses an on-device database of song fingerprints to identify music playing near the phone without the need for a network connection. The architecture is good for privacy and for users — it is fast, works offline, and no raw or processed audio data leaves the phone. Because every phone in a region receives the same database, and only songs in the database can be recognized, it’s important for the database to hold the right songs.

To measure and improve each regional database quality, engineers needed to answer a basic question: which of its songs are most often recognized? Federated analytics provides an answer without revealing which songs are heard by any individual phone. It is enabled for users who agreed to send device related usage and diagnostics information to Google.

When Now Playing recognizes a song, it records the track name into the on-device Now Playing history, where users can see recently recognized songs and add them to a music app’s playlist. Later, when the phone is idle, plugged in, and connected to WiFi, Google’s federated learning and analytics server may invite the phone to join a “round” of federated analytics computation, along with several hundred other phones. Each phone in the round computes the recognition rate for the songs in its Now Playing History, and uses the secure aggregation protocol to encrypt the results. The encrypted rates are sent to the federated analytics server, which does not have the keys to decrypt them individually. But when combined with the encrypted counts from the other phones in the round, the final tally of all song counts (and nothing else) can be decrypted by the server.

The result enables Google engineers to improve the song database (for example, by making sure the database contains truly popular songs), without any phone revealing which songs were heard. In its first improvement iteration, this resulted in a 5% increase in overall song recognition across all Pixel phones globally.

Protecting Federated Analytics with Secure Aggregation
Secure aggregation can enable stronger privacy properties for federated analytics applications. For intuition about the secure aggregation protocol, consider a simpler version of the song recognition measurement problem. Let’s say that Rakshita wants to know how often her friends Emily and Zheng have listened to a particular song. Emily has heard it SEmily times and Zheng SZheng times, but neither is comfortable sharing their counts with Rakshita or each other. Instead, the trio could perform a secure aggregation: Emily and Zheng meet to decide on a random number M, which they keep secret from Rakshita. Emily reveals to Rakshita the sum SEmily + M, while Zheng reveals the difference SZheng M. Rakshita sees two numbers that are effectively random (they are masked by M), but she can add them together (SEmily + M) + (SZheng M) = SEmily + SZheng to reveal the total number of times that the song was heard by both Emily and Zheng.

The privacy properties of this approach can be strengthened by summing over more people or by adding small random values to the counts (e.g. in support of differential privacy). For Now Playing, song recognition rates from hundreds of devices are summed together, before the result is revealed to the engineers.

An illustration of the secure aggregation protocol, from the federated learning comic book.

Toward Learning and Analytics with Greater Privacy
The methods of federated analytics are an active area of research and already go beyond analyzing metrics and counts. Sometimes, training ML models with federated learning can be used for obtaining aggregate insights about on-device data, without any of the raw data leaving the devices. For example, Gboard engineers wanted to discover new words commonly typed by users and add them to dictionaries used for spell-checking and typing suggestions, all without being able to see any words that users typed. They did it by training a character-level recurrent neural network on phones, using only the words typed on these phones that were not already in the global dictionary. No typed words ever left the phones, but the resulting model could then be used in the datacenter to generate samples of frequently typed character sequences – the new words!

We are also developing techniques for answering even more ambiguous questions on decentralized datasets like “what patterns in the data are difficult for my model to recognize?” by training federated generative models. And we’re exploring ways to apply user-level differentially private model training to further ensure that these models do not encode information unique to any one user.

Google’s commitment to our privacy principles means pushing the state of the art in safeguarding user data, be it through differential privacy in the data center or advances in privacy during data collection. Google’s earliest system for decentralized data analysis, RAPPOR, was introduced in 2014, and we’ve learned a lot about making effective decisions even with a great deal of noise (often introduced for local differential privacy) since. Federated analytics continues this line of work.

It’s still early days for the federated analytics approach and more progress is needed to answer many common data science questions with good accuracy. The recent Advances and Open Problems in Federated Learning paper offers a comprehensive survey of federated research, while Federated Heavy Hitters Discovery with Differential Privacy introduces a federated analytics method for the discovery of most frequent items in the dataset. Federated analytics enables us to think about data science differently, with decentralized data and privacy-preserving aggregation in a central role. We welcome new contributions and extensions in this emerging field.

Acknowledgments
This post reflects the work of many people, including Blaise Agüera y Arcas, Galen Andrew, Sean Augenstein, Françoise Beaufays, Kallista Bonawitz, Mingqing Chen, Hubert Eichner, Úlfar Erlingsson, Christian Frank, Anna Goralska, Marco Gruteser, Alex Ingerman, Vladimir Ivanov, Peter Kairouz, Chloé Kiddon, Ben Kreuter, Alison Lentz, Wei Li, Xu Liu, Antonio Marcedone, Rajiv Mathews, Brendan McMahan, Tom Ouyang, Sarvar Patel, Swaroop Ramaswamy, Aaron Segal, Karn Seth, Haicheng Sun, Timon Van Overveldt, Sergei Vassilvitskii, Scott Wegner, Yuanbo Zhang, Li Zhang, and Wennan Zhu.

Evaluating Natural Language Generation with BLEURT

Evaluating Natural Language Generation with BLEURT

Posted by Thibault Sellam, Software Engineer, and Ankur P. Parikh, Research Scientist, Google Research

In the last few years, research in natural language generation (NLG) has made tremendous progress, with models now able to translate text, summarize articles, engage in conversation, and comment on pictures with unprecedented accuracy, using approaches with increasingly high levels of sophistication. Currently, there are two methods to evaluate these NLG systems: human evaluation and automatic metrics. With human evaluation, one runs a large-scale quality survey for each new version of a model using human annotators, but that approach can be prohibitively labor intensive. In contrast, one can use popular automatic metrics (e.g., BLEU), but these are oftentimes unreliable substitutes for human interpretation and judgement. The rapid progress of NLG and the drawbacks of existing evaluation methods calls for the development of novel ways to assess the quality and success of NLG systems.

In “BLEURT: Learning Robust Metrics for Text Generation” (presented during ACL 2020), we introduce a novel automatic metric that delivers ratings that are robust and reach an unprecedented level of quality, much closer to human annotation. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) builds upon recent advances in transfer learning to capture widespread linguistic phenomena, such as paraphrasing. The metric is available on Github.

Evaluating NLG Systems
In human evaluation, a piece of generated text is presented to annotators, who are tasked with assessing its quality with respect to its fluency and meaning. The text is typically shown side-by-side with a reference, authored by a human or mined from the Web.

An example questionnaire used for human evaluation in machine translation.

The advantage of this method is that it is accurate: people are still unrivaled when it comes to evaluating the quality of a piece of text. However, this method of evaluation can easily take days and involve dozens of people for just a few thousand examples, which disrupts the model development workflow.

In contrast, the idea behind automatic metrics is to provide a cheap, low-latency proxy for human-quality measurements. Automatic metrics often take two sentences as input, a candidate and a reference, and they return a score that indicates to what extent the former resembles the latter, typically using lexical overlap. A popular metric is BLEU, which counts the sequences of words in the candidate that also appear in the reference (the BLEU score is very similar to precision).

The advantages and weaknesses of automatic metrics are the opposite of those that come with human evaluation. Automatic metrics are convenient — they can be computed in real-time throughout the training process (e.g., for plotting with Tensorboard). However, they are often inaccurate due to their focus on surface-level similarities and they fail to capture the diversity of human language. Frequently, there are many perfectly valid sentences that can convey the same meaning. Overlap-based metrics that rely exclusively on lexical matches unfairly reward those that resemble the reference in their surface form, even if they do not accurately capture meaning, and penalize other paraphrases.

BLEU scores for three candidate sentences. Candidate 2 is semantically close to the reference, and yet its score is lower than Candidate 3.

Ideally, an evaluation method for NLG should combine the advantages of both human evaluation and automatic metrics — it should be relatively cheap to compute, but flexible enough to cope with linguistic diversity.

Introducing BLEURT
BLEURT is a novel, machine learning-based automatic metric that can capture non-trivial semantic similarities between sentences. It is trained on a public collection of ratings (the WMT Metrics Shared Task dataset) as well as additional ratings provided by the user.

Three candidate sentences rated by BLEURT. BLEURT captures that candidate 2 is similar to the reference, even though it contains more non-reference words than candidate 3.

Creating a metric based on machine learning poses a fundamental challenge: the metric should do well consistently on a wide range of tasks and domains, and over time. However, there is only a limited amount of training data. Indeed, public data is sparse — the WMT Metrics Task dataset, the largest collection of human ratings at the time of writing, contains ~260K human ratings covering the news domain only. This is too limited to train a metric suited for the evaluation of NLG systems of the future.

To address this problem, we employ transfer learning. First, we use the contextual word representations of BERT, a state-of-the-art unsupervised representation learning method for language understanding that has already been successfully incorporated into NLG metrics (e.g., YiSi or BERTscore).

Second, we introduce a novel pre-training scheme to increase BLEURT’s robustness. Our experiments reveal that training a regression model directly over publicly available human ratings is a brittle approach, since we cannot control in what domain and across what time span the metric will be used. The accuracy is likely to drop in the presence of domain drift, i.e., when the text used comes from a different domain than the training sentence pairs. It may also drop when there is a quality drift, when the ratings to be predicted are higher than those used during training — a feature which would normally be good news because it indicates that ML research is making progress.

The success of BLEURT relies on “warming-up” the model using millions of synthetic sentence pairs before fine-tuning on human ratings. We generated training data by applying random perturbations to sentences from Wikipedia. Instead of collecting human ratings, we use a collection of metrics and models from the literature (including BLEU), which allows the number of training examples to be scaled up at very low cost.

BLEURT’s data generation process combines random perturbations and scoring with pre-existing metrics and models.

Experiments reveal that pre-training significantly increases BLEURT’s accuracy, especially when the test data is out-of-distribution.

We pre-train BLEURT twice, first with a language modelling objective (as explained in the original BERT paper), then with a collection of NLG evaluation objectives. We then fine-tune the model on the WMT Metrics dataset, on a set of ratings provided by the user, or a combination of both.The following figure illustrates BLEURT’s training procedure end-to-end.

Results
We benchmark BLEURT against competing approaches and show that it offers superior performance, correlating well with human ratings on the WMT Metrics Shared Task (machine translation) and the WebNLG Challenge (data-to-text). For example, BLEURT is ~48% more accurate than BLEU on the WMT Metrics Shared Task of 2019. We also demonstrate that pre-training helps BLEURT cope with quality drift.

Correlation between different metrics and human ratings on the WMT’19 Metrics Shared Task.

Conclusion
As NLG models have gotten better over time, evaluation metrics have become an important bottleneck for the research in this field. There are good reasons why overlap-based metrics are so popular: they are simple, consistent, and they do not require any training data. In the use cases where multiple reference sentences are available for each candidate, they can be very accurate. While they play a critical part in our infrastructure, they are also very conservative, and only give an incomplete picture of NLG systems’ performance. Our view is that ML engineers should enrich their evaluation toolkits with more flexible, semantic-level metrics.

BLEURT is our attempt to capture NLG quality beyond surface overlap. Thanks to BERT’s representations and a novel pre-training scheme, our metric yields SOTA performance on two academic benchmarks, and we are currently investigating how it can improve Google products. Future research includes investigating multilinguality and multimodality.

Acknowledgements
This project was co-advised by Dipanjan Das. We thank Slav Petrov, Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, Madhavan Kidambi, Ming-Wei Chang, and all the members of the Google Research Language team.

Undergraduates develop next-generation intelligence tools

The coronavirus pandemic has driven us apart physically while reminding us of the power of technology to connect. When MIT shut its doors in March, much of campus moved online, to virtual classes, labs, and chatrooms. Among those making the pivot were students engaged in independent research under MIT’s Undergraduate Research Opportunities Program (UROP). 

With regular check-ins with their advisors via Slack and Zoom, many students succeeded in pushing through to the end. One even carried on his experiments from his bedroom, after schlepping his Sphero Bolt robots home in a backpack. “I’ve been so impressed by their resilience and dedication,” says Katherine Gallagher, one of three artificial intelligence engineers at MIT Quest for Intelligence who works with students each semester on intelligence-related applications. “There was that initial week of craziness and then they were right back to work.” Four projects from this spring are highlighted below.

Learning to explore the world with open eyes and ears

Robots rely heavily on images beamed through their built-in cameras, or surrogate “eyes,” to get around. MIT senior Alon Kosowsky-Sachs thinks they could do a lot more if they also used their microphone “ears.” 

From his home in Sharon, Massachusetts, where he retreated after MIT closed in March, Kosowsky-Sachs is training four baseball-sized Sphero Bolt robots to roll around a homemade arena. His goal is to teach the robots to pair sights with sounds, and to exploit this information to build better representations of their environment. He’s working with Pulkit Agrawal, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, who is interested in designing algorithms with human-like curiosity.

While Kosowsky-Sachs sleeps, his robots putter away, gliding through an object-strewn rink he built for them from two-by-fours. Each burst of movement becomes a pair of one-second video and audio clips. By day, Kosowsky-Sachs trains a “curiosity” model aimed at pushing the robots to become bolder, and more skillful, at navigating their obstacle course.

“I want them to see something through their camera, and hear something from their microphone, and know that these two things happen together,” he says. “As humans, we combine a lot of sensory information to get added insight about the world. If we hear a thunder clap, we don’t need to see lightning to know that a storm has arrived. Our hypothesis is that robots with a better model of the world will be able to accomplish more difficult tasks.”

Training a robot agent to design a more efficient nuclear reactor 

One important factor driving the cost of nuclear power is the layout of its reactor core. If fuel rods are arranged in an optimal fashion, reactions last longer, burn less fuel, and need less maintenance. As engineers look for ways to bring down the cost of nuclear energy, they are eying the redesign of the reactor core.

“Nuclear power emits very little carbon and is surprisingly safe compared to other energy sources, even solar or wind,” says third-year student Isaac Wolverton. “We wanted to see if we could use AI to make it more efficient.” 

In a project with Josh Joseph, an AI engineer at the MIT Quest, and Koroush Shirvan, an assistant professor in MIT’s Department of Nuclear Science and Engineering, Wolverton spent the year training a reinforcement learning agent to find the best way to lay out fuel rods in a reactor core. To simulate the process, he turned the problem into a game, borrowing a machine learning technique for producing agents with superhuman abilities at chess and Go.

He started by training his agent on a simpler problem: arranging colored tiles on a grid so that as few tiles as possible of the same color would touch. As Wolverton increased the number of options, from two colors to five, and four tiles to 225, he grew excited as the agent continued to find the best strategy. “It gave us hope we could teach it to swap the cores into an optimal arrangement,” he says.

Eventually, Wolverton moved to an environment meant to simulate a 36-rod reactor core, with two enrichment levels and 2.1 million possible core configurations. With input from researchers in Shirvan’s lab, Wolverton trained an agent that arrived at the optimal solution.

The lab is now building on Wolverton’s code to try to train an agent in a life-sized 100-rod environment with 19 enrichment levels. “There’s no breakthrough at this point,” he says. “But we think it’s possible, if we can find enough compute resources.”

Making more livers available to patients who need them

About 8,000 patients in the United States receive liver transplants each year, but that’s only half the number who need one. Many more livers might be made available if hospitals had a faster way to screen them, researchers say. In a collaboration with Massachusetts General Hospital, MIT Quest is evaluating whether automation could help to boost the nation’s supply of viable livers.  

In approving a liver for transplant, pathologists estimate its fat content from a slice of tissue. If it’s low enough, the liver is deemed ready for transplant. But there are often not enough qualified doctors to review tissue samples on the tight timeline needed to match livers with recipients. A shortage of doctors, coupled with the subjective nature of analyzing tissue, means that viable livers are inevitably discarded.

This loss represents a huge opportunity for machine learning, says third-year student Kuan Wei Huang, who joined the project to explore AI applications in health care. The project involves training a deep neural network to pick out globules of fat on liver tissue slides to estimate the liver’s overall fat content.

One challenge, says Huang, has been figuring out how to handle variations in how various pathologists classify fat globules. “This makes it harder to tell whether I’ve created the appropriate masks to feed into the neural net,” he says. “However, after meeting with experts in the field, I received clarifications and was able to continue working.”

Trained on images labeled by pathologists, the model will eventually learn to isolate fat globules in unlabeled images on its own. The final output will be a fat content estimate with pictures of highlighted fat globules showing how the model arrived at its final count. “That’s the easy part — we just count up the pixels in the highlighted globules as a percentage of the overall biopsy and we have our fat content estimate,” says the Quest’s Gallagher, who is leading the project.

Huang says he’s excited by the project’s potential to help people. “Using machine learning to address medical problems is one of the best ways that a computer scientist can impact the world.”

Exposing the hidden constraints of what we mean in what we say

Language shapes our understanding of the world in subtle ways, with slight variations in the words we use conveying sharply different meanings. The sentence, “Elephants live in Africa and Asia,” looks a lot like the sentence “Elephants eat twigs and leaves.” But most readers will conclude that the elephants in the first sentence are split into distinct groups living on separate continents but not apply the same reasoning to the second sentence, because eating twigs and eating leaves can both be true of the same elephant in a way that living on different continents cannot.

Karen Gu is a senior majoring in computer science and molecular biology, but instead of putting cells under a microscope for her SuperUROP project, she chose to look at sentences like the ones above. “I’m fascinated by the complex and subtle things that we do to constrain language understanding, almost all of it subconsciously,” she says.

Working with Roger Levy, a professor in MIT’s Department of Brain and Cognitive Sciences, and postdoc MH Tessler, Gu explored how prior knowledge guides our interpretation of syntax and ultimately, meaning. In the sentences above, prior knowledge about geography and mutual exclusivity interact with syntax to produce different meanings.

After steeping herself in linguistics theory, Gu built a model to explain how, word by word, a given sentence produces meaning. She then ran a set of online experiments to see how human subjects would interpret analogous sentences in a story. Her experiments, she says, largely validated intuitions from linguistic theory.

One challenge, she says, was having to reconcile two approaches for studying language. “I had to figure out how to combine formal linguistics, which applies an almost mathematical approach to understanding how words combine, and probabilistic semantics-pragmatics, which has focused more on how people interpret whole utterances.’ “

After MIT closed in March, she was able to finish the project from her parents’ home in East Hanover, New Jersey. “Regular meetings with my advisor have been really helpful in keeping me motivated and on track,” she says. She says she also got to improve her web-development skills, which will come in handy when she starts work at Benchling, a San Francisco-based software company, this summer.

Spring semester Quest UROP projects were funded, in part, by the MIT-IBM Watson AI Lab and Eric Schmidt, technical advisor to Alphabet Inc., and his wife, Wendy.

Read More

Finding Cross-Lingual Syntax in Multilingual BERT

Finding Cross-Lingual Syntax in Multilingual BERT

We projected head-dependent pairs from both English (light colors) and French (dark colors) into a syntactic space trained on solely English mBERT representations. Both English and French head-dependent vectors cluster; dependencies of the same label in both English and French share the same cluster. Although our method has no access to dependency labels, the dependencies exhibit cross-lingual clustering that largely agree with linguists’ categorizations.

If you ask a deep neural network to read a large number of languages, does it share what it’s learned about sentence structure between different languages?

Deep neural language models like BERT have recently demonstrated a fascinating level of understanding of human language. Multilingual versions of these models, like Multilingual BERT (mBERT), are able to understand a large number of languages simultaneously. To what extent do these models share what they’ve learned between languages?

Focusing on the syntax, or grammatical structure, of these languages, we show that Multilingual BERT is able to learn a general syntactic structure applicable to a variety of natural languages. Additionally, we find evidence that mBERT learns cross-lingual syntactic categories like “subject” and “adverb”—categories that largely agree with traditional linguistic concepts of syntax! Our results imply that simply by reading a large amount of text, mBERT is able to represent syntax—something fundamental to understanding language—in a way that seems to apply across many of the languages it comprehends.

More specifically, we present the following:

  • We apply the structural probe method of Hewitt and Manning (2019) to 10 languages, finding syntactic subspaces in a multilingual setting.

  • Through zero-shot transfer experiments, we demonstrate that mBERT represents some syntactic features in syntactic subspaces that overlap between languages.

  • Through an unsupervised method, we find that mBERT natively represents dependency clusters that largely overlap with the UD standard.

Our results are presented in the forthcoming ACL 2020 paper, Finding Universal Grammatical Relations in Multilingual BERT. This post draws from the paper, which is joint work with John Hewitt and Chris Manning. You can also find the code here.

If you’d like to skip the background and jump to the discussion of our methods, click here. Otherwise, read on!

Learning Languages

Past childhood, humans usually learn a language by comparison to one we already speak.1 We naturally draw parallels between sentences with similar meanings—for example, after learning some French, one can work out that Je vis le chat mignon is essentially a word-for-word translation of I see the cute cat. Importantly, humans draw parallels in syntax, or the way words are organized to form meaning; most bilinguals know that mignon is an adjective which describes the noun chat, just as cute describes the noun cat—even though the words are in the opposite order between languages.

How do we train a neural network to understand multiple languages at the same time? One intuitive approach might be to equip the neural network with a multilingual dictionary and a list of rules to transfer between one language to another. (For example, adjectives come before the noun in English but after the noun in Khmer.) However, mirroring recent developments in monolingual neural networks, one more recent method is to give our neural network enormous amounts of data in multiple languages. In this approach, we never provide even a single translation pair, much less a dictionary or grammar rules.

Surprisingly, this trial by fire works! A network trained this way, like Google’s Multilingual BERT, is able to understand a vast number of languages beyond what any human can handle, even a typologically divergent set ranging from English to Hindi to Indonesian.

This raises an interesting question: how do these networks understand multiple languages at the same time? Do they learn each language separately, or do they draw parallels between the way syntax works in different languages?

Knowing What it Means to “Know”

First, let’s ask: what does it even mean for a neural network to “understand” a linguistic property?

One way to evaluate this is through the network’s performance on a downstream task, such as a standard leaderboard like the GLUE (General Language Understanding Evaluation) benchmark. By this metric, large models like BERT do pretty well! However, although high performance numbers suggest in some sense that the model understands some aspects of language generally speaking, they conflate the evaluation of many different aspects of language, and it’s difficult to test specific hypotheses about the individual properties of our model.

Instead, we use a method known as probing. The central idea is as follows: we feed linguistic data for which we know the property we’re interested in exploring (e.g. part-of-speech) through the network we want to probe. Instead of looking at the predictions of the model themselves, for each sentence we feed through, we save the hidden representations, which one can think of as the model’s internal data structures. We then train a probe—a secondary model—to recover the target property from these representations, akin to how a neuroscientist might read out emotions from a MRI scan of your brain.

Probes are usually designed to be simple, to test what the neural network makes easily accessible. intuitively, the harder we try to tease a linguistic property out of the representations, the less the representations themselves matter to your final results. As an example, we might be able to build an extremely complex model to predict whether someone is seeing a cat, based on the raw data coming from the retina; however, this doesn’t mean that the retina itself intrinsically “understands” what a cat is.2

A Tale of Syntax and Subspaces

So what form, exactly, do these hidden representations take? The innards of a neural network like BERT represent each sentence as a series of real-valued vectors (in real life, these are 768-dimensional, but we’ve represented them as three-dimensional here):

A probe, then, is a model that maps from a word vector to some linguistic property of interest. For something like part of speech, this might take the form of a 1-layer neural classifier which predicts a category (like noun or verb).

But how do we evaluate whether a neural network knows something as nebulous as syntax, the way words and phrases are arranged to create meaning? Linguists believe sentences are implicitly organized into syntax trees, which we generate mentally in order to produce a sentence. Here’s an example of what that looks like:

Syntax tree for French Jean qui avait faim joue bien dans le jardin (Jean, who was hungry, plays in the garden).

To probe whether BERT encodes a syntax tree internally, we apply the structural probe method [Hewitt and Manning, 2019]. This finds a linear transformation3 such that the tree constructed by connecting each word to the word closest to it approximates a linguist’s idea of what the parse tree should look like. This ends up looking like this:

Intuitively, we can think of BERT vectors as lying in a 768-dimensional space; the structural probe tries to find a linear subspace of the BERT space which best recovers syntax trees.

Does this work, you might ask? Well, this certainly seems to be the case:

A gold parse tree annotated by a linguist, and a parse tree generated from Monolingual BERT embeddings. From Coenen et al. (2019).

Hewitt and Manning apply this method only to monolingual English BERT; we apply their method to 10 other languages, finding that mBERT encodes syntax to various degrees in all of them. Here’s a table of performance (measured in UUAS, or unlabeled undirected accuracy score) as graphed against the rank of the probe’s linear transformation:

Probing for Cross-Lingual Syntax

With this in mind, we can turn to the question with which we started this blog post:

Does Multilingual BERT represent syntax similarly cross-lingually?

To answer this, we train a structural probe to predict syntax from representations in one language—say, English—and evaluate it on another, like French. If a probe trained on mBERT’s English representations performs well when evaluated on French data, this intuitively suggests that the way mBERT encodes English syntax is similar to the way it encodes French syntax.

Does this work? In a word, basically:

Syntactic trees for a single English sentence generated by structural probes trained on English, French, and Indonesian data.
Black represents the reference syntactic tree as defined by a linguist.
The English structural probe is almost entirely able to replicate the syntactic tree, with one error;
the French probe finds most of the syntactic tree, while the Indonesian probe is able to recover the high-level structure but misses low-level details.

Out of the 11 languages that we evaluate on, we find that probes trained on representations from one language are able to successfully recover syntax trees—to varying degrees—in data from another language. Evaluated on two numerical metrics of parse tree accuracy, applying probes cross-lingually performs surprisingly well! This performance suggests that syntax is encoded similarly in mBERT representations across many different languages.

UUAS DSpr.
Best baseline 0% 0%
Transfer from best source language 62.3% 73.1%
Transfer from holdout subspace (trained on all languages other than eval) 70.5% 79%
Transfer from subspace trained on all languages (including eval) 88.0% 89.0%
Training on evaluation language directly 100% 100%
Table: Improvement for various transfer methods over best baseline, evaluated on two metrics: UUAS (unlabeled undirected accuracy score) and DSpr. (Spearman correlation of tree distances). Percent improvement is calculated with respect to the total possible improvement in recovering syntactic trees over baseline (as represented by in-language supervision.)

Finding Universal Grammatical Relations in mBERT

We’ve shown that cross-lingual syntax exists—can we visualize it?

Recall that the structural probe works by finding a linear subspace optimized to encode syntax trees. Intuitively, this syntactic subspace might focus on syntactic aspects of mBERT’s representations. Can we visualize words in this subspace and get a first-hand view of how mBERT represents syntax?

One idea is to focus on the edges of our syntactic tree, or head-dependent pairs. For example, below, was is the head of the dependent chef:

Let’s try to visualize these vectors in the syntactic subspace and see what happens! Define the head-dependent vector as the vector between the head and the dependent in the syntactic subspace:

We do this for every head-dependent pair in every sentence in our corpus, then visualize the resulting 32-dimensional vectors in two dimensions using t-SNE, a dimensionality reduction algorithm. The results are striking: the dependencies naturally separate into clusters, whose identities largely overlap with the categories that linguists believe are fundamental to language! In the image below, we’ve highlighted the clusters with dependency labels from Universal Dependencies, like amod (adjective modifying a noun) and conj (two clauses joined by a coordinating conjunction like and, or):

Importantly, these categories are multilingual. In the above diagram, we’ve projected head-dependent pairs from both English (light colors) and French (dark colors) into a syntactic space trained on solely English mBERT representations. We see that French head-dependent vectors cluster as well, and that dependencies with the same label in both English and French share the same cluster.

Freedom from Human-Chosen Labels

The fact that BERT “knows” dependency labels is nothing new; previous studies have shown high accuracy in recovering dependency labels from BERT embeddings. So what’s special about our method?

Training a probe successfully demonstrates that we can map from mBERT’s representations to a standard set of dependency category labels. But because our probe needs supervision on a labeled dataset, we’re limited to demonstrating the existence of a mapping to human-generated labels. In other words, probes make it difficult to gain insight into the categories drawn by mBERT itself.

By contrast, the structural probe never receives information about what humans think dependency label categories should look like. Because we only ever pass in head-dependent pairs, rather than the category labels associated with these pairs, our method is free from human category labels. Instead, the clusters that emerge from the data are a view into mBERT’s innate dependency label representations.4

For more work on the latent linguistic ontology of BERT, see: Michael et al. (2020) and Limisiewicz et al. (2020).

Analyzing mBERT’s Internal Representations

Taking a closer look, what can we discover about how mBERT categorizes head-dependency relations, as compared to human labels? Our results show that mBERT draws slightly different distinctions from Universal Dependencies. Some are linguistically valid distinctions not distinguished by the UD standards, while others are more influenced by word order, separating relations that most linguists would group together. Here’s a brief overview:

t-SNE visualization of 100,000 syntactic difference vectors projected into the cross-lingual syntactic subspace of Multilingual BERT. We exclude `punct` and visualize the top 11 dependencies remaining, which are collectively responsible for 79.36% of the dependencies in our dataset. Clusters of interest highlighted in yellow; linguistically interesting clusters labeled.
  • Adjectives: We find that mBERT breaks adjectives into two categories: prenominal adjectives in cluster (b) (e.g., Chinese 獨特的地理) and postnominal adjectives in cluster (u) (e.g., French applications domestiques).

  • Nominal arguments: mBERT maintains the UD distinction between subject and object. However, indirect objects cluster with direct objects; other adjuncts cluster with subjects if near the beginning of a sentence and obj otherwise. This suggests that mBERT categorizes nominal arguments into pre-verbal and post-verbal categories.

  • Relative clauses In the languages in our dataset, there are two major ways of forming relative clauses. Relative pronouns (e.g., English the man who is hungry are classed by Universal Dependencies as being an nsubj dependent, while subordinating markers (e.g., English I know that she saw me) are classed as the dependent of a mark relation. However, mBERT groups both of these relations together, clustering them distinctly from most nsubj and mark relations.

  • Determiners The linguistic category of determiners (det) is split into definite articles (i), indefinite articles (e), possessives (f), and demonstratives (g). Sentence-initial definite articles (k) cluster separately from other definite articles (j).

  • Expletive subjects Just as in UD, expletive subjects, or third person pronouns with no syntactic meaning (e.g. English It is cold, French Il faudrait, Indonesian Yang menjadi masalah kemudian), cluster separately (k) from other nsubj relations (small cluster in the bottom left).

Conclusion

In this work, we’ve found that BERT shares some of the ways it represents syntax between its internal representations of different languages. We’ve provided evidence that mBERT learns natural syntactic categories that overlap cross-lingually. Interestingly, we also find evidence that these categories largely agree with traditional linguistic concepts of syntax.

Excitingly, our methods allow us to examine fine-grained syntactic categories native to mBERT. By removing assumptions on what the ontology of syntactic relations should look like, we discover that mBERT’s internal representations innately share significant overlap with linguists’ idea of what syntax looks like. However, there are also some interesting differences between the two, the nature of which is definitely worth further investigation!

If you’d like to run some tests or generate some visualizations of your own, please head on over to the multilingual-probing-visualization codebase!

Finally, I’m deeply grateful to John Hewitt and Chris Manning, as well as members of the Stanford NLP group for their advice, including but not limited to: Erik Jones, Sebastian Schuster, and Chris Donahue. Many thanks also to John Hewitt and Dylan Losey for reading over the draft of this blog post, and to Mohammad Rasooli for advice on Farsi labels in the original paper.

  1. For a linguistic perspective (specifically, in the field of second-language acquisition), see Cook (1995)

  2. This definition is a general overview and leaves some important questions. How exactly, for instance, do we evaluate the complexity of our probe? Relatedly, how much of the performance improvement is due to the model, and how much is due to the probe itself? For more work on this, see Hewitt and Liang (2019) and Pimentel et al. (2020)

  3. A linear transformation on a vector is simply multiplication by a matrix:  

  4. Technically speaking, this is constrained to the assumption that BERT would choose the same head-dependent pairs as UD does. 

Read More

Microsoft President Brad Smith talks data, Covid-19, and a potential “digital 9/11”

In a virtual discussion hosted by MIT last week, viewers learned that there are many problems that concern Microsoft President Brad Smith: things like climate change, Covid-19, and the work of the future.

Attendees also learned how seriously he takes the issue of computer security: 45 minutes into the event, his Windows system automatically rebooted for a lightning-quick software update.

“There are a lot of benefits to working from home,” he said with a laugh after rejoining, “but it certainly also adds a level of unpredictability.”

Smith’s conversation with MIT Professor Daniela Rus on May 14 spanned a wide range of topics, from the challenges of Covid-19 to the security of online voting. The fireside chat was held as part of MIT’s “Hot Topics in Computing” series, founded by the Computer Science and Artificial Intelligence Laboratory (CSAIL). The series is now an Institute-wide effort being co-presented with the MIT Stephen A. Schwarzman College of Computing (SCC). SCC Dean Dan Huttenlocher opened the event with a welcome to the audience and introduced Smith and Rus.

Having worked at Microsoft for 25 years and served as its president since 2015, Smith gave an inside look at what it’s been like to be there in recent months — and what the tech company has done to try to help curb the spread of Covid-19. 

In March, for example, Microsoft developed a chatbot to help people determine if they might need a Covid-19 test. Within weeks of deploying the app at a hospital in Seattle, Washington, the company started rolling it out more broadly. The chatbot was ultimately used 190 million times in April, and is now available at 1,500 institutions across 23 countries. 

Smith also discussed the promising work being done with Bluetooth-based contact tracing, but expressed skepticism that it could be adopted at a meaningful scale.

“Not everyone is going to walk around with an app on their phone,” he said. “I think we should recognize that it is a tool, and not a panacea.”

Smith and his colleague Carol Ann Browne recently co-wrote the book “Tools and Weapons,” which examines the promise and peril of technology for both good and bad. The authors draw on lessons in history, from Edward Snowden’s revelations about government surveillance to the Cambridge Analytica scandal, to explore the future of technology and how it needs to be managed. 

While he agreed with Rus’ assessment of the book as one “where the geeks are the heroes,” Smith also warned of the dangers of a future “digital 9/11” with respect to the electric grid and future presidential elections. 

“You don’t need to be a PhD in computer science to have an important role to play,” he said. “As consumers, citizens, and voters, we’re at a point of time where we’d all benefit from being better informed.” 

The long-time sustainability advocate also spoke about Microsoft’s ambitious goals to not just be carbon-negative by 2030, but to remove all of the carbon that the company has emitted since 1975 in the next 30 years. At a higher level, Smith advocated for making what he calls “the biggest R&D investment of our century” to develop new techniques to remove carbon from the environment. 

“We’re going to need huge breakthroughs in the next three decades if we are going to achieve this fundamental goal of protecting the planet the way we need to,” he said.

In the hourlong conversation, Smith often implored his audience of computer scientists and technologists to recognize the responsibility they have to solve tangible real-world problems. He pointed out that, for all the buzz about big data over the last 20 years, the Covid-19 pandemic has actually led to government officials making decisions directly informed by data.

“Data is running the economy and deciding who can leave their house and who needs to stay home,” Smith said. “The world will need the kind of technology that we can create … [and] you all have an opportunity to make a more positive impact in the world than perhaps any generation of MIT students has had before. How can that not get you excited about getting up in the morning?”

Read More

Open-Sourcing BiT: Exploring Large-Scale Pre-training for Computer Vision

Open-Sourcing BiT: Exploring Large-Scale Pre-training for Computer Vision

Posted by Lucas Beyer and Alexander Kolesnikov, Research Engineers, Google Research, Zürich

A common refrain for computer vision researchers is that modern deep neural networks are always hungry for more labeled data — current state-of-the-art CNNs need to be trained on datasets such as OpenImages or Places, which consist of over 1M labelled images. However, for many applications, collecting this amount of labeled data can be prohibitive to the average practitioner.

A common approach to mitigate the lack of labeled data for computer vision tasks is to use models that have been pre-trained on generic data (e.g., ImageNet). The idea is that visual features learned on the generic data can be re-used for the task of interest. Even though this pre-training works reasonably well in practice, it still falls short of the ability to both quickly grasp new concepts and understand them in different contexts. In a similar spirit to how BERT and T5 have shown advances in the language domain, we believe that large-scale pre-training can advance the performance of computer vision models.

In “Big Transfer (BiT): General Visual Representation Learning” we devise an approach for effective pre-training of general features using image datasets at a scale beyond the de-facto standard (ILSVRC-2012). In particular, we highlight the importance of appropriately choosing normalization layers and scaling the architecture capacity as the amount of pre-training data increases. Our approach exhibits unprecedented performance adapting to a wide range of new visual tasks, including the few-shot recognition setting and the recently introduced “real-world” ObjectNet benchmark. We are excited to share the best BiT models pre-trained on public datasets, along with code in TF2, Jax, and PyTorch. This will allow anyone to reach state-of-the-art performance on their task of interest, even with just a handful of labeled images per class.

Pre-training
In order to investigate the effect of data scale, we revisit common design choices of the pre-training setup (such as normalizations of activations and weights, model width/depth and training schedules) using three datasets: ILSVRC-2012 (1.28M images with 1000 classes), ImageNet-21k (14M images with ~21k classes) and JFT (300M images with ~18k classes). Importantly, with these datasets we concentrate on the previously underexplored large data regime.

We first investigate the interplay between dataset size and model capacity. To do this we train classical ResNet architectures, which perform well, while being simple and reproducible. We train variants from the standard 50-layer deep “R50x1” up to the 4x wider and 152-layer deep “R152x4” on each of the above-mentioned datasets. A key observation is that in order to profit from more data, one also needs to increase model capacity. This is exemplified by the red arrows in the left-hand panel of the figure below

Left: In order to make effective use of a larger dataset for pre-training, one needs to increase model capacity. The red arrows exemplify this: small architectures (smaller point) become worse when pre-trained on the larger ImageNet-21k, whereas the larger architectures (larger points) improve. Right: Pre-training on a larger dataset alone does not necessarily result in improved performance, e.g., when going from ILSVRC-2012 to the relatively larger ImageNet-21k. However, by also increasing the computational budget and training for longer, the performance improvement is pronounced.

A second, even more important observation, is that the training duration becomes crucial. If one pre-trains on a larger dataset without adjusting the computational budget and training longer, performance is likely to become worse. However, by adapting the schedule to the new dataset, the improvements can be significant.

During our exploration phase, we discovered another modification crucial to improving performance. We show that replacing batch normalization (BN, a commonly used layer that stabilizes training by normalizing activations) with group normalization (GN) is beneficial for pre-training at large scale. First, BN’s state (mean and variance of neural activations) needs adjustment between pre-training and transfer, whereas GN is stateless, thus side-stepping this difficulty. Second, BN uses batch-level statistics, which become unreliable with small per-device batch sizes that are inevitable for large models. Since GN does not compute batch-level statistics, it also side-steps this issue. For more technical details, including the use of a weight standardization technique to ensure stable behavior, please see our paper.

Summary of our pre-training strategy: take a standard ResNet, increase depth and width, replace BatchNorm (BN) with GroupNorm and Weight Standardization (GNWS), and train on a very large and generic dataset for many more iterations.

Transfer Learning
Following the methods established in the language domain by BERT, we fine-tune the pre-trained BiT model on data from a variety of “downstream” tasks of interest, which may come with very little labeled data. Because the pre-trained model already comes with a good understanding of the visual world, this simple strategy works remarkably well.

Fine-tuning comes with a lot of hyper-parameters to be chosen, such as learning-rate, weight-decay, etc. We propose a heuristic for selecting these hyper-parameters that we call “BiT-HyperRule”, which is based only on high-level dataset characteristics, such as image resolution and the number of labeled examples. We successfully apply the BiT-HyperRule on more than 20 diverse tasks, ranging from natural to medical images.

Once the BiT model is pre-trained, it can be fine-tuned on any task, even if only few labeled examples are available.

When transfering BiT to tasks with very few examples, we observe that as we simultaneously increase the amount of generic data used for pre-training and the architecture capacity, the ability of the resulting model to adapt to novel data drastically improves. On both 1-shot and 5-shot CIFAR (see Fig below) increasing model capacity yields limited returns when pre-training on ILSVRC (green curves). Yet, with large-scale pre-training on JFT, each step-up in model capacity yields massive returns (brown curves), up to BiT-L which attains 64% 1-shot and 95% 5-shot.

The curves depict median accuracy over 5 independent runs (light points) when transferring to CIFAR-10 with only 1 or 5 images per class (10 or 50 images total). It is evident that large architectures pre-trained on large datasets are significantly more data-efficient.

In order to verify that this result holds more generally, we also evaluate BiT on VTAB-1k, which is a suite of 19 diverse tasks with only 1000 labeled examples per task. We transfer the BiT-L model to all these tasks and achieve a score of 76.3% overall, which is a 5.8% absolute improvement over the previous state-of-the-art.

We show that this strategy of large-scale pre-training and simple transfer is effective even when a moderate amount of data is available by evaluating BiT-L on several standard computer vision benchmarks such as Oxford Pets and Flowers, CIFAR, etc. On all of these, BiT-L matches or surpasses state-of-the-art results. Finally, we use BiT as a backbone for RetinaNet on the MSCOCO-2017 detection task and confirm that even for such a structured output task, using large-scale pre-training helps considerably.

Left: Accuracy of BiT-L compared to the previous state-of-the-art general model on various standard computer vision benchmarks. Right: Results in average precision (AP) of using BiT as backbone for RetinaNet on MSCOCO-2017.

It is important to emphasize that across all the different downstream tasks we consider, we do not perform per-task hyper-parameter tuning and rely on the BiT-HyperRule. As we show in the paper, even better results can be achieved by tuning hyperparameters on sufficiently large validation data.

Evaluation with ObjectNet
To further assess the robustness of BiT in a more challenging scenario, we evaluate BiT models that were fine-tuned on ILSVRC-2012 on the recently introduced ObjectNet dataset. This dataset closely resembles real-world scenarios, where objects may appear in atypical context, viewpoint, rotation, etc. Interestingly, the benefit from data and architecture scale is even more pronounced with the BiT-L model achieving unprecedented top-5 accuracy of 80.0%, an almost 25% absolute improvement over the previous state-of-the-art.

Results of BiT on the ObjectNet evaluation dataset. Left: top-5 accuracy, right: top-1 accuracy.

Conclusion
We show that given pre-training on large amounts of generic data, a simple transfer strategy leads to impressive results, both on large datasets as well as tasks with very little data, down to a single image per class. We release the BiT-M model, a R152x4 pre-trained on ImageNet-21k, along with colabs for transfer in Jax, TensorFlow2, and PyTorch. In addition to the code release, we refer the reader to the hands-on TensorFlow2 tutorial on how to use BiT models. We hope that practitioners and researchers find it a useful alternative to commonly used ImageNet pre-trained models.

Acknowledgements
We would like to thank Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby who have co-authored the BiT paper and been involved in all aspects of its development, as well as the Brain team in Zürich. We also would like to thank Andrei Giurgiu for his help in debugging input pipelines. We thank Tom Small for creating the animations used in this blogpost. Finally, we refer the interested reader to the related approaches in this direction by our colleagues in Google Research, Noisy Student, as well as Facebook Research’s highly relevant Exploring the Limits of Weakly Supervised Pretraining.

Fireflies helps companies get more out of meetings

Many decisions are made and details sorted out in a productive business meeting. But in order for that meeting to translate into results, participants have to remember all those details, understand their assignments, and follow through on commitments.

The startup Fireflies.ai is helping people get the most out of their meetings with a note-taking, information-organizing virtual assistant named Fred. Fred transcribes every word of meetings and then uses artificial intelligence to help people sort and share that information later on.

“There’s a tremendous amount of data generated in meetings that can help your team stay on the same page,” says Sam Udotong ’16, who founded the company with Krish Ramineni in 2016. “We let people capture that data, search through it, and then share it to the places that matter most.”

The tool integrates with popular meeting and scheduling software like Zoom and Google Calendar so users can quickly add Fred to calls. It also works with collaboration platforms like Slack and customer management software like Salesforce to help ensure plans turn into coordinated action.

Fireflies is used by people working in roles including sales, recruiting, and product management. They can use the service to automate project management tasks, screen candidates, and manage internal team communications.

In the last few months, driven in part by the Covid-19 pandemic, Fred has sat through millions of minutes of meetings involving more than half a million people. And the founders believe Fred can do more than simply help people adjust to remote work; it can also help them collaborate more effectively than ever before.

“[Fred] is giving you perfect memory,” says Udotong, who serves as Firelies’ chief technology officer. “The dream is for everyone to have perfect recall and make all their decisions based on the right information. So being able to search back to exact points in conversation and remember that is powerful. People have told us it makes them look smarter in front of clients.”

Taking the leap

Udotong was introduced to the power of machine learning in his first year at MIT while working on a project in which students built a drone that could lead people on campus tours. Later, during his first MIT hackathon, he sought to use machine learning in a cryptography solution. That’s when he met Ramineni, who was a student at the University of Pennsylvania. That’s also when Fireflies was born — although the founders would go on to change everything about the company besides its name as they sought to use artificial intelligence to improve efficiency in a range of fields.

“We ended up building six iterations of Fireflies before this current meeting assistant,” Udotong remembers. “And every time we would build a different iteration, we would tell our friends, ‘Download it, use it, and get back to us next week, we’ll grab coffee.’ We were making all these agreements and promises, and it became really challenging to keep track of all the conversations we were having to get our products out there. We thought, ‘What if we just had an AI that could keep track of conversations for us?’”

The founders’ initial note-taking solution, built in short bursts between classes and homework, tracked action items written in messages, sending reminders to users later on.

Following Udotong’s graduation with a degree in aeronautics and astronautics in 2016, the founders decided to use a $25,000 stipend they received from Rough Draft Ventures, along with $5,000 from the MIT Sandbox Innovation Fund, to work on Fireflies through the summer.

The plan was to work on Fireflies for another short burst: Ramineni was already making plans to attend Cambridge University for his master’s degree in the fall, and Udotong was weighing acceptance letters from graduate schools as well as job offers. By July, however, the founders had changed their plans.

“I think deciding [on a career path] is really hard these days, even if you identify your passion,” Udotong says. “The easy path for someone in tech is to follow the money and go work for Google or Facebook. We decided to go a different route and take the risk.”

They moved to Ramineni’s hometown of San Francisco to officially launch the company. Udotong remembers getting to San Francisco with $100 dollars in his bank account.

The founders had fully committed themselves to Fireflies, but it didn’t make starting the company any easier. They decided not to raise venture capital in the company’s early years, and Ramineni admits to questioning whether going all in on Fireflies was the right decision as recently at 2018.

The founders also weren’t sure a radically new software category would be embraced so readily by businesses. They continued to invest in the voice AI space, as they believed that the need for their technology was growing and the timing was right.

“We realized that there’s a ton of data generated every day through speech, either in meetings like Zoom or in person,” Ramineni says. “Today, two hours after your meeting, unless you’re taking good notes or recording, you’re not going to be able to recall everything. You might not even remember what action items you agreed to a few hours ago. It’s such a common problem that people don’t even know it’s an issue. You have meetings and you expect things to slip through the cracks.”

Illuminating conversations

Today the Fireflies solution shows little trace of the arduous journey the founders took to get to this point. In fact, building simplicity into the tool has been a major focus for the founders.

Fred can join calendar events automatically or be added to meetings using the fred@fireflies.ai address. Fred joins Zoom, Google Meet, Skype, or Microsoft calls as a participant, silently transcribing and generating notes from the meeting. After the meeting, the AI assistant sends a full transcript to whomever the organizer chooses, allowing users to click on sections of the transcript to hear that part of the meeting audio. Users can also search the transcript and go through an hourlong meeting in five minutes, according to the company. The transcript can also surface action items, tasks, metrics, pricing, and other topics of interest.

After each meeting, Fireflies can automatically sync all this meeting data into apps from companies like Slack, Salesforce, and Hubspot.

“Fireflies is like a personal assistant that helps connect your systems of communication with your systems of record,” Udotong says. “If you’re having these meetings over Zoom and Google Meet every day, and you’re interacting with Slack or Trello, Fireflies is that middle router that can bring synchronicity to your work life.”

In the midst of the Covid-19 pandemic, millions of companies have been forced to operate remotely, and the founders think the impact of that response will be felt for far longer than the virus.

“I think the world’s now realizing that people can be fully distributed,” says Ramineni, who notes Fireflies’ team has been remote since he and Udotong began working together in college hackathons from different campuses in 2014.

And as the company has grown, customers have begun using Fred for use cases the founders hadn’t even considered, like sending Fred to meetings that they can’t attend and reviewing the notes later on. Customers, the founders believe, are realizing that being able to quickly search, sort, and otherwise collaborate across audio data unlocks a world of new possibilities.

“It’s kind of like what Google did with search,” Udotong says. “There was five to 10 years of web data building up, and there was no way for people to find what they were looking for. The same thing is true today of audio and meeting data. It’s out there, but there’s no way to actually find what you’re looking for because it’s never even stored in the first place.”

Read More

Announcing the 7th Fine-Grained Visual Categorization Workshop

Announcing the 7th Fine-Grained Visual Categorization Workshop

Posted by Christine Kaeser-Chen, Software Engineer and Serge Belongie, Visiting Faculty, Google Research

Fine-grained visual categorization refers to the problem of distinguishing between images of closely related entities, e.g., a monarch butterfly (Danaus plexippus) from a viceroy (Limenitis archippus). At the time of the first FGVC workshop in 2011, very few fine-grained datasets existed, and the ones that were available (e.g., the CUB dataset of 200 bird species, launched at that workshop) presented a formidable challenge to the leading classification algorithms of the time. Fast forward to 2020, and the computer vision landscape has undergone breathtaking changes. Deep learning based methods helped CUB-200-2011 accuracy rocket from 17% to 90% and fine-grained datasets have proliferated, with data arriving from a diverse array of institutions, such as art museums, apparel retailers, and cassava farms.

In order to help support even further progress in this field, we are excited to sponsor and co-organize the 7th Workshop on Fine-Grained Visual Categorization (FGVC7), which will take place as a virtual gathering on June 19, 2020, in conjunction with the IEEE conference on Computer Vision and Pattern Recognition (CVPR). We’re excited to highlight this year’s world-class lineup of fine-grained challenges, ranging from fruit tree disease prediction to fashion attributes, and we invite computer vision researchers from across the world to participate in the workshop.

The FGVC workshop at CVPR 2020 focuses on subordinate categories, including (from left to right) wildlife camera traps, plant pathology, birds, herbarium sheets, apparel, and museum artifacts.

Real-World Impact of the FGVC Challenges
In addition to pushing the frontier of fine-grained recognition on ever more challenging datasets, each FGVC workshop cycle provides opportunities for fostering new collaborations between researchers and practitioners. Some of the efforts from the FGVC workshop have made the leap into the hands of real world users.

The 2018 FGVC workshop hosted a Fungi challenge with data for 1,500 mushroom species provided by the Danish Mycological Society. When the competition concluded, the leaderboard was topped by a team from Czech Technical University and the University of West Bohemia.

The mycologists subsequently invited the Czech researchers for a visit to Copenhagen to explore further collaboration and field test a new workflow for collaborative machine learning research in biodiversity. This resulted in a jointly authored conference paper, a mushroom recognition app for Android and iOS, and an open access model published on TensorFlow Hub.

The Svampeatlas app for mushroom recognition is a result of a Danish-Czech collaboration spun out of the FGVC 2018 Fungi challenge. The underlying model is now published on TF Hub. Images used with permission of the Danish Mycological Society.

The iCassava Disease Challenge from 2019 mentioned above is another example of an FGVC team effort finding its way into the real world. In this challenge, Google researchers in Ghana collaborated with Makerere University and the National Crops Resources Research Institute (NaCRRI) to produce an annotated dataset of five cassava disease categories.

Examples of cassava leaf disease represented in the 2019 iCassava challenge.

The teams are testing a new model in the fields in Uganda with local farmers, and the model will be published on TFHub soon.

This Year’s Challenges
FGVC7 will feature six challenges, four of which represent sequels to past offerings, and two of which are brand new.

In iWildCam, the challenge is to identify different species of animals in camera trap images. Like its predecessors in 2018 and 2019, this year’s competition makes use of data from static, motion-triggered cameras used by biologists to study animals in the wild. Participants compete to build models that address diverse regions from around the globe, with a focus on generalization to held-out camera deployments within those regions, which exhibit differences in device model, image quality, local environment, lighting conditions, and species distributions, making generalization difficult.

It has been shown that species classification performance can be dramatically improved by using information beyond the image itself. In addition, since an ecosystem can be monitored in a variety of ways (e.g., camera traps, citizen scientists, remote sensing), each of which has its own strengths and limitations, it is important to facilitate the exploration of techniques for combining these complementary modalities. To this end, the competition provides a time series of remote sensing imagery for each camera trap location, as well as images from the iNaturalist competition datasets for species in the camera trap data.

Side-by-side comparison of image quality from iWildcam, captured from wildlife camera traps, (left) and iNaturalist (right), captured by conventional cameras. Images are from the 2020 iWildCam Challenge, and the iNaturalist competition datasets from 2017 and 2018.

The Herbarium Challenge, now in its second year, entails plant species identification, based on a large, long-tailed collection of herbarium specimens. Developed in collaboration with the New York Botanical Garden (NYBG), this challenge features over 1 million images representing over 32,000 plant species. Last year’s challenge was based on 46,000 specimens for 680 species. Being able to recognize species from historical herbarium collections can not only help botanists better understand changes in plant life on our planet, but also offers a unique opportunity to identify previously undescribed new species in the collection.

Representative examples of specimens from the 2020 Herbarium challenge. Images used with permission of the New York Botanical Garden.

In this year’s iMat Fashion challenge, participants compete to perform apparel instance segmentation and fine-grained attribute classification. The goal of this competition is to push the state of the art in fine-grained segmentation by joining forces between the fashion and computer vision communities. This challenge is in its third iteration, growing both in size and level of detail over past years’ offerings.

The last of the sequels is iMet, in which participants are challenged with building algorithms for fine-grained attribute classification on works of art. Developed in collaboration with the Metropolitan Museum of Art, the dataset has grown significantly since the 2019 edition, with a wide array of new cataloguing information generated by subject matter experts including multiple object classifications, artist, title, period, date, medium, culture, size, provenance, geographic location, and other related museum objects within the Met’s collection.

Semi-Supervised Aves is one of the new challenges at this year’s workshop. While avian data from iNaturalist has featured prominently in past FGVC challenges, this challenge focuses on the problem of learning from partially labeled data, a form of semi-supervised learning. The dataset is designed to expose some of the challenges encountered in realistic settings, such as the fine-grained similarity between classes, significant class imbalance, and domain mismatch between the labeled and unlabeled data.

Rounding out the set of challenges is Plant Pathology. In this challenge, the participants attempt to spot foliar diseases of apples using a reference dataset of expert-annotated diseased specimens. While this particular challenge is new to the FGVC community, it is the second such challenge to involve plant disease, the first being iCassava at last year’s FGVC.

Invitation to Participate
The results of these competitions will be presented at the FGVC7 workshop by top performing teams. We invite researchers, practitioners, and domain experts to participate in the FGVC workshop to learn more about state-of-the-art advances in fine-grained image recognition. We are excited to encourage the community’s development of cutting edge algorithms for fine-grained visual categorization and foster new collaborations with global impact!

Acknowledgements
We’d like to thank our colleagues and friends on the FGVC7 organizing committee for working together to advance this important area. At Google we would like to thank Hartwig Adam, Kiat Chuan Tan, Arvi Gjoka, Kimberly Wilber, Sara Beery, Mikhail Sirotenko, Denis Brulé, Timnit Gebru, Ernest Mwebaze, Wojciech Sirko, Maggie Demkin.