Introducing TF-Coder, a tool that writes tricky TensorFlow expressions for you!

Posted by Kensen Shi, Google Research

When manipulating tensors, one must keep track of multiple dimensions, tensor shape and DType compatibility, and of course mathematical correctness. Additionally, there are hundreds of TensorFlow operations, and finding the right ones to use can be a challenge.

Instead of coding your tensor manipulation directly, what if you could just demonstrate it through an illustrative example and get the corresponding code automatically? TensorFlow Coder (TF-Coder) makes this possible!

TF-Coder is a program synthesis tool that helps you write TensorFlow code. First, the tool asks for an input-output example of the desired tensor transformation. Then, it runs a combinatorial search to find TensorFlow expressions that perform that transformation. TF-Coder’s output is real TensorFlow code that you can include in your projects.

The following one-minute video introduces TF-Coder, and this Colab notebook allows you to use the TF-Coder tool for your own tensor manipulation problems.

In this blog post, we’ll illustrate various scenarios where TF-Coder can help you write TensorFlow code.

Programming in TensorFlow by example

Suppose you want to “add” an M-element vector to an N-element vector in a broadcasted way to produce an M x N matrix containing all pairwise sums. Instead of digging through TensorFlow documentation to figure out how to do this, you can provide an input-output example (using M = 3 and N = 4):

Input tensors, as a dict mapping input variable names to example tensor values:

inputs = {
    'rows': [10, 20, 30],
    'cols': [1, 2, 3, 4],
}

The desired output tensor, corresponding to the provided input tensors:

output = [[11, 12, 13, 14],
          [21, 22, 23, 24],
          [31, 32, 33, 34]]

Given this information (already entered into the TF-Coder Colab by default), the TF-Coder tool will find the appropriate TensorFlow code automatically in a fraction of a second:

tf.add(cols, tf.expand_dims(rows, 1))
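If you'd like to sanity-check a synthesized expression outside of the Colab, you can simply run it; a minimal sketch (assuming TensorFlow 2.x with eager execution) looks like this:

import tensorflow as tf

rows = tf.constant([10, 20, 30])
cols = tf.constant([1, 2, 3, 4])

# tf.expand_dims(rows, 1) reshapes rows to shape (3, 1), so adding the
# (4,)-shaped cols broadcasts to a (3, 4) matrix of pairwise sums.
result = tf.add(cols, tf.expand_dims(rows, 1))
print(result.numpy())
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]]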

The problem above was pretty simple, presented just to illustrate the idea of programming by example. TF-Coder can be useful for harder problems as well, as we’ll see below.

TF-Coder helps you find the right function to use

Let’s suppose you are working with a numerical feature such as the price of an item. The prices in your dataset have a wide range, e.g., from under $10 to over $1000. If these prices are used directly as features, your model may overfit to specific prices in the training data, and it may also have difficulty with outlier prices during evaluation.

To deal with these issues, you may want to use bucketing to transform the numerical prices into categorical features. For example, using bucket boundaries of [10, 50, 100, 1000] means that prices under $10 fall into bucket 0, prices between $10 and $50 fall into bucket 1, and so on.

After choosing bucket boundaries, how do you actually map the numerical prices to the bucket indices using TensorFlow? For example, given the following bucket boundaries and item prices:

# Input tensors
boundaries = [10, 50, 100, 1000]
prices = [15, 3, 50, 90, 100, 1001]

you want to compute the bucket number for each item:

# Output tensor
bucketed_prices = [1, 0, 2, 2, 3, 4]

Although TensorFlow comes with various bucketing operations, it may be tricky to figure out which specific operation does this exact kind of bucketing. Since TF-Coder can identify hundreds of TensorFlow operations by behavior, you can look up the correct operation by providing an input-output example:

# Input-output example
inputs = {
    'boundaries': [10, 50, 100, 1000],
    'prices': [15, 3, 50, 90, 100, 1001],
}
output = [1, 0, 2, 2, 3, 4]

Within seconds, TF-Coder outputs the following solution:

tf.searchsorted(boundaries, prices, side='right')

This gives us a useful hint, and the documentation for tf.searchsorted confirms that this code indeed performs the bucketing as desired.
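As a quick illustration of the side='right' behavior (a minimal check, assuming TensorFlow 2.x), running the solution on the example reproduces the desired buckets, with prices equal to a boundary falling into the bucket to its right:

import tensorflow as tf

boundaries = tf.constant([10, 50, 100, 1000])
prices = tf.constant([15, 3, 50, 90, 100, 1001])

# side='right' places values equal to a boundary (e.g., 50 and 100) into
# the bucket to the right of that boundary.
buckets = tf.searchsorted(boundaries, prices, side='right')
print(buckets.numpy())  # [1 0 2 2 3 4]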

TF-Coder helps you combine functions in clever ways

Now let’s consider another problem: compute a 0-1 tensor that identifies the maximum element of each row of the input tensor.

# Input tensor
scores = [[0.7, 0.2, 0.1],
          [0.4, 0.5, 0.1],
          [0.4, 0.4, 0.2],
          [0.3, 0.4, 0.3],
          [0.0, 0.0, 1.0]]

# Output tensor
top_scores = [[1, 0, 0],
              [0, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]]

Note that if the same largest element appears multiple times within a row, such as in the third row of scores, then only the first such largest element should be marked, so that every row of top_scores has exactly one entry of 1.

Unlike in the last problem, there is no single TensorFlow function that performs this computation. If you search the documentation for “max”, you may find that tf.reduce_max, tf.argmax, and tf.maximum are relevant, but which one should you use? tf.reduce_max produces [0.7, 0.5, 0.4, 0.4, 1.0], tf.argmax produces [0, 1, 0, 1, 2], and tf.maximum isn’t right because it takes two arguments. None of these look close to our desired output.

TF-Coder can help solve tricky problems like this. You can write the problem in the form of an input-output example:

# Input-output example
inputs = {
    'scores': [[0.7, 0.2, 0.1],
               [0.4, 0.5, 0.1],
               [0.4, 0.4, 0.2],
               [0.3, 0.4, 0.3],
               [0.0, 0.0, 1.0]],
}
output = [[1, 0, 0],
          [0, 1, 0],
          [1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]

TF-Coder uses a combination of tf.one_hot and tf.argmax in a short solution to this problem:

tf.cast(tf.one_hot(tf.argmax(scores, axis=1), 3), tf.int32)

Through a detailed search over combinations of TensorFlow operations, TF-Coder often finds elegant solutions like this, which may simplify and speed up your TensorFlow programs.
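A quick check (again assuming TensorFlow 2.x) confirms that the solution reproduces the example output, including the tie in the third row of scores:

import tensorflow as tf

scores = tf.constant([[0.7, 0.2, 0.1],
                      [0.4, 0.5, 0.1],
                      [0.4, 0.4, 0.2],
                      [0.3, 0.4, 0.3],
                      [0.0, 0.0, 1.0]])

# tf.argmax picks one index per row; on this example it marks the first
# maximum in the tied third row, matching the desired output.
top_scores = tf.cast(tf.one_hot(tf.argmax(scores, axis=1), 3), tf.int32)
print(top_scores.numpy())
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]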

TF-Coder helps you write correct code with less debugging

Consider normalizing lists of integer counts into probability distributions by dividing each row by the sum of that row. For instance:

# Input tensor
counts = [[0, 1, 0, 0],
          [0, 1, 1, 0],
          [1, 1, 1, 1]]

# Output tensor
normalized = [[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.25, 0.25, 0.25, 0.25]]

Even if you know relevant functions to use (tf.reduce_sum followed by tf.divide), writing the correct code is still nontrivial. A first attempt may look like this:

# First attempt
normalized = tf.divide(counts, tf.reduce_sum(counts, axis=1))

Is this right? There are many potential pitfalls to think about:

  • Is the summation axis correct, or should it be axis=0?
  • Are the shapes of counts and tf.reduce_sum(counts, axis=1) compatible for division, or do you need to reshape or transpose either of these?
  • counts and tf.reduce_sum(counts, axis=1) are both tf.int32 tensors. Can tf.int32 tensors be divided, or do you need to cast them to a float DType first?
  • Are the two arguments in the correct order, or should they be swapped?
  • Does the output have type tf.int32, tf.float32, or something else?
  • Is there a simpler or better way that was not considered?

You can give this task to TF-Coder with the following input-output example:

# Input-output example
inputs = {
    'counts': [[0, 1, 0, 0],
               [0, 1, 1, 0],
               [1, 1, 1, 1]],
}
output = [[0.0, 1.0, 0.0, 0.0],
          [0.0, 0.5, 0.5, 0.0],
          [0.25, 0.25, 0.25, 0.25]]

TF-Coder’s solution is:

tf.cast(tf.divide(counts, tf.expand_dims(tf.reduce_sum(counts, axis=1), axis=1)), tf.float32)

By using TF-Coder to solve this problem, the mental burden of the exercise is reduced. When TF-Coder produces the solution above, it is guaranteed that the code correctly produces the example output when run on the example input. TF-Coder’s solution will also avoid any unnecessary steps. Thus, you can quickly deduce the answers to most of the questions above: an extra tf.expand_dims step is needed to make the shapes compatible for division, and the result of tf.divide must be cast to tf.float32 (in fact tf.divide returns a tf.float64 tensor when dividing two tf.int32 tensors). In this way, TF-Coder helps you write simple and correct code without painful debugging cycles.
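To see the first two pitfalls concretely, here is a quick check (a sketch, assuming TensorFlow 2.x) contrasting the first attempt with TF-Coder's solution:

import tensorflow as tf

counts = tf.constant([[0, 1, 0, 0],
                      [0, 1, 1, 0],
                      [1, 1, 1, 1]])

row_sums = tf.reduce_sum(counts, axis=1)  # shape (3,), dtype tf.int32

# First attempt: tf.divide(counts, row_sums) fails here because shapes
# (3, 4) and (3,) do not broadcast; even for square inputs, the row sums
# would be broadcast across the wrong axis.

# TF-Coder's solution: expand the row sums to shape (3, 1) so each row is
# divided by its own sum, then cast the tf.float64 result to tf.float32.
normalized = tf.cast(
    tf.divide(counts, tf.expand_dims(row_sums, axis=1)), tf.float32)
print(normalized.numpy())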

Caveats

There are limitations to TF-Coder. It can currently find solutions involving 3-4 operations within a minute of searching, but solutions involving 6 or more operations are too complex to find in a reasonable amount of time. Furthermore, TF-Coder currently does not support complex or string tensors, or RaggedTensors. The full list of supported operations can be found in the Colab notebook.
In addition, TF-Coder only guarantees that its solutions work for the given input-output example. The tool searches for a simple TensorFlow expression that matches the provided input-output example, but sometimes this solution is too simple and doesn’t generalize in the intended way. It can be helpful to make the example as unambiguous as possible, which can often be achieved by adding more numbers to the input and output tensors. Please review TF-Coder’s solutions to ensure that they correctly implement the intended behavior.

Try TF-Coder yourself!

Be sure to give TF-Coder a try! Even experienced TensorFlow users at Google are learning new things with the help of TF-Coder.
You can access the tool using this Colab notebook — no download or installation is required. Follow this tutorial for a detailed walkthrough. You can also take a look at our code and documentation on GitHub and our research paper.

Note: in the Colab tool, we would like to log the problems given to TF-Coder and the resulting solutions, so that we can improve the tool and build a dataset that will accelerate program synthesis research in general, but this data collection is completely optional.

Leveraging online social interactions for enhancing integrity at Facebook

Nima Noorshams, Saurabh Verma, and Aude Hofleitner are Research Scientists at Facebook working within Core Data Science, a research and development team focused on improving Facebook’s processes, infrastructure, and products.

What we did: Sequence modeling for integrity

Billions of people rely on Facebook products and services to connect with family and friends, build new communities, share experiences, and run their businesses. However, the rise of inauthentic accounts and activities as well as disparaging and threatening content on social media has introduced several integrity challenges. Needless to say, maintaining the integrity of such a large and growing network in a fast and scalable manner is of utmost importance for the safety and security of the online community.

Entities on the platform, such as accounts, posts, pages, and groups, are not static. They interact with one another over time, which can reveal a lot about their nature. For instance, fake accounts and misinformation posts elicit different types of reactions from other accounts than do normal/benign accounts and posts (see Figure 1). In the paper “TIES: Temporal Interaction Embeddings for enhancing social media integrity at Facebook,” we focus on the problem of leveraging these interactions in order to enhance the integrity of the platform.

In short, TIES is a deep learning, application-agnostic, scalable framework for embedding sequences of entity interactions. It encodes not only the sequence of actions but also various features of sources and targets of the interactions. The embedding vectors can then be used for various integrity applications, such as detecting fake accounts, identifying misinformation or hate speech, detecting high-risk ad accounts, and many others.

Figure 1, at left: Account-account interaction used to detect fake accounts. At right: Post-account interactions used to identify misinformation.

How we did it: Combining graph representations and sequence learning

Past studies have mainly focused on either static or dynamic behaviors of the networks, but not both at the same time. In contrast, the core of TIES consists of two embeddings:

  1. Graph-based embedding, which captures the static (or slow-changing) information encoded in the large social graph.
  2. Sequence-based embedding, which captures the more dynamic actions.

Prior knowledge, such as friending and group or page memberships, is captured in the social graph. Large-scale embedding algorithms, such as PyTorch-BigGraph, can be used to encode graph information. These graph-based embeddings are then used to initialize the sequence encoder piece of the framework. Figure 2 illustrates the model architecture.

We first convert the sequence of triplets (source, target, action) into feature vectors. These vectors consist of trainable action embeddings, pretrained source and target embeddings (which are produced by PyTorch-BigGraph), as well as miscellaneous features such as time-gap between the actions. The features are then fed into a sequence encoder, which consists of a seq2seq encoding layer, self-attention, and pooling layer. The model parameters are trained by minimizing a loss function over a labeled data set, thus creating supervised embeddings.
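The architecture is described here only at a high level; as a rough illustration (not Facebook's implementation, with hypothetical layer choices and sizes), a TIES-style encoder could be sketched in PyTorch as follows:

import torch
import torch.nn as nn

class TIESStyleEncoder(nn.Module):
    """Illustrative sketch of a TIES-like interaction-sequence encoder."""

    def __init__(self, num_actions, graph_dim, misc_dim, hidden_dim=128):
        super().__init__()
        # Trainable action embeddings; source/target embeddings are assumed
        # to be pretrained (e.g., by PyTorch-BigGraph) and passed in.
        self.action_emb = nn.Embedding(num_actions, hidden_dim)
        in_dim = hidden_dim + 2 * graph_dim + misc_dim
        self.seq_encoder = nn.GRU(in_dim, hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, actions, src_emb, tgt_emb, misc):
        # actions: (B, T) ints; src_emb/tgt_emb: (B, T, graph_dim);
        # misc: (B, T, misc_dim), e.g., time gaps between actions.
        x = torch.cat([self.action_emb(actions), src_emb, tgt_emb, misc], dim=-1)
        h, _ = self.seq_encoder(x)      # seq2seq encoding layer
        h, _ = self.attn(h, h, h)       # self-attention over the sequence
        pooled = h.mean(dim=1)          # pooling layer -> sequence embedding
        return self.classifier(pooled)  # trained on labels (supervised embedding)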

TIES applications at Facebook

We have tested this framework on several applications, including detecting misinformation, detecting fake accounts and engagements, and identifying high-risk ad accounts. Different types of actions and features were used for each application. For instance, in detecting misinformation, we used sequences of user actions on posts, such as likes, comments, shares, and so on.

In all the aforementioned applications, we used a portion of the training samples (up to millions of samples) to train the TIES model and then passed the resulting TIES embedding as an additional feature to baseline models (generally complicated models consisting of several hundred carefully engineered features and/or deep learning frameworks that were already deployed into production). In all instances, we observed uniform and statistically significant gains over existing baselines that can contribute to enhancing the integrity of our platform, and TIES features have since been deployed into production.

The post Leveraging online social interactions for enhancing integrity at Facebook appeared first on Facebook Research.

Sterling Support: SHIELD TV’s 25th Software Upgrade Now Available

With NVIDIA SHIELD TV, there’s always more to love.

Today’s software update — SHIELD Software Experience Upgrade 8.2 — is the 25th for owners of the original SHIELD TV. It’s a remarkable run, spanning more than 5 years since the first SHIELD TVs launched in May 2015.

The latest upgrade brings a host of new features and improvements for daily streamers and media enthusiasts.

Stream On

One of the fan-favorite features for the newest SHIELD TVs is the AI upscaler. It works by training a neural network model on countless images. Deployed on 2019 SHIELD TVs, the AI model can then take low-resolution video and produce incredible sharpness and enhanced details no traditional scaler can recreate. Edges look sharper. Hair looks scruffier. Landscapes pop with striking clarity.

To see the difference between “basic upscaling” and “AI-enhanced upscaling” on SHIELD, click the image below and move the slider left and right.

Today’s upgrade adds more UHD 4K upscaling support, now covering content from 360p to 1440p. And on 2019 SHIELD TV Pros, we added support for 60fps content. Now SHIELD can upscale live sports on HD TV and HD video from YouTube to 4K with AI. In the weeks ahead, following an update to the NVIDIA Games app in September, we’ll add 4K 60fps upscaling to GeForce NOW.

The customizable menu button on the new SHIELD remote is another popular addition to the family. It’s getting two more actions to customize.

In addition to an action assigned to a single press, users can now configure a custom action for double press and long press. With over 25 actions available, the SHIELD remote is now the most customizable remote for streamers. This powerful feature works with all SHIELD TVs and the SHIELD TV app, available on the Google Play Store and iOS App Store.

More to Be Enthusiastic About

We take pride in SHIELD being a streaming media player enthusiasts can be, well, enthusiastic about. With our latest software upgrade, we’re improving our IR and CEC volume control support.

These upgrades include support for digital projectors and allow volume control even when SHIELD isn’t active. They also add IR volume control when using the SHIELD TV app and when you’ve paired your Google Home with SHIELD. The 2019 SHIELD remote adds IR control to change the input source on TVs, AVRs and soundbars.

Additionally, earlier SHIELD generations — both 2015 and 2017 models — now have an option to match the frame rate of displayed content.

We’ve added native SMBv3 support as well, providing faster and more secure connections between PC and SHIELD. SMBv3 now works without requiring a PLEX media server.

With SHIELD, there’s always more to love. Download the latest software upgrade today, and check out the release notes for a complete list of all the new features and improvements.

The post Sterling Support: SHIELD TV’s 25th Software Upgrade Now Available appeared first on The Official NVIDIA Blog.

Safe Travels: Voyage Intros Ambulance-Grade, Self-Cleaning Driverless Vehicle Powered by NVIDIA DRIVE

Self-driving cars continue to amaze passengers as a truly transformative technology. However, in the time of COVID-19, a self-cleaning car may be even more appealing.

Robotaxi startup Voyage introduced its third-generation vehicle, the G3, this week. The autonomous vehicle, a Chrysler Pacifica Hybrid minivan retrofitted with self-driving technology, is the company’s first designed to operate without a driver and is equipped with an ambulance-grade ultraviolet light disinfectant system to keep passengers healthy.

The new vehicles use the NVIDIA DRIVE AGX Pegasus compute platform to enable the startup’s self-driving AI for robust perception and planning. The automotive-grade platform delivers safety to the core of Voyage’s autonomous fleet.

Given the enclosed space and the proximity of the driver and passengers, ride-hailing currently poses a major risk in a COVID-19 world. By implementing a disinfecting system alongside driverless technology, Voyage is ensuring self-driving cars will continue to develop as a safer, more efficient alternative to everyday mobility.

The G3 vehicle uses an ultraviolet-C system from automotive supplier GHSP to destroy pathogens in the vehicle between rides. UV-C works by inactivating a pathogen’s DNA, blocking its reproductive cycle. It’s been proven to be up to 99.9 percent effective and is commonly used to sterilize ambulances and hospital rooms.

The G3 is production-ready and currently testing on public roads in San Jose, Calif., with production vehicles planned to come out next year.

G3 Compute Horsepower Takes Off with DRIVE AGX Pegasus

Voyage has been using the NVIDIA DRIVE AGX platform in its previous-generation vehicles to power its Shield automatic emergency braking system.

With the G3, the startup is unleashing the 320 TOPS of performance from NVIDIA DRIVE AGX Pegasus to process sensor data and run diverse and redundant deep neural networks simultaneously for driverless operation. Voyage’s onboard computers are automotive grade and safety certified, built to handle the harsh vehicle environment for safe daily operation.

NVIDIA DRIVE AGX Pegasus delivers the compute necessary for level 4 and level 5 autonomous driving.

DRIVE AGX Pegasus is built on two NVIDIA Xavier systems-on-a-chip. Xavier is the first SoC built for autonomous machines and was recently determined by global safety agency TÜV SÜD to meet all applicable requirements of ISO 26262. This stringent assessment means it meets the strictest standard for functional safety.

Xavier’s safety architecture combined with the AI compute horsepower of the DRIVE AGX Pegasus platform delivers the robustness and performance necessary for the G3’s fully autonomous capabilities.

Moving Forward as the World Shelters in Place

As the COVID-19 pandemic continues to limit the way people live and work, transportation must adapt to keep the world moving.

In addition to the UV-C lights, Voyage has also equipped the car with HEPA-certified air filters to ensure safe airflow inside the car. The startup uses its own employees to manage and operate the fleet, enacting strict contact tracing and temperature checks to help minimize virus spread.

The Voyage G3 is equipped with a UV-C light system to disinfect the vehicle between rides.

While these measures are in place to specifically protect against the COVID-19 virus, they demonstrate the importance of an autonomous vehicle as a place where passengers can feel safe. No matter the condition of the world, autonomous transportation translates to a worry-free voyage, every time.

The post Safe Travels: Voyage Intros Ambulance-Grade, Self-Cleaning Driverless Vehicle Powered by NVIDIA DRIVE appeared first on The Official NVIDIA Blog.

National Science Foundation announces MIT-led Institute for Artificial Intelligence and Fundamental Interactions

The U.S. National Science Foundation (NSF) announced today an investment of more than $100 million to establish five artificial intelligence (AI) institutes, each receiving roughly $20 million over five years. One of these, the NSF AI Institute for Artificial Intelligence and Fundamental Interactions (IAIFI), will be led by MIT’s Laboratory for Nuclear Science (LNS) and become the intellectual home of more than 25 physics and AI senior researchers at MIT and Harvard, Northeastern, and Tufts universities. 

By merging research in physics and AI, the IAIFI seeks to tackle some of the most challenging problems in physics, including precision calculations of the structure of matter, gravitational-wave detection of merging black holes, and the extraction of new physical laws from noisy data.

“The goal of the IAIFI is to develop the next generation of AI technologies, based on the transformative idea that artificial intelligence can directly incorporate physics intelligence,” says Jesse Thaler, an associate professor of physics at MIT, LNS researcher, and IAIFI director.  “By fusing the ‘deep learning’ revolution with the time-tested strategies of ‘deep thinking’ in physics, we aim to gain a deeper understanding of our universe and of the principles underlying intelligence.”

IAIFI researchers say their approach will enable making groundbreaking physics discoveries, and advance AI more generally, through the development of novel AI approaches that incorporate first principles from fundamental physics.  

“Invoking the simple principle of translational symmetry — which in nature gives rise to conservation of momentum — led to dramatic improvements in image recognition,” says Mike Williams, an associate professor of physics at MIT, LNS researcher, and IAIFI deputy director. “We believe incorporating more complex physics principles will revolutionize how AI is used to study fundamental interactions, while simultaneously advancing the foundations of AI.”

In addition, a core element of the IAIFI mission is to transfer their technologies to the broader AI community.

“Recognizing the critical role of AI, NSF is investing in collaborative research and education hubs, such as the NSF IAIFI anchored at MIT, which will bring together academia, industry, and government to unearth profound discoveries and develop new capabilities,” says NSF Director Sethuraman Panchanathan. “Just as prior NSF investments enabled the breakthroughs that have given rise to today’s AI revolution, the awards being announced today will drive discovery and innovation that will sustain American leadership and competitiveness in AI for decades to come.”

Research in AI and fundamental interactions

Fundamental interactions are described by two pillars of modern physics: at short distances by the Standard Model of particle physics, and at long distances by the Lambda Cold Dark Matter model of Big Bang cosmology. Both models are based on physical first principles such as causality and space-time symmetries.  An abundance of experimental evidence supports these theories, but also exposes where they are incomplete, most pressingly that the Standard Model does not explain the nature of dark matter, which plays an essential role in cosmology.

AI has the potential to help answer these questions and others in physics.

For many physics problems, the governing equations that encode the fundamental physical laws are known. However, undertaking key calculations within these frameworks, as is essential to test our understanding of the universe and guide physics discovery, can be computationally demanding or even intractable. IAIFI researchers are developing AI for such first-principles theory studies, which naturally require AI approaches that rigorously encode physics knowledge. 

“My group is developing new provably exact algorithms for theoretical nuclear physics,” says Phiala Shanahan, an assistant professor of physics and LNS researcher at MIT. “Our first-principles approach turns out to have applications in other areas of science and even in robotics, leading to exciting collaborations with industry partners.”

Incorporating physics principles into AI could also have a major impact on many experimental applications, such as designing AI methods that are more easily verifiable. IAIFI researchers are working to enhance the scientific potential of various facilities, including the Large Hadron Collider (LHC) and the Laser Interferometer Gravitational-Wave Observatory (LIGO). 

“Gravitational-wave detectors are among the most sensitive instruments on Earth, but the computational systems used to operate them are mostly based on technology from the previous century,” says Principal Research Scientist Lisa Barsotti of the MIT Kavli Institute for Astrophysics and Space Research. “We have only begun to scratch the surface of what can be done with AI; just enough to see that the IAIFI will be a game-changer.”

The unique features of these physics applications also offer compelling research opportunities in AI more broadly. For example, physics-informed architectures and hardware development could lead to advances in the speed of AI algorithms, and work in statistical physics is providing a theoretical foundation for understanding AI dynamics. 

“Physics has inspired many time-tested ideas in machine learning: maximizing entropy, Boltzmann machines, and variational inference, to name a few,” says Pulkit Agrawal, an assistant professor of electrical engineering and computer science at MIT, and researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL). “We believe that close interaction between physics and AI researchers will be the catalyst that leads to the next generation of machine learning algorithms.” 

Cultivating early-career talent

AI technologies are advancing rapidly, making it both important and challenging to train junior researchers at the intersection of physics and AI. The IAIFI aims to recruit and train a talented and diverse group of early-career researchers, including at the postdoc level through its IAIFI Fellows Program.  

“By offering our fellows their choice of research problems, and the chance to focus on cutting-edge challenges in physics and AI, we will prepare many talented young scientists to become future leaders in both academia and industry,” says MIT professor of physics Marin Soljacic of the Research Laboratory of Electronics (RLE). 

IAIFI researchers hope these fellows will spark interdisciplinary and multi-investigator collaborations, generate new ideas and approaches, translate physics challenges beyond their native domains, and help develop a common language across disciplines. Applications for the inaugural IAIFI fellows are due in mid-October. 

Another related effort spearheaded by Thaler, Williams, and Alexander Rakhlin, an associate professor of brain and cognitive science at MIT and researcher in the Institute for Data, Systems, and Society (IDSS), is the development of a new interdisciplinary PhD program in physics, statistics, and data science, a collaborative effort between the Department of Physics and the Statistics and Data Science Center.

“Statistics and data science are among the foundational pillars of AI. Physics joining the interdisciplinary doctoral program will bring forth new ideas and areas of exploration, while fostering a new generation of leaders at the intersection of physics, statistics, and AI,” says Rakhlin.  

Education, outreach, and partnerships 

The IAIFI aims to cultivate “human intelligence” by promoting education and outreach. For example, IAIFI members will contribute to establishing a MicroMasters degree program at MIT for students from non-traditional backgrounds.    

“We will increase the number of students in both physics and AI from underrepresented groups by providing fellowships for the MicroMasters program,” says Isaac Chuang, professor of physics and electrical engineering, senior associate dean for digital learning, and RLE researcher at MIT. “We also plan on working with undergraduate MIT Summer Research Program students, to introduce them to the tools of physics and AI research that they might not have access to at their home institutions.”

The IAIFI plans to expand its impact via numerous outreach efforts, including a K-12 program in which students are given data from the LHC and LIGO and tasked with rediscovering the Higgs boson and gravitational waves. 

“After confirming these recent Nobel Prizes, we can ask the students to find tiny artificial signals embedded in the data using AI and fundamental physics principles,” says assistant professor of physics Phil Harris, an LNS researcher at MIT. “With projects like this, we hope to disseminate knowledge about — and enthusiasm for — physics, AI, and their intersection.”

In addition, the IAIFI will collaborate with industry and government to advance the frontiers of both AI and physics, as well as societal sectors that stand to benefit from AI innovation. IAIFI members already have many active collaborations with industry partners, including DeepMind, Microsoft Research, and Amazon. 

“We will tackle two of the greatest mysteries of science: how our universe works and how intelligence works,” says MIT professor of physics Max Tegmark, an MIT Kavli Institute researcher. “Our key strategy is to link them, using physics to improve AI and AI to improve physics. We’re delighted that the NSF is investing the vital seed funding needed to launch this exciting effort.”

Building new connections at MIT and beyond

Leveraging MIT’s culture of collaboration, the IAIFI aims to generate new connections and to strengthen existing ones across MIT and beyond.

Of the 27 current IAIFI senior investigators, 16 are at MIT and members of the LNS, RLE, MIT Kavli Institute, CSAIL, and IDSS. In addition, IAIFI investigators are members of related NSF-supported efforts at MIT, such as the Center for Brains, Minds, and Machines within the McGovern Institute for Brain Research and the MIT-Harvard Center for Ultracold Atoms.  

“We expect a lot of creative synergies as we bring physics and computer science together to study AI,” says Bill Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and researcher in CSAIL. “I’m excited to work with my physics colleagues on topics that bridge these fields.”

More broadly, the IAIFI aims to make Cambridge, Massachusetts, and the surrounding Boston area a hub for collaborative efforts to advance both physics and AI. 

“As we teach in 8.01 and 8.02, part of what makes physics so powerful is that it provides a universal language that can be applied to a wide range of scientific problems,” says Thaler. “Through the IAIFI, we will create a common language that transcends the intellectual borders between physics and AI to facilitate groundbreaking discoveries.”

Explore then Execute: Adapting without Rewards via Factorized Meta-Reinforcement Learning

Activities more fulfilling than chores.

Nobody likes chores — can we build robots to do these chores, such as
cooking, for us? A common paradigm for training agents to perform
various tasks is to train a separate agent on each task, completely from
scratch, with reinforcement learning. However, training a robot to cook
with reinforcement learning from scratch in each person’s home would
completely fail, as it would result in many disasters (e.g., kitchen
fires), would require a lot of supervision from each person to reward
the robot for successfully cooking meals, and would take a long time
(learning even simple tasks from scratch can take reinforcement learning
agents millions of attempts).

Instead, it would be ideal if we could train a robot to be able to
quickly adapt to various home kitchens, after first training in many
kitchens in a robot chef factory. Intuitively, this should be possible
since different tasks and environments share considerable structure
(e.g., cooking pizza in one kitchen is similar to cooking a hamburger in
another kitchen), which can make learning each task easier and more
efficient.

Fortunately, meta-reinforcement learning seeks this exact goal of
training agents to adapt to new tasks from very few interactions on the
new task, after first training on many similar tasks. So, why aren’t
robots cooking in our kitchens today? To answer this question, we’ll
turn our attention to the problem of meta-exploration: how to best
spend these few interactions exploring the new task. For example, in
order to adapt to a new kitchen, a robot chef should ideally spend its
few interactions exploring the new kitchen to find the ingredients,
which will allow it to later cook a meal (solve the task). In this blog
post, we’ll cover and solve two key challenges about
meta-exploration that keep humans in the kitchen.

  • First, we’ll show that existing meta-reinforcement learning
    approaches suffer from a chicken-and-egg coupling problem:
    learning to explore and find the ingredients only helps a robot
    prepare a meal if it already knows how to cook, but the robot can
    only learn to cook if it already knows where the ingredients are.
    We’ll avoid this cyclic dependence of learning to explore and learning to
    execute (solve the task) by proposing an objective to learn them
    independently of each other.

  • Second, we’ll observe that the standard meta-reinforcement learning
    problem setting expects robots to cook the correct meal by
    trial-and-error, without even being told what meal to cook, which
    unnecessarily complicates the meta-exploration problem. To avoid
    this, we propose instruction-based meta-reinforcement learning,
    where the robot receives instructions specifying what meal to
    cook.

Standard Meta-Reinforcement Learning

Standard meta-RL setting.

Before we dive in, let’s review the standard meta-reinforcement learning
(meta-RL) problem statement. In meta-reinforcement learning, an agent
(e.g., a robot chef) trains on many tasks (different recipes) and
environments (different kitchens), and then must accomplish a new task
in a new environment during meta-testing. When presented with a new task
and environment, the agent is allowed to first spend an episode
exploring, gathering any necessary information (e.g., locating the
ingredients), before execution episodes, where the agent must accomplish
the task (e.g., cook a meal).

In more formal language, standard meta-RL considers a family of
problems, where each problem identifies a reward function
(e.g., cook a pizza) and transition dynamics
(e.g., a kitchen).
Using the terminology from Duan et al., 2016, we define a trial to consist of
several episodes in the same problem. The first episode is the exploration
episode, where the agent is allowed to gather information, without needing to
maximize returns. All subsequent episodes are execution episodes, where the
agent must accomplish the task. The goal is to maximize the returns achieved
during the execution episodes of meta-testing trials, after first training on
many trials during meta-training.
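In pseudocode (a sketch with hypothetical interfaces, not evaluation code from any paper), a single trial looks roughly like this:

def run_trial(agent, problem, num_execution_episodes):
    # Exploration episode: the agent gathers information (e.g., locates
    # the ingredients); the return earned here does not count.
    exploration_trajectory, _ = problem.run_episode(policy=agent.explore)

    # Execution episodes: the agent must accomplish the task (e.g., cook
    # a meal), conditioning on what it learned during exploration.
    total_return = 0.0
    for _ in range(num_execution_episodes):
        _, episode_return = problem.run_episode(
            policy=lambda obs: agent.execute(obs, exploration_trajectory))
        total_return += episode_return
    return total_return  # the quantity to maximize during meta-testing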

Decoupled Reward-free Exploration and Execution in Meta-Reinforcement Learning (DREAM)

A chicken-and-egg coupling problem. A common approach (Wang et al.,
2016, Duan et al., 2016) for the meta-exploration problem is to optimize
a recurrent policy that performs both exploration and execution episodes
end-to-end based on the execution episode rewards. The hope is to
capture the information learned during the exploration episode in the
recurrent policy’s hidden state, which will then be useful for
execution episodes. However, this leads to a chicken-and-egg coupling
problem, where learning good exploration behaviors requires already
having learned good execution behaviors and vice-versa, which prevents
such an approach from learning.

For example, if a robot chef fails to discover the locations of
ingredients in a kitchen (bad exploration), then it cannot possibly
learn how to cook (bad execution). On the other hand, if the robot does
not know how to cook (bad execution), then no matter what it does during
the exploration episode, it will still not successfully cook a meal,
making learning exploration challenging. Since robots can’t cook or
explore at the beginning of training, they get stuck in this local
optimum and have a hard time learning either.


The coupling problem. What came first: the chicken (good exploration) or
the egg (good execution)?

Avoiding the coupling problem with DREAM. To avoid the
chicken-and-egg coupling problem, we propose a method to break the
cyclic dependency between learning exploration and learning execution
behaviors, which we call DREAM. Intuitively, good exploration can be
learned by trying to recover the information necessary for executing
instructions. Therefore, at a high level, DREAM consists of two main
steps: 1) simultaneously learn an execution policy independently from
exploration and learn what information is necessary for execution and 2)
learn an exploration policy to recover that information.

To answer the chicken-and-egg problem, DREAM manufactures its own egg,
and out comes the chicken.

More concretely, in the first step, we train an execution policy
conditioned on the problem identifier, which in the cooking example may
either directly identify attributes of the kitchen (e.g., wall color or
ingredient locations), or simply be a unique identifier (e.g., a
one-hot) for each kitchen. This problem identifier (directly or
indirectly) encodes all the information necessary to solve tasks in the
kitchen, allowing the execution policy to learn independently from
exploration, which avoids the coupling problem. At the same time, our
goal in the first step is to identify only the information necessary
for executing instructions, and the problem identifier may also encode
extraneous information, such as the wall color. To remove this, we
apply an information bottleneck to the problem identifier to obtain a
bottlenecked representation, which we use for training an exploration
policy.

In the second step, once we’ve obtained a bottlenecked representation
that ideally contains only the information necessary for executing
instructions, we can train an exploration policy to recover this
information in the exploration episode. To do this, we roll out the
exploration policy to obtain an episode and then reward the policy
based on how well this episode encodes the information contained in the
bottlenecked representation. Roughly, this reward is the mutual
information between the bottlenecked representation and the exploration
episode.
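To make that reward slightly more concrete, here is a rough sketch of one way a mutual-information-style exploration reward could be computed (the networks and the exact form are hypothetical stand-ins; see the paper for DREAM's actual objective):

import torch

def exploration_reward(traj_encoder, decoder, exploration_episode, z):
    # traj_encoder summarizes the exploration episode into a vector, and
    # decoder tries to predict the bottlenecked representation z from it.
    traj_embedding = traj_encoder(exploration_episode)
    z_pred = decoder(traj_embedding)
    # A negative squared error stands in for the mutual-information term:
    # the reward is high when the episode encodes the information in z.
    return -torch.sum((z_pred - z) ** 2, dim=-1)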

DREAM meta-testing.

The problem identifier is easy to provide during meta-training by
simply assigning each problem a unique one-hot, but it is typically
unavailable or unhelpful during meta-testing (e.g., if the new
problem’s one-hot was never seen during training). This might seem
concerning, since, during meta-training, the execution policy
conditions on the problem identifier, which requires knowing it.
However, since the exploration policy is trained to produce exploration
trajectories that contain the same information as the problem
identifier, we can directly swap in the exploration trajectory for the
identifier at meta-test time by rolling out the exploration policy.
See our paper for the details!

Instruction-based Meta-Reinforcement Learning (IMRL)

Improving the standard meta-RL setting. A second meta-exploration
challenge concerns the meta-reinforcement learning setting itself.
While the above standard
meta-RL setting is a useful problem formulation, we observe two areas
that can be made more realistic. First, the standard setting requires
the agent to infer the task (e.g., the meal to cook) from reward
observations, which can be needlessly inefficient. In more realistic
situations, the user would just tell the agent what they want, instead.

Open and honest communication is important for your robots too.

Second, while the standard meta-RL setting leverages shared structure
between different problems (environment and task pairs), it does not
capture shared structure between different tasks in the same
environment. More concretely, the task is fixed across all episodes in a
trial, and in order to perform a new task (e.g., cook a new meal), the
agent requires another exploration episode, even when the underlying
environment (e.g., the kitchen) stays the same. Instead, an agent would
ideally be able to perform many tasks after a single exploration
episode. For example, after exploring the kitchen to find any
ingredients, an ideal robot chef would be able to then cook any meal
involving those ingredients, whereas an agent trained in the standard
meta-reinforcement learning setting would only be able to cook a single meal.

Dinner schedule according to a robot chef trained in the standard
meta-reinforcement learning setting.

These two areas can obscure the meta-exploration problem of how to
optimally spend the exploration episode, as the former requires
unnecessary exploration to infer the task, while the latter only
requires the agent to explore to discover information relevant to a
single task. While intuitively, the agent should spend the exploration
episode gathering useful information for later execution episodes, in
many cases, optimal exploration collapses to simply solving the task.
For example, the agent can only discover that the task is to cook pizza
by successfully cooking pizza and receiving positive rewards, only to do
the same thing again and again on future execution episodes. This can
make the exploration episode nearly useless.

Instruction-based meta-RL (IMRL). To make the meta-RL setting more
realistic, we propose a new setting called instruction-based meta-RL
(IMRL), which addresses the two above areas by (i) providing the agent
with instructions (e.g., “cook pizza” or a one-hot representation)
that specify the task during execution episodes and (ii) varying the
task by providing a different instruction on each execution episode.
Then, for example, after meta-training in different kitchens at a
factory, a robot chef could begin cooking many different meals specified
by a human in a new home kitchen, after a single setup period
(exploration episode).

Instruction-based meta-RL: The task, which changes each execution
episode, is conveyed to the agent via instructions. The environment
still stays the same within a trial.

Reward-free adaptation. In the standard meta-RL setting, the agent
requires reward observations during exploration episodes in order to
infer the task. However, by receiving instructions that specify the task
in IMRL, a further benefit is that the agent no longer requires observing
rewards to adapt to new tasks and environments. Concretely, IMRL enables
reward-free adaptation, where during meta-training, the agent uses reward
observations during execution episodes to learn to solve the task, but does
not observe rewards during exploration episodes. During meta-testing, the
agent never observes any rewards. This enables modeling real-world deployment
situations where gathering reward supervision is really expensive. For
example, a robot chef would ideally be able to adapt to a home kitchen without
any supervision from a human.

Is IMRL general? Importantly, setting the instruction to always be
some “empty” instruction recovers the standard meta-RL setting. In
other words, standard meta-RL is just IMRL, where the user’s desires
are fixed within a trial and the user says nothing for the instructions.
Therefore, algorithms developed for IMRL can also be directly applied to
the standard setting and vice-versa.

Results

Left: the sign reads blue. Right: the sign reads red.

Sparse-reward 3D visual navigation. In one experiment from our
paper, we evaluate DREAM on the sparse-reward 3D visual navigation
problem family proposed by Kamienny et al., 2020 (pictured above), which
we’ve made harder by including a visual sign and more objects. We use
the IMRL setting with reward-free adaptation. During execution episodes,
the agent receives an instruction to go to an object: a ball, block or
key. The agent starts episodes on the far side of the barrier, and must
walk around the barrier to read the sign
(highlighted in yellow),
which, in the two versions of the problem, specifies going to either
the blue or the red version of the object. The agent receives 80×60
RGB images as observations
and can turn left or right, or move forward. Going to the correct object gives
reward +1 and going to the wrong object gives reward -1.

DREAM learns near-optimal exploration and execution behaviors on this
task, which are pictured below. On the left, DREAM spends the
exploration episode walking around the barrier to read the sign, which
says blue. On the right, during an execution
episode, DREAM receives an instruction to go to the key. Since DREAM already
read that the sign said blue during the exploration
episode, it goes to the blue key.

Behaviors learned by DREAM

Exploration.
Execution: go to the key.

Comparisons. Broadly, prior meta-RL approaches fall into two main
groups: (i) end-to-end approaches, where exploration and execution are
optimized end-to-end based on execution rewards, and (ii) decoupled
approaches, where exploration and execution are optimized with separate
objectives. We compare DREAM with state-of-the-art approaches from both
categories. In the end-to-end category, we compare with:

  • RL² [1, 2], the canonical
    end-to-end approach, which learns a recurrent policy conditioned
    on the entire sequence of past state and reward observations.

  • VariBAD [3], which additionally adds auxiliary
    loss functions to the hidden state of the recurrent policy to
    predict the rewards and dynamics of the current problem. This can
    be viewed as learning the belief state [4], a
    sufficient summary of all of its past observations.

  • IMPORT [5], which additionally leverages the
    problem identifier to help learn execution behaviors.

Additionally, in the decoupled category, we compare with:

  • PEARL-UB, an upper bound on PEARL [6]. We
    analytically compute the expected rewards achieved by the optimal
    problem-specific policy that explores with
    Thompson sampling [7] using the true posterior
    distribution over problems.

Quantitative results. Below, we plot the returns achieved by all
approaches. In contrast to DREAM, which achieves near-optimal returns,
we find that the end-to-end approaches never read the sign, and
consequently avoid all objects, for fear of receiving negative reward for
going to the wrong object. This happens even when they are allowed to
observe rewards in the exploration episode (dotted lines). Therefore,
they achieve no rewards, which is indicative of the coupling problem.

On the other hand, while existing approaches in the decoupled category
avoid the coupling problem, optimizing their objectives does not lead to
the optimal exploration policy. For example, Thompson sampling
approaches (PEARL-UB) do not achieve optimal reward, even with the
optimal problem-specific execution policy and access to the true
posterior distribution over problems. To see this, recall that Thompson
sampling explores by sampling a problem from the posterior distribution
and following the execution policy for that problem. Since the optimal
execution policy directly goes to the correct object, and never reads
the sign, Thompson sampling never reads the sign during exploration. In
contrast, a nice property of DREAM is that with enough data and
expressive-enough policy classes, it theoretically learns optimal
exploration and execution.

Training curves with (dotted lines) and without (solid lines) rewards
during exploration. Only DREAM reads the sign and solves the task. And
it does it without needing rewards during exploration!

Additional results. In our paper, we also evaluate DREAM on
additional didactic problems, designed to answer the following
questions:

  • Can DREAM efficiently explore to discover only the information
    required to execute instructions?

  • Can DREAM generalize to unseen instructions and environments?

  • Does DREAM also show improved results in the standard meta-RL
    setting, as well as instruction-based meta-RL?

Broadly, the answer is yes to all of these questions. Check out our
paper for detailed results!

Conclusion

Summary. In this blog post, we tackled the problem of
meta-exploration: how to best gather information in a new environment in
order to perform a task. To do this, we examined and addressed two key
challenges.

  • First, we saw how existing meta-RL approaches that optimize both
    exploration and execution end-to-end to maximize reward fall
    prey to a chicken-and-egg problem. If the agent hasn’t learned to
    explore yet, then it can’t gather key information (e.g., the
    location of ingredients) required for learning to solve tasks
    (e.g., cook a meal). On the other hand, if the agent hasn’t
    learned to solve tasks yet, then there’s no signal for learning to
    explore, as it’ll fail to solve the task no matter what. We
    avoided this problematic cycle by proposing a decoupled
    objective (DREAM), which learns to explore and learns to solve
    tasks independently from each other.

  • Second, we saw how the standard meta-RL setting captures the notion
    of adapting to a new environment and task, but requires the agent
    to unnecessarily explore to infer the task (e.g., what meal to
    cook) and doesn’t leverage the shared structure between different
    tasks in the same environment (e.g., cooking different meals in
    the same kitchen). We addressed this by proposing
    instruction-based meta-RL (IMRL), which provides the agent with an
    instruction that specifies the task and requires the agent to
    explore and gather information useful for many tasks.

DREAM and IMRL combine quite nicely: IMRL enables reward-free adaptation
in principle, and DREAM achieves it in practice. Other state-of-the-art
approaches we tested weren’t able to achieve reward-free adaptation, due
to the chicken-and-egg coupling problem.

What’s next? There’s a lot of room for future work — here are a
few directions to explore.

  • More sophisticated instruction and problem ID representations.
    This work examines the case where the instructions and problem IDs
    are represented as unique one-hots, as a proof of concept. Of
    course, in the real world, instructions and problem IDs might be
    better represented with natural language, or images (e.g., a
    picture of the meal to cook).

  • Applying DREAM to other meta-RL settings. DREAM applies generally
    to any meta-RL setting where some information is conveyed to the
    agent and the rest must be discovered via exploration. In this
    work, we studied two such instances — in IMRL, the instruction
    conveys the task and in the standard meta-RL setting, everything
    must be discovered via exploration — but there are other settings
    worth examining, too. For example, we might want to convey
    information about the environment to the agent, such as the
    locations of some ingredients, or that the left burner is broken,
    so the robot chef should use the right one.

  • Seamlessly integrating exploration and execution. In the most
    commonly studied meta-RL setting, the agent is allowed to first
    gather information via exploration (exploration episode) before
    then solving tasks (execution episodes). This is also the setting
    we study, and it can be pretty realistic. For example, a robot
    chef might require a setup phase, where it first explores a home
    kitchen, before it can start cooking meals. On the other hand, a
    few works, such as Zintgraf et al., 2019, require the agent to
    start solving tasks from the get-go: there are no exploration
    episodes and all episodes are execution episodes. DREAM can
    already operate in this setting, by just ignoring the rewards and
    exploring in the first execution episode, and trying to make up
    for the first execution episode with better performance in later
    execution episodes. This works surprisingly well, but it’d be nice
    to more elegantly integrate exploration and execution.

Acknowledgements

This is work done with my fantastic collaborators
Aditi Raghunathan,
Percy Liang,
and Chelsea Finn.
You can check out our full paper on ArXiv
and our source code on GitHub.
You can also find a short talk on this work here.

Many thanks to Andrey Kurenkov
for comments and edits on this blog post!

The icons used in the above figures were made by Freepik, ThoseIcons,
dDara, mynamepong, Icongeek26, photo3idea_studio and Vitaly Gorbachev
from flaticon.com.

  1. Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016. 

  2. J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016. 

  3. L. Zintgraf, K. Shiarlis, M. Igl, S. Schulze, Y. Gal, K. Hofmann, and S. Whiteson. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. arXiv preprint arXiv:1910.08348, 2019. 

  4. L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998. 

  5. P. Kamienny, M. Pirotta, A. Lazaric, T. Lavril, N. Usunier, and L. Denoyer. Learning adaptive exploration strategies in dynamic environments through informed policy regularization. arXiv preprint arXiv:2005.02934, 2020. 

  6. K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019. 

  7. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3):285–294, 1933. 

An Analysis of Online Datasets Using Dataset Search (Published, in Part, as a Dataset)

Posted by Natasha Noy, Research Scientist; and Omar Benjelloun, Software Engineer, Google Research

There are tens of millions of datasets on the web, with content ranging from sensor data and government records, to results of scientific experiments and business reports. Indeed, there are datasets for almost anything one can imagine, be it diets of emperor penguins or where remote workers live. More than two years ago, we undertook an effort to design a search engine that would provide a single entry point to these millions of datasets and thousands of repositories. The result is Dataset Search, which we launched in beta in 2018 and fully launched in January 2020. In addition to facilitating access to data, Dataset Search reconciles and indexes datasets using the metadata descriptions that come directly from the dataset web pages in schema.org structure.

As of today, the complete Dataset Search corpus contains more than 31 million datasets from more than 4,600 internet domains. About half of these datasets come from .com domains, but .org and governmental domains are also well represented. The graph below shows the growth of the corpus over the last two years, and while we still don’t know what fraction of datasets on the web are currently in Dataset Search, the number continues to grow steadily.

Growth in the number of datasets indexed by Dataset Search

To better understand the breadth and utility of the datasets made available through Dataset Search, we published “Google Dataset Search by the Numbers”, accepted at the 2020 International Semantic Web Conference. Here we provide an overview of the available datasets, present metrics and insights originating from their analysis, and suggest best practices for publishing future scientific datasets. In order to enable other researchers to build analysis and tools using the metadata, we are also making a subset of the data publicly available.

A Range of Dataset Topics
In order to determine the distribution of topics covered by the datasets, we infer the research category based on dataset titles and descriptions, as well as other text on the dataset Web pages. The two most common topics are geosciences and social sciences, which account for roughly 45% of the datasets. Biology is a close third at ~15%, followed by a roughly even distribution for other topics, including computer science, agriculture, and chemistry, among others.

Distribution of dataset topics

In our initial efforts to launch Dataset Search, we reached out to specific communities, which was key to bootstrapping widespread use of the corpus. Initially, we focused on geosciences and social sciences, but since then, we have allowed the corpus to grow organically. We were surprised to see that the fields associated with the communities we reached out to early on are still dominating the corpus. While their early involvement certainly contributes to their prevalence, there may be other factors involved, such as differences in culture across communities. For instance, geosciences have been particularly successful in making their data findable, accessible, interoperable, and reusable (FAIR), a core component to reducing barriers for access.

Making Data Easily Citable and Reusable
There is a growing consensus among researchers across scientific disciplines that it is important to make datasets available, to publish details relevant to their use, and to cite them when they are used. Many funding agencies and academic publishers require proper publication and citation of data.

Peer-reviewed journals such as Nature Scientific Data are dedicated to publishing valuable datasets, and efforts such as DataCite provide digital object identifiers (DOIs) for them. Resolution services (e.g., identifiers.org) also provide persistent, de-referenceable identifiers, allowing for easy citation, which is key to making datasets widely available in scientific discourse. Unfortunately, we found that only about 11% of the datasets in the corpus (or ~3M) have DOIs. We chose this subset from the dataset corpus to be included in our open-source release. From this collection, about 2.3M datasets come from two sites, datacite.org and figshare.com:

Domain                    Datasets with DOIs
figshare.com                          1,301K
datacite.org                          1,070K
narcis.nl                               118K
openaire.eu                             100K
datadiscoverystudio.org                  72K
osti.gov                                 63K
zenodo.org                               50K
researchgate.net                         41K
da-ra.de                                 40K

Publishers can specify access requirements for a dataset via schema.org metadata properties, including details of the license and information indicating whether or not the dataset is accessible for free. Only 34% of datasets specify license information, and when no license is specified, users cannot make any assumptions about whether they are allowed to reuse the data. Thus, adding licensing information, ideally as open a license as possible, will greatly improve the reusability of the data.

Among the datasets that did specify a license, we were able to recognize a known license in 72% of cases. Those licenses include Open Government licenses for the UK and Canada, Creative Commons licenses, and several Public Domain licenses (e.g., Public Domain Mark 1.0). We found that 89.5% of these datasets are accessible for free, use a license that allows redistribution, or both. Of these open datasets, 5.6M (91%) allow commercial reuse.

Another critical component of data reusability is providing downloadable data, yet only 44% of datasets specify download information in their metadata. A possible explanation for this surprisingly low value is that webmasters (or dataset-hosting platforms) fear that exposing the data download link through schema.org metadata may lead search engines or other applications to give their users direct access to download the data, thus “stealing” traffic from their website. Another concern may be that data needs the proper context to be used appropriately (e.g., methodology, footnotes, and license information), and providers feel that only their web pages can give the complete picture. In Dataset Search, we do not show download links as part of dataset metadata so that users must go to the publisher’s website to download the data, where they will see the full context for the dataset.
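To make these percentages concrete, here is a minimal sketch of the kind of per-record check such an analysis relies on: given a parsed schema.org-style metadata record (a plain Python dict, mirroring the JSON-LD example earlier), it reports whether license and download information are missing. The record format and field names are assumptions for illustration; this is not the actual Dataset Search analysis code.

def missing_reusability_fields(record):
    """Report which reusability-related fields a schema.org-style record lacks.

    `record` is assumed to be a parsed JSON-LD Dataset description (a dict);
    this is an illustrative check, not the analysis code behind the paper.
    """
    missing = []
    if not record.get("license"):
        missing.append("license")
    # Download information only helps reuse if some distribution lists a contentUrl.
    distributions = record.get("distribution") or []
    if not any(d.get("contentUrl") for d in distributions):
        missing.append("distribution/contentUrl")
    return missing


# Example: a record that specifies a license but no downloadable distribution.
record = {
    "@type": "Dataset",
    "name": "Example dataset",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
print(missing_reusability_fields(record))  # ['distribution/contentUrl']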

What Do Users Access?
Finally, we examine how Dataset Search is being used. Overall, 2.1M unique datasets from 2.6K domains appeared in the top 100 Dataset Search results over 14 days in May 2020. We find that the distribution of topics being queried is different from that of the corpus as a whole. For instance, geoscience takes up a much smaller fraction, and conversely, biology and medicine represent a larger fraction relative to their share of the corpus. This result is likely explained by the timing of our analysis, as it was performed during the first weeks of the COVID-19 pandemic.

Distribution of topics covered by datasets that appear in search results

Best Practices for Publishing Scientific Datasets
Based on our analysis, we have identified a set of best practices that can improve how datasets are discovered, reused and cited.

  • Discoverability
    To improve discoverability, dataset metadata should be on pages that are accessible to web crawlers and should be provided in machine-readable formats.

  • Persistence
    Publishing metadata on sites that are likely to be more persistent than personal web pages will facilitate data reuse and citation. Indeed, during our analysis of Dataset Search, we noted a very high rate of turnover — many URLs that hosted a dataset one day did not have it a few weeks or months later. Data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are a good way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.

  • Provenance
    With datasets often published in multiple repositories, it would be useful for repositories to describe the provenance information more explicitly in the metadata. The provenance information helps users understand who collected the data, where the primary source of the dataset is, or how it might have changed.

  • Licensing
    Datasets should include licensing information, ideally in a machine-readable format. Our analysis indicates that when dataset providers select a license, they tend to choose a fairly open one. So, encouraging and enabling scientists to choose licenses for their data will result in many more datasets being openly available.

  • Assigning persistent identifiers (such as DOIs)
    DOIs are critical for long-term tracking and usability. Not only do these identifiers allow for much easier citation of datasets and version tracking, but they are also dereferenceable: if a dataset moves, the identifier can point to a different location, as illustrated in the sketch below.
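To make "dereferenceable" concrete, a DOI can be resolved through the doi.org resolver, which redirects to wherever the dataset's landing page currently lives. The sketch below follows that redirect with Python's standard library; the DOI string shown is a placeholder, not a real dataset identifier.

import urllib.request

def resolve_doi(doi):
    """Follow the doi.org redirect and return the URL the DOI currently points to."""
    # doi.org redirects to the landing page registered for the DOI, so the
    # identifier keeps working even if the dataset moves to a new host.
    with urllib.request.urlopen("https://doi.org/" + doi) as response:
        return response.geturl()

# Placeholder DOI for illustration; substitute the DOI of a real dataset.
# print(resolve_doi("10.xxxx/example-dataset"))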

Releasing Metadata for Datasets with Persistent Identifiers
As part of the announcement today, we are also releasing a subset of our corpus for others to use. It contains the metadata for more than three million datasets that have DOIs and other types of persistent identifiers; these are the datasets that are most easily citable. Researchers can use this metadata to perform deeper analyses or to build their own applications. For example, much of the growth of DOI usage appears to have been within the last decade. How does this timeframe relate to the datasets covered in the corpus? Is the DOI usage distribution uniform across datasets, or are there significant differences between research communities?
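As one hypothetical starting point for such an analysis, the sketch below counts DOI-bearing dataset records per creation year in a JSON-lines metadata dump. The file name, the JSON-lines format, and the dateCreated field are all assumptions made for illustration; adapt them to however the released metadata is actually structured.

import json
from collections import Counter

def count_datasets_by_year(path):
    """Count dataset records per creation year in a JSON-lines metadata file.

    The path, file format, and `dateCreated` field are assumptions for
    illustration; they may not match the released corpus.
    """
    counts = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            date = record.get("dateCreated", "")
            if len(date) >= 4 and date[:4].isdigit():
                counts[date[:4]] += 1
    return counts

# Hypothetical usage:
# print(count_datasets_by_year("dataset_search_metadata.jsonl").most_common(10))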

We will update the dataset on a regular basis. Finally, we hope that focusing this data release on datasets with persistent citable identifiers will encourage more data providers to describe their datasets in more detail and to make them more easily citable.

In conclusion, we hope that having data more discoverable through tools such as Google’s Dataset Search will encourage scientists to share their data more broadly and do it in a way that makes data truly FAIR.

Acknowledgments
This post reflects the work of the entire Dataset Search team. We are grateful to Shiyu Chen, Dimitris Paparas, Katrina Sostek, Yale Cong, Marc Najork, and Chris Gorgolewski for their contributions. We would also like to thank Hal Varian for suggesting this analysis and for many helpful ideas.
