Posted by Natasha Noy, Research Scientist; and Omar Benjelloun, Software Engineer, Google Research
There are tens of millions of datasets on the web, with content ranging from sensor data and government records, to results of scientific experiments and business reports. Indeed, there are datasets for almost anything one can imagine, be it diets of emperor penguins or where remote workers live. More than two years ago, we undertook an effort to design a search engine that would provide a single entry point to these millions of datasets and thousands of repositories. The result is Dataset Search, which we launched in beta in 2018 and fully launched in January 2020. In addition to facilitating access to data, Dataset Search reconciles and indexes datasets using the metadata descriptions that come directly from the dataset web pages using schema.org structure.
As of today, the complete Dataset Search corpus contains more than 31 million datasets from more than 4,600 internet domains. About half of these datasets come from .com domains, but .org and governmental domains are also well represented. The graph below shows the growth of the corpus over the last two years, and while we still don’t know what fraction of datasets on the web are currently in Dataset Search, the number continues to grow steadily.
|Growth in the number of datasets indexed by Dataset Search|
To better understand the breadth and utility of the datasets made available through Dataset Search, we published “Google Dataset Search by the Numbers”, accepted at the 2020 International Semantic Web Conference. Here we provide an overview of the available datasets, present metrics and insights originating from their analysis, and suggest best practices for publishing future scientific datasets. In order to enable other researchers to build analysis and tools using the metadata, we are also making a subset of the data publicly available.
A Range of Dataset Topics
In order to determine the distribution of topics covered by the datasets, we infer the research category based on dataset titles and descriptions, as well as other text on the dataset Web pages. The two most common topics are geosciences and social sciences, which account for roughly 45% of the datasets. Biology is a close third at ~15%, followed by a roughly even distribution for other topics, including computer science, agriculture, and chemistry, among others.
|Distribution of dataset topics|
In our initial efforts to launch Dataset Search, we reached out to specific communities, which was key to bootstrapping widespread use of the corpus. Initially, we focused on geosciences and social sciences, but since then, we have allowed the corpus to grow organically. We were surprised to see that the fields associated with the communities we reached out to early on are still dominating the corpus. While their early involvement certainly contributes to their prevalence, there may be other factors involved, such as differences in culture across communities. For instance, geosciences have been particularly successful in making their data findable, accessible, interoperable, and reusable (FAIR), a core component to reducing barriers for access.
Making Data Easily Citable and Reusable
There is a growing consensus among researchers across scientific disciplines that it is important to make datasets available, to publish details relevant to their use, and to cite them when they are used. Many funding agencies and academic publishers require proper publication and citation of data.
Peer-reviewed journals such as Nature Scientific Data are dedicated to publishing valuable datasets, and efforts such as DataCite provide digital object identifiers (DOIs) for them. Resolution services (e.g., identifiers.org) also provide persistent, de-referenceable identifiers, allowing for easy citation, which is key to making datasets widely available in scientific discourse. Unfortunately, we found that only about 11% of the datasets in the corpus (or ~3M) have DOIs. We chose this subset from the dataset corpus to be included in our open-source release. From this collection, about 2.3M datasets come from two sites, datacite.org and figshare.com:
|Domain||Datasets with DOIs|
Publishers can specify access requirements for a dataset via schema.org metadata properties, including details of the license and information indicating whether or not the dataset is accessible for free. Only 34% of datasets specify license information, but when no license is specified, users cannot make any assumptions on whether or not they are allowed to reuse the data. Thus, adding licensing information, and, ideally, adding as open a license as possible, will greatly improve the reusability of the data.
Among the datasets that did specify a license, we were able to recognize a known license in 72% of cases. Those licenses include Open Government licenses for the UK and Canada, Creative Commons licenses, and several Public Domain licenses (e.g., Public Domain Mark 1.0). We found 89.5% of these datasets to either be accessible for free or use a license that allows redistribution, or both. And of these open datasets, 5.6M (91%) allow commercial reuse.
Another critical component of data reusability is providing downloadable data, yet only 44% of datasets specify download information in their metadata. A possible explanation for this surprisingly low value is that webmasters (or dataset-hosting platforms) fear that exposing the data download link through schema.org metadata may lead search engines or other applications to give their users direct access to download the data, thus “stealing” traffic from their website. Another concern may be that data needs the proper context to be used appropriately (e.g., methodology, footnotes, and license information), and providers feel that only their web pages can give the complete picture. In Dataset Search, we do not show download links as part of dataset metadata so that users must go to the publisher’s website to download the data, where they will see the full context for the dataset.
What Do Users Access?
Finally, we examine how Dataset Search is being used. Overall, 2.1M unique datasets from 2.6K domains appeared in the top 100 Dataset Search results over 14 days in May 2020. We find that the distribution of topics being queried is different from that of the corpus as a whole. For instance, geoscience takes up a much smaller fraction, and conversely, biology and medicine represent a larger fraction relative to their share of the corpus. This result is likely explained by the timing of our analysis, as it was performed during the first weeks of the COVID-19 pandemic.
|Distribution of topics covered by datasets that appear in search results|
Best Practices for Publishing Scientific Datasets
Based on our analysis, we have identified a set of best practices that can improve how datasets are discovered, reused and cited.
Dataset metadata should be on pages that are accessible to web crawlers and that provide metadata in machine-readable formats in order to improve discoverability.
Publishing metadata on sites that are likely to be more persistent than personal web pages will facilitate data reuse and citation. Indeed, during our analysis of Dataset Search, we noted a very high rate of turnover — many URLs that hosted a dataset one day did not have it a few weeks or months later. Data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are a good way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.
With datasets often published in multiple repositories, it would be useful for repositories to describe the provenance information more explicitly in the metadata. The provenance information helps users understand who collected the data, where the primary source of the dataset is, or how it might have changed.
Datasets should include licensing information, ideally in a machine-readable format. Our analysis indicates that when dataset providers select a license, they tend to choose a fairly open one. So, encouraging and enabling scientists to choose licenses for their data will result in many more datasets being openly available.
Assigning persistent identifiers (such as DOIs)
DOIs are critical for long-term tracking and useability. Not only do these identifiers allow for much easier citation of datasets and version tracking, they are also dereferenceable: if a dataset moves, the identifier can point to a different location.
Releasing Metadata for Datasets with Persistent Identifiers
As part of the announcement today, we are also releasing a subset of our corpus for others to use. It contains the metadata for more than three million datasets that have DOIs and other types of persistent identifiers –- these are the datasets that are the most easily citable. Researchers can use this metadata to perform deeper analysis or to build their own applications using this data. For example, much of the growth of DOI usage appears to have been within the last decade. How does this timeframe relate to the datasets covered in the corpus? Is the DOI usage distribution uniform across datasets, or are there significant differences between research communities?
We will update the dataset on a regular basis. Finally, we hope that focusing this data release on datasets with persistent citable identifiers will encourage more data providers to describe their datasets in more detail and to make them more easily citable.
In conclusion, we hope that having data more discoverable through tools such as Google’s Dataset Search will encourage scientists to share their data more broadly and do it in a way that makes data truly FAIR.
This post reflects the work of the entire Dataset Search team. We are grateful to Shiyu Chen, Dimitris Paparas, Katrina Sostek, Yale Cong, Marc Najork, and Chris Gorgolewski for their contributions. We would also like to thank Hal Varian for suggesting this analysis and for many helpful ideas.