Posted by the TensorFlow Datasets team
The datasets landscape has changed a lot since TensorFlow Datasets (TFDS) was introduced about 4 years ago: TFDS made sharing and re-using datasets significantly easier, and it helped transform that landscape by inspiring other ML tools, libraries and services.
Loading a dataset went from complicated scripts to a single call to tfds.load.
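As a minimal sketch of such a load, assuming the `tensorflow_datasets` package is installed and using MNIST purely as an illustrative dataset choice (the wrapper function `load_mnist_train` is a hypothetical name introduced here, not part of TFDS):

```python
def load_mnist_train():
    """Return the MNIST training split as a tf.data.Dataset (illustrative sketch)."""
    # Imported lazily so that defining this sketch does not itself
    # require TensorFlow Datasets to be installed.
    import tensorflow_datasets as tfds

    # On first use this downloads and prepares the dataset; later calls
    # reuse the prepared files from the local data directory.
    return tfds.load("mnist", split="train", shuffle_files=True)


if __name__ == "__main__":
    ds = load_mnist_train()
    # Each example is a dict of features; inspect the first one.
    for example in ds.take(1):
        print(example["image"].shape, example["label"])
```

The first call needs network access to fetch the data, so the sketch keeps the actual load behind a function rather than running it at import time.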
Read the documentation for a more extensive introduction.
Over the years, TFDS has grown to become a recognized way to load datasets. To celebrate our latest release, 4.8.2, we would like to take some time to reflect on the progress and improvements made over the past years and to thank the community for its support.
TFDS is still a library to facilitate download, preparation and loading of datasets for ML pipelines, but it now supports hundreds of datasets and offers the following main features:
- A large variety of features with encoding and decoding, ranging from text to images, videos, audio and even RL-specific types (e.g. dataset of datasets).
- Large datasets support: TFDS is successfully used within Google to prepare and load large datasets (PBs) using high performance input pipelines.
- Dataset collections, to arbitrarily group together a number of existing TFDS datasets, for example used in a benchmark.
- Support for all main ML Python frameworks: yes, there is “TF” in “TFDS”, but besides TensorFlow, one can use TFDS with Torch, JAX, NumPy, Keras and any other Python ML framework that can consume a tf.data.Dataset or a NumPy iterator.
- Global shuffling at preparation time: it is good practice to shuffle training data, so TFDS can optionally perform a global shuffle at preparation time in case the source data wasn’t already shuffled.
- Splits and slicing: datasets can specify their splits, and readers can specify which split(s), or which slices of splits, they want to read, e.g. test[:10%] to load the first 10% of the test split.
- Versioning and determinism: TFDS datasets and collections are versioned, so it is possible to reproduce experiments reliably. Loading a dataset pinned at a particular version will always return the same set of examples. This works with slicing and global shuffling too, as those are deterministic.
- Code-less sharing: TFDS can read TFDS prepared datasets even if the code used to prepare the dataset is not available. This facilitates sharing and versioning datasets.
- Community datasets and support for internal datasets within organizations: TFDS allows organizations to manage different corpuses of datasets and make them available to their internal users.
- Format-specific builders: to easily define datasets based on well-known formats such as CoNLL.
- GCS integration: TFDS works well with Google Cloud Storage (GCS).
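To make the splits-and-slicing feature in the list above more concrete, here is a small pure-Python sketch of how a spec such as test[:10%] could be resolved into a range of example indices. This is not TFDS's implementation: `resolve_slice` and `_to_index` are hypothetical helpers, the example counts are invented, only non-negative bounds are handled, and truncating percentages to whole examples is an assumption (TFDS may round differently).

```python
import re

# Matches "split" or "split[start:stop]", where each bound is an optional
# non-negative integer, optionally followed by "%".
_SPEC = re.compile(r"^(?P<split>\w+)(?:\[(?P<start>\d+%?)?:(?P<stop>\d+%?)?\])?$")


def _to_index(bound, num_examples, default):
    """Convert one slice bound ("10", "10%" or None) to an absolute index."""
    if bound is None:
        return default
    if bound.endswith("%"):
        # Assumption: percentages are truncated down to whole examples.
        index = int(bound[:-1]) * num_examples // 100
    else:
        index = int(bound)
    # Never point past the end of the split.
    return min(index, num_examples)


def resolve_slice(spec, num_examples):
    """Return (split_name, start, stop) for a spec like "test[:10%]"."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"Unsupported split spec: {spec!r}")
    start = _to_index(match["start"], num_examples, default=0)
    stop = _to_index(match["stop"], num_examples, default=num_examples)
    return match["split"], start, stop


print(resolve_slice("test[:10%]", num_examples=10_000))  # ('test', 0, 1000)
```

Because the mapping from spec to index range is a pure function of the spec and the split size, the same request always selects the same examples, which is the property that makes slicing compatible with reproducible experiments. With TFDS itself, the spec is simply passed as the split argument, e.g. tfds.load("mnist", split="test[:10%]").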
Thank you to all of our contributors and users!
TFDS is under active development to bring you the best datasets to use as input in your ML pipelines.
Notably, we are working on making transformations seamless. Sometimes, a dataset is derived from another dataset by a few transformations (e.g., data augmentation or column renaming). We want those transformations to be as easy to implement as possible. This feature is already available experimentally; don’t hesitate to give feedback on GitHub!
We are also working on making the TensorFlow dependency optional. TFDS aims to be a framework-agnostic library that provides datasets and tools to support machine learning research, so it should not have to rely on any specific machine learning framework.
We have other plans too: smaller ones, such as support for partitioned datasets, and longer-term ones that could durably influence the field. Follow us on GitHub to receive future updates about those upcoming developments!