Posted by Billy Lamberta, TensorFlow Team
Jupyter notebooks are an important part of our TensorFlow documentation infrastructure. With the JupyterCon 2020 conference underway, the TensorFlow docs team would like to share some tools we use to manage a large collection of Jupyter notebooks as a first-class documentation format published on tensorflow.org.
As the TensorFlow ecosystem has grown, the TensorFlow documentation has grown into a substantial software project in its own right. We publish ~270 notebook guides and tutorials on tensorflow.org—all tested and available in GitHub. We also publish an additional ~400 translated notebooks for many languages—all tested like their English counterpart. The tooling we’ve developed to work with Jupyter notebooks helps us manage all this content.
When we published our first notebook on tensorflow.org over two years ago for the 2018 TensorFlow Developer Summit, the community response was fantastic. Users love that they can immediately jump from webpage documentation to an interactive computing experience in Google Colab. This setup allows you to run—and experiment with—our guides and tutorials right in the browser, without installing any software on your machine. This tensorflow.org integration with Colab made it much easier to get started and changed how we could teach TensorFlow using Jupyter notebooks. Other machine learning projects soon followed. Notebooks can be loaded directly from GitHub into Google Colab with just the URL:
For compute-intensive tasks, Colab provides TPUs and GPUs at no cost. The TensorFlow documentation, such as this quickstart tutorial, has buttons that link to both its notebook source in GitHub and to load in Colab.
Software documentation is a team effort, and notebooks are an expressive, education-focused format that allows engineers and writers to build up an interactive demonstration. Jupyter notebooks are JSON-formatted files that contain text cells and code cells, typically executed in sequential order from top-to-bottom. They are an excellent way to communicate programming ideas, and, with some discipline, a way to share reproducible results.
On the TensorFlow team, notebooks allow engineers, technical writers, and open source contributors to collaborate on the same document without the tension that exists between a separate code example and its published explanation. We write TensorFlow notebooks so that the documentation is the code—self-contained, easily shared, and tested.
Notebook translations with GitLocalize
Documentation needs to reach everyone around the world—something the TensorFlow team values. The TensorFlow community translation project has grown to 10 languages over the past two years. Translation sprints are a great way to engage with the community on open source documentation projects.
To make TensorFlow documentation accessible to even more developers, we worked with Alconost to add Jupyter notebook support to their GitLocalize translation tool. GitLocalize makes it easy to create translated notebooks and sync documentation updates from the source files. Open source contributors can submit pull requests and provide reviews using the TensorFlow GitLocalize project: gitlocalize.com/tensorflow/docs-l10n.
Jupyter notebook support in GitLocalize not only benefits TensorFlow, but is now available for all open source translation projects that use notebooks with GitHub.
TensorFlow docs notebook tools
Incorporating Jupyter notebooks into our docs infrastructure allows us to run and test all the published guides and tutorials to ensure everything on the site works for a new TensorFlow release—using stable or nightly packages.
Benefits aside, there are challenges with managing Jupyter notebooks as source code. To make pull requests and reviews easier for contributors and project maintainers, we created the TensorFlow docs notebook tools to automate common fixes and communicate issues to contributors with continuous integration (CI) tests. You can install the
tensorflow-docs pip package directly from the tensorflow/docs GitHub repository:
$ python3 -m pip install -U git+https://github.com/tensorflow/docs
While the Jupyter notebook format is straightforward, notebook authoring environments are often inconsistent with JSON formatting or embed their own metadata in the file. These unnecessary changes can cause diff churn in pull requests that make content reviews difficult. The solution is to use an auto-formatter that outputs consistent notebook JSON.
nbfmt is a notebook formatter with a preference for the TensorFlow docs notebook style. It formats the JSON and strips unneeded metadata except for some Colab-specific fields used for our integration. To run:
$ python3 -m tensorflow_docs.tools.nbfmt [options] notebook.ipynb
For TensorFlow docs projects, notebooks saved without output cells are executed and tested; notebooks saved with output cells are published as-is. We prefer to remove outputs to test our notebooks, but
nbfmt can be used with either format.
--test flag is available for continuous integration tests. Instead of updating the notebook, it returns an error if the notebook is not formatted. We use this in a CI test for one of our GitHub Actions workflows. And with some further bot integration, formatting patches can be automatically applied to the contributor’s pull request.
The easiest way to scale reviews is to let the machine do it. Every project has recurring issues that pop up in reviews, and style questions are often best settled with a style guide (TensorFlow likes the Google developer docs style guide). For a large project, the more patterns you can catch and fix automatically, the more time you’ll have available for other goals.
nblint is a notebook linting tool that checks documentation style rules. We use it to catch common style and structural issues in TensorFlow notebooks:
>$ python3 -m tensorflow_docs.tools.nblint [options] notebook.ipynb
Lints are assertions that test specific sections of the notebook. These lints are collected into style modules.
nblint tests the google and tensorflow styles by default, and other style modules can be loaded at the command-line. Some styles require arguments that are also passed at the command-line, for example, setting a different repo when linting the TensorFlow translation notebooks:
$ python3 -m tensorflow_docs.tools.nblint
Lint tests can have an associated fix that makes it easy to update notebooks to pass style checks automatically. Use the
--fix argument to apply lint fixes that overwrite the notebook, for example:
$ python3 -m tensorflow_docs.tools.nblint --fix
TensorFlow is a big fan of Project Jupyter and Jupyter notebooks. Along with Google Colab, notebooks changed how we teach TensorFlow and scale a large open source documentation project with tested guides, tutorials, and translations. We hope that sharing some of the tools will help other open source projects that want to use notebooks as documentation.
Special thanks to Mark Daoust, Wolff Dobson, Yash Katariya, the TensorFlow docs team, and all TensorFlow docs authors, reviewers, contributors, and supporters.