How AI is making information more useful

Today, there’s more information accessible at people’s fingertips than at any point in human history. And advances in artificial intelligence will radically transform the way we use that information, uncovering new insights that can help us both in our daily lives and in tackling complex global challenges.

At our Search On livestream event today, we shared how we’re bringing the latest in AI to Google’s products, giving people new, more natural and intuitive ways to search and explore information.

Making multimodal search possible with MUM

Earlier this year at Google I/O, we announced we’ve reached a critical milestone for understanding information with Multitask Unified Model, or MUM for short.

We’ve been experimenting with using MUM’s capabilities to make our products more helpful and enable entirely new ways to search. Today, we’re sharing an early look at what will be possible with MUM. 

In the coming months, we’ll introduce a new way to search visually, with the ability to ask questions about what you see. Here are a couple of examples of what will be possible with MUM.

Animated GIF showing how you can tap on the Lens icon when you’re looking at a picture of a shirt, and ask Google to find you the same pattern — but on another article of clothing, like socks.

With this new capability, you can tap on the Lens icon when you’re looking at a picture of a shirt, and ask Google to find you the same pattern — but on another article of clothing, like socks. This helps when you’re looking for something that might be difficult to describe accurately with words alone. You could type “white floral Victorian socks,” but you might not find the exact pattern you’re looking for. By combining images and text into a single query, we’re making it easier to search visually and express your questions in more natural ways.

Animated GIF showing the point-and-ask mode of searching that can make it easier to find the exact moment in a video that can help you with instructions on fixing your bike.

Some questions are even trickier: Your bike has a broken thingamajig, and you need some guidance on how to fix it. Instead of poring over catalogs of parts and then looking for a tutorial, the point-and-ask mode of searching will make it easier to find the exact moment in a video that can help.

Helping you explore with a redesigned Search page

We’re also announcing how we’re applying AI advances like MUM to redesign Google Search. These new features are the latest steps we’re taking to make searching more natural and intuitive.

First, we’re making it easier to explore and understand new topics with “Things to know.” Let’s say you want to decorate your apartment, and you’re interested in learning more about creating acrylic paintings.

The search results page for the query “acrylic painting” scrolls to a new feature called “Things to know,” which lists out various aspects of the topic like “step by step,” “styles” and “using household items.”

If you search for “acrylic painting,” Google understands how people typically explore this topic, and shows the aspects people are likely to look at first. For example, we can identify more than 350 topics related to acrylic painting, and help you find the right path to take.

We’ll be launching this feature in the coming months. In the future, MUM will unlock deeper insights you might not have known to search for — like “how to make acrylic paintings with household items” — and connect you with content on the web that you wouldn’t have otherwise found.

Two phone screens side by side highlight a set of queries and tappable features that allow you to refine to more specific searches for acrylic painting or broaden to concepts like famous painters.

Second, to help you further explore ideas, we’re making it easy to zoom in and out of a topic with new features to refine and broaden searches. 

In this case, you can learn more about specific techniques, like puddle pouring, or art classes you can take. You can also broaden your search to see other related topics, like other painting methods and famous painters. These features will launch in the coming months.

A scrolling results page for the query “pour painting ideas” that shows results with bold images and video thumbnails.

Third, we’re making it easier to find visual inspiration with a newly designed, browsable results page. If puddle pouring caught your eye, just search for “pour painting ideas” to see a visually rich page full of ideas from across the web, with articles, images, videos and more that you can easily scroll through. 

This new visual results page is designed for searches that are looking for inspiration, like “Halloween decorating ideas” or “indoor vertical garden ideas,” and you can try it today.

Get more from videos

We already use advanced AI systems to identify key moments in videos, like the winning shot in a basketball game, or steps in a recipe. Today, we’re taking this a step further, introducing a new experience that identifies related topics in a video, with links to easily dig deeper and learn more. 

Using MUM, we can even show related topics that aren’t explicitly mentioned in the video, based on our advanced understanding of information in the video. In this example, while the video doesn’t say the words “macaroni penguin’s life story,” our systems understand that topics contained in the video relate to this topic, like how macaroni penguins find their family members and navigate predators. The first version of this feature will roll out in the coming weeks, and we’ll add more visual enhancements in the coming months.

Across all these MUM experiences, we look forward to helping people discover more web pages, videos, images and ideas that they may not have come across or otherwise searched for. 

A more helpful Google

The updates we’re announcing today don’t end with MUM, though. We’re also making it easier to shop from the widest range of merchants, big and small, no matter what you’re looking for. And we’re helping people better evaluate the credibility of information they find online. Plus, for the moments that matter most, we’re finding new ways to help people get access to information and insights. 

All this work not only helps people around the world, but creators, publishers and businesses as well.  Every day, we send visitors to well over 100 million different websites, and every month, Google connects people with more than 120 million businesses that don’t have websites, by enabling phone calls, driving directions and local foot traffic.

As we continue to build more useful products and push the boundaries of what it means to search, we look forward to helping people find the answers they’re looking for, and inspiring more questions along the way.

Optical character recognition with TensorFlow Lite: A new example app

Posted by Wei Wei, TensorFlow Developer Advocate

As the old adage goes, “a picture is worth a thousand words.” Images are rich in visual information, but sometimes the key information lies in the text within them. While it is easy for literate humans to read words embedded in images, how do we use computer vision and machine learning to teach computers to do the same?

Today, we are going to show you how to use TensorFlow Lite to extract text from images on Android devices. We will walk you through the key steps of the Optical Character Recognition (OCR) Android app that we recently open sourced here, which you can refer to for the complete code. You can see how the app extracts the product names from three Google product logos in the animation below.

Optical Character Recognition demo

The process of recognizing text from images is called Optical Character Recognition and is widely used in many domains. For example, Google Maps uses OCR technology to automatically extract information from the geo-located imagery to improve Google Maps.

Generally speaking, OCR is a pipeline with multiple steps, usually consisting of text detection and text recognition:

  • Use a text detection model to find bounding boxes around text;
  • Do some post-processing to transform the bounding boxes;
  • Transform the images within those bounding boxes into grayscale, so that a text recognition model can map out the words and numbers.

In our case, we are going to leverage the text detection and text recognition models from TensorFlow Hub. There are several different model versions for speed / accuracy tradeoffs; we use the float16 quantized models here. For more information on model quantization, please refer to the TensorFlow Lite quantization section. We also use OpenCV, a widely used computer vision library, for Non-Maximum Suppression (NMS) and perspective transformation (we’ll expand on this later) to post-process detection results. In addition, we use the TFLite Support Library to grayscale and normalize the images.
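
For context on what float16 quantization involves, here is a minimal Python sketch of how a SavedModel can be converted to a float16 TFLite model with the standard converter. This is not part of the app itself (the pre-quantized models are simply downloaded from TensorFlow Hub), and "saved_model_dir" is a placeholder path used only for illustration:

import tensorflow as tf

# Convert a SavedModel to a float16-quantized TFLite model.
# "saved_model_dir" is a placeholder path used only for illustration.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)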

OCR pipeline from text detection, perspective transformation, to recognition.

For text detection, since the detection model accepts a fixed size of 320×320, we use the TFLite Support Library to resize and normalize the input image:

val imageProcessor =
    ImageProcessor.Builder()
        .add(ResizeOp(height, width, ResizeOp.ResizeMethod.BILINEAR))
        .add(NormalizeOp(means, stds))
        .build()
var tensorImage = TensorImage(DataType.FLOAT32)

tensorImage.load(bitmapIn)
tensorImage = imageProcessor.process(tensorImage)

Then we use TFLite to run the detection model:

detectionInterpreter.runForMultipleInputsOutputs(detectionInputs, detectionOutputs)

The output of the detection model is a number of rotated bounding boxes that contain the text in the image. We then run Non-Maximum Suppression with OpenCV to keep a single bounding box for each text block:

NMSBoxesRotated(
    boundingBoxesMat,
    detectedConfidencesMat,
    detectionConfidenceThreshold.toFloat(),
    detectionNMSThreshold.toFloat(),
    indicesMat
)

Sometimes texts inside images are distorted (e.g., the ‘kubernetes’ sticker on my laptop) with a perspective angle:

Perspective transformation demo

If we just feed the raw rotated bounding box into the recognition model, the model is unlikely to correctly identify the characters. In this case, we need to use OpenCV to do perspective transformation:

val rotationMatrix = getPerspectiveTransform(srcPtsMat, targetPtsMat)

warpPerspective(
    srcBitmapMat,
    recognitionBitmapMat,
    rotationMatrix,
    Size(recognitionImageWidth.toDouble(), recognitionImageHeight.toDouble())
)

After that, we use the TFLite Support Library again to resize, grayscale, and normalize the transformed images inside the bounding boxes:

val imageProcessor =
    ImageProcessor.Builder()
        .add(ResizeOp(height, width, ResizeOp.ResizeMethod.BILINEAR))
        .add(TransformToGrayscaleOp())
        .add(NormalizeOp(mean, std))
        .build()

Finally, we run the text recognition model, map out the characters and numbers from the model output, and update the app UI:

recognitionInterpreter.run(recognitionTensorImage.buffer, recognitionResult)

var recognizedText = ""
for (k in 0 until recognitionModelOutputSize) {
    var alphabetIndex = recognitionResult.getInt(k * 8)
    if (alphabetIndex in 0..alphabets.length - 1)
        recognizedText = recognizedText + alphabets[alphabetIndex]
}
Log.d("Recognition result:", recognizedText)
if (recognizedText != "") {
    ocrResults.put(recognizedText, getRandomColor())
}

That’s it. We are now able to extract text from input images using TFLite within our app.

Finally, if you just want a ready-to-use OCR SDK, Google also offers on-device OCR functionality through ML Kit, which uses TFLite underneath and should be sufficient for most OCR use cases. There are some situations where you may want to build your own OCR solution with TFLite such as:

  • You have your own text detection / recognition TFLite models that you would like to use;
  • You have special business requirements (e.g. recognizing upside-down text) and need to customize the OCR pipeline;
  • You want to support languages not covered by ML Kit;
  • Your target user devices don’t necessarily have Google Play services installed;
  • You want to have control over hardware backends (CPU / GPU / etc.) used to run your models.

In these cases, I hope that this tutorial and our example implementation can help you get started on building your own OCR functionality in your app.

You can learn more about OCR with the resources below.

Acknowledgements

The author would like to thank Tian Lin for the helpful feedback and community contributors @Tulasi123789 and @risingsayak for their prior work on OCR using TFLite (creating and uploading the models to TF Hub, providing accompanying notebooks, etc.).

Announcing the winners of the 2021 Next-generation Data Infrastructure request for proposals

In April 2021, Facebook launched the Next-generation Data Infrastructure request for proposals (RFP). Today, we’re announcing the winners of this award.
The Facebook Core Data and Data Infra teams were interested in proposals that sought out innovative solutions to the challenges that still remain in the data management community. Areas of interest included, but were not limited to, the following topics:

  • Large-scale query processing
  • Physical layout and IO optimizations
  • Data management and processing at a global scale
  • Converged architectures for data wrangling, machine learning, and analytics
  • Advances in testing and verification for storage and processing systems

Read our Q&A with database researchers Stavros Harizopoulos and Shrikanth Shankar to learn more about database research at Facebook, the goal of this RFP, and the inspiration behind the RFP.

The team reviewed 109 high-quality proposals, and we are pleased to announce the 10 winning proposals and six finalists. Thank you to everyone who took the time to submit a proposal, and congratulations to the winners.

Research award recipients

Holistic optimization for parallel query processing
Paraschos Koutris (University of Wisconsin–Madison)

SCALER – SCalAbLe vEctor pRocessing of SPJG-Queries
Wolfgang Lehner, Dirk Habich (Technische Universität Dresden)

AnyScale transactions in the cloud
Natacha Crooks, Joe Hellerstein (University of California, Berkeley)

Proudi: Predictability on unpredictable data infrastructure
Haryadi S. Gunawi (University of Chicago)

Making irregular partitioning practical
Spyros Blanas (The Ohio State University)

Dynamic join processing pushdown in Presto
Daniel Abadi, Chujun Song (University of Maryland, College Park)

A learned persistent key-value store
Tim Kraska (Massachusetts Institute of Technology)

Building global-scale systems using a flexible consensus substrate
Faisal Nawab (University of California, Irvine)

Runtime-optimized analytics using compilation hints
Anastasia Ailamaki (Swiss Federal Institute of Technology Lausanne)

Flexible scheduling for machine learning data processing close to storage
Ana Klimovic, Damien Aymon (ETH Zurich)

Finalists

Next generation data provenance/data governance
Tim Kraska, Michael Cafarella, Michael Stonebraker (Massachusetts Institute of Technology)

Optimizing commitment latency for geo-distributed transactions
Xiangyao Yu (University of Wisconsin–Madison)

Semantic optimization of recursive queries
Dan Suciu (University of Washington)

Towards a disaggregated database for future data centers
Jianguo Wang (Purdue University)

Unified data systems for structured and unstructured data
Matei Zaharia, Christos Kozyrakis (Stanford University)

Unifying machine learning and analytics under a single data engine
Stratos Idreos (Harvard University)

Summarizing Books with Human Feedback

To safely deploy powerful, general-purpose artificial intelligence in the future, we need to ensure that machine learning models act in accordance with human intentions. This challenge has become known as the alignment problem.

A scalable solution to the alignment problem needs to work on tasks where model outputs are difficult or time-consuming for humans to evaluate. To test scalable alignment techniques, we trained a model to summarize entire books, as shown in the following samples.[1] Our model works by first summarizing small sections of a book, then summarizing those summaries into a higher-level summary, and so on.
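
To make that recursive procedure concrete, here is a minimal Python sketch of the decomposition idea. Here summarize_passage() is a hypothetical stand-in for a call to the fine-tuned model, and the fixed-size chunking is purely illustrative; the actual system’s decomposition is more involved:

def summarize_passage(text):
    """Hypothetical call to the fine-tuned summarization model."""
    raise NotImplementedError

def split_into_sections(text, max_chars=4000):
    """Naive fixed-size chunking, purely for illustration."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize_book(text, max_chars=4000):
    # Base case: the text is short enough to summarize directly.
    if len(text) <= max_chars:
        return summarize_passage(text)
    # Recursive case: summarize each section, then recursively summarize
    # the concatenated section summaries at the next level up.
    section_summaries = [summarize_passage(s) for s in split_into_sections(text, max_chars)]
    return summarize_book("\n".join(section_summaries), max_chars)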

Our best model is fine-tuned from GPT-3 and generates sensible summaries of entire books, sometimes even matching the average quality of human-written summaries: it achieves a 6/7 rating (similar to the average human-written summary) from humans who have read the book 5% of the time and a 5/7 rating 15% of the time. Our model also achieves state-of-the-art results on the BookSum dataset for book-length summarization. A zero-shot question-answering model can use our model’s summaries to obtain competitive results on the NarrativeQA dataset for book-length question answering.[2]

Our Approach: Combining Reinforcement Learning from Human Feedback and Recursive Task Decomposition

Consider the task of summarizing a piece of text. Large pretrained models aren’t very good at summarization. In the past we found that training a model with reinforcement learning from human feedback helped align model summaries with human preferences on short posts and articles. But judging summaries of entire books takes a lot of effort to do directly since a human would need to read the entire book, which takes many hours.

To address this problem, we additionally make use of recursive task decomposition: we procedurally break up a difficult task into easier ones. In this case we break up summarizing a long piece of text into summarizing several shorter pieces. Compared to an end-to-end training procedure, recursive task decomposition has the following advantages:

  1. Decomposition allows humans to evaluate model summaries more quickly by using summaries of smaller parts of the book rather than reading the source text.
  2. It is easier to trace the summary-writing process. For example, you can trace to find where in the original text certain events from the summary happen. See for yourself on our summary explorer!
  3. Our method can be used to summarize books of unbounded length, unrestricted by the context length of the transformer models we use.

Why We Are Working on This

This work is part of our ongoing research into aligning advanced AI systems, which is key to our mission. As we train our models to do increasingly complex tasks, making informed evaluations of the models’ outputs will become increasingly difficult for humans. This makes it harder to detect subtle problems in model outputs that could lead to negative consequences when these models are deployed. Therefore we want our ability to evaluate our models to increase as their capabilities increase.

Our current approach to this problem is to empower humans to evaluate machine learning model outputs using assistance from other models. In this case, to evaluate book summaries we empower humans with individual chapter summaries written by our model, which saves them time when evaluating these summaries relative to reading the source text. Our progress on book summarization is the first large-scale empirical work on scaling alignment techniques.

Going forward, we are researching better ways to assist humans in evaluating model behavior, with the goal of finding techniques that scale to aligning artificial general intelligence.

We’re always looking for more talented people to join us; so if this work interests you, please apply to join our team!


Acknowledgments

We’d like to acknowledge our paper co-authors: Long Ouyang, Daniel Ziegler, Nisan Stiennon, and Paul Christiano.

Thanks to the following for feedback on this release: Steve Dowling, Hannah Wong, Miles Brundage, Gretchen Krueger, Ilya Sutskever, and Sam Altman.


Design
Justin Jay Wang


Footnotes

  1. These samples were selected from works in the public domain, and are part of GPT-3’s pretraining data. To control for this effect, and purely for research purposes, our paper evaluates summaries of books the model has never seen before. ↩︎

  2. We’ve amended our original claim about results on NarrativeQA after being made aware of prior work with better results than ours. ↩︎

OpenAI

Googler Marian Croak is now in the Inventors Hall of Fame

Look around you right now and consider everything that was created by an inventor. The computer you’re reading this article on, the internet necessary to load this article, the electricity that powers the screen, even the coffee maker you used this morning. 

To recognize the incredible contributions of those inventors and the benefits they bring to our everyday life, the National Inventors Hall of Fame has inducted a new group of honorees every year since 1973. In this year’s combined inductee class of 2020/2021, Googler Marian Croak is being honored for her work in advancing VoIP (Voice over Internet Protocol) technology, which powers the online calls and video chats that have helped businesses and families stay connected through the COVID-19 pandemic. She holds more than 200 patents, and was recently honored by the U.S. Patent and Trademark Office.

These days, Marian leads our Research Center for Responsible AI and Human Centered Technology, which is responsible for ensuring Google develops artificial intelligence responsibly and that it has a positive impact. We chatted over Google Meet to find out how plumbers and electricians sparked her interest in science, how her inventions have made life in a pandemic a tiny bit easier for everyone, and what the NIHF honor means to her.

When was the first time you realized you were interested in technology?

I was probably around 5 or 6. I know that we don’t usually think of things like plumbing or electricity as necessarily technology, but they are. I was very enchanted with plumbers and electricians who would come to our house and fix things. They would be dirty and greasy, but I would love the smell, you know? I felt like, Wow, what a miracle worker! I would follow them around, trying to figure out how they’d fix something. I still do that today! 

So when you have electricians come to your house, you’re still like, “Hey, how did you do that?”

There was a leak once, and I was asking the plumber all these questions, and he asked me to quiet down! Because he needed to listen to the invisible flow of water through the pipes to determine the problem. It was amazing to me how similar it was to network engineering!

You’ve had a few different roles at Google and Alphabet so far. How did you move to where you are today?

When I first came to Google, my first role was bringing the Internet to emerging markets. Laying fiber in Africa, building public Wi-Fi in railroad stations in India and then exploring the landscape in countries like Cuba and countries where there wasn’t an openness yet for the Internet. And that was a fascinating job. It was a merger of technology, policy and governmental affairs, combined with an understanding of communities and regions. 

Then I worked on bringing features and technology and Google’s products to the next billion users. And after I did that for a few years, I joined the Site Reliability Engineering organization to help enhance the performance of Google’s complex, integrated systems. Now my current role is leading the Research Center for Responsible AI and Human Centered Technology group. I’m inspired that my work has the potential to positively impact so many of our users. 

Today you’re being inducted into the National Inventors Hall of Fame for your work in advancing VoIP technology. What inspired you to work on VoIP, and can you describe that process of bringing the technology to life?

I have always been motivated by the desire to change the world, and to do that I try to change the world that I’m currently in. What I mean by that is I work on problems that I am aware of, and that I can tackle within the world that surrounds me. So when I began working on VoIP technology, it was at a time in the late ‘90s when there was a lot of change happening involving the internet. Netscape had put a user-friendly web browser in place and there was a lot of new activity beginning to bubble up all over the online world. 

I was part of a team that was also very interested in doing testing and prototyping of voice communications over the internet. There were some existing technologies but they didn’t scale and they were proprietary in nature, so we were thinking of ways we could open it up, make it scalable, make it reliable and be able to support billions of daily calls. We started to work on this but had a lot of doubters telling us that this wouldn’t work, and that no one would ever use this “toy like” technology. And at the time, they were right: It wasn’t working and it wasn’t reliable. But over time we were able to get it to a point where it started working very well. So much so that eventually the senior leaders within AT&T began to adopt the technology for their core network. It was challenging but an exciting thing for me to do because I like to bring change to things, especially when people doubt that it can happen.

What advice would you give to aspiring inventors? 

Most importantly, don’t give up, and during the process of creation, listen to your critics. I received so much criticism and in many ways it was valid. That type of feedback motivated me to improve the technology, and really address a variety of pain points that I hadn’t necessarily thought of. 

What does being inducted into the NIHF mean to you? 

Well it’s humbling, and a great experience. At the time I never thought the work that I was doing was that significant and that it would lead to this, but I’m very grateful for the recognition.

What does it mean to be a part of a class that sees the first two Black women inducted into the NIHF?

I find that it inspires people when they see someone who looks like themselves on some dimension, and I’m proud to offer that type of representation. People also see that I’m just a normal person like themselves and I think that also inspires them to accomplish their goals. I want people to understand that it may be difficult but that they can overcome obstacles and that it will be so worth it.

Break-It-Fix-It: Unsupervised Learning for Fixing Source Code Errors

Machine Learning for Code Repair

Across the board, programming has increased in popularity, ranging from developing with general-purpose programming languages like Python, C, and Java to using simpler languages like HTML, SQL, LaTeX, and Excel formulas. When writing code we often make syntax errors such as typos, unbalanced parentheses, invalid indentations, etc., and need to fix them. In fact, several studies 1 show that both beginner and professional programmers spend 50% of their time fixing code errors during programming. Automating code repair can dramatically enhance programming productivity 2.

Recent works 3 use machine learning models to fix code errors by training the models on human-labeled (broken code, fixed code) pairs. However, collecting this data for even a single programming language is costly, much less the dozens of languages commonly used in practice.

On the other hand, unlabeled (unaligned) data—not aligned as (broken, fixed) pairs—is readily available: for example, raw code snippets on the web like GitHub. An unsupervised approach for training code repair models would make them much more scalable and widely deployable. In our recent work 4 published at ICML 2021, we study how to leverage unlabeled data to learn code fixers effectively.

Problem Setup

In code repair, we are given a critic that assesses the quality of an input: for instance, a compiler or code analyzer that tells us whether the input code has any syntax errors. The code is bad if there is at least one error and good if there are no errors. What we want is a fixer that repairs bad code into good code that satisfies the critic, e.g. repairing a missing parenthesis as in the figure below. Our goal is to use unlabeled data and the critic to learn a fixer.
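
As a concrete illustration, a syntax critic for Python can be built directly on the standard library’s AST parser. This is a minimal sketch in the spirit of the critic used for the GitHub-Python benchmark described below, not the authors’ exact implementation:

import ast

def is_good_code(source):
    """Return True if the snippet parses (good), False if it has a syntax error (bad)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(is_good_code("print('hello')"))   # True: parses cleanly
print(is_good_code("print('hello'"))    # False: missing closing parenthesis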

Challenges
While unlabeled data can be split into a set of good code and a set of bad code using the critic, they are unaligned; in other words, they do not form (broken, fixed) pairs ready to be used for training a fixer.

A straightforward technique 5 is to apply random or heuristic perturbations to good code, such as dropping tokens, and prepare synthetic paired data (perturbed code, good code) to train a fixer. However, such synthetically-generated bad code does not match the distribution of real bad code written by humans. For instance, as the figure below shows, synthetic perturbations (purple box) may drop parentheses arbitrarily from code, generating errors that are rare in real code. In contrast, human-written code (red box) rarely misses parentheses when only a single pair appears, but misses parentheses often in a nested context (e.g., 10x more than non-nested in our Python code dataset collected from GitHub). This distributional mismatch between synthetic data and real data can result in low code repair performance when used in practice. To tackle this challenge, we introduce a new training approach, Break-It-Fix-It (BIFI), that adapts the fixer towards real distributions of bad code.

Approach: Break-It-Fix-It

The basic idea of BIFI is to introduce a machine learning-based breaker that learns to corrupt good code into realistic bad code, and iteratively train both the fixer and the breaker while using them in conjunction to generate more realistic paired data. Concretely, BIFI takes as inputs:

  • Critic
  • Unaligned set of good and bad code
  • Initial fixer, which potentially is trained on synthetic data

BIFI then improves the fixer by performing the following cycle of data generation and training:

  1. Apply the fixer to the set of bad code, which consists of real code errors made by humans, and use the critic to assess if the fixer’s output is good. If good, keep the pair
  2. Train the breaker on the resulting paired data from Step 1. Consequently, the breaker can generate more realistic errors than the initial synthetic data
  3. Apply the breaker to the set of good code, and keep outputs that the critic judges as bad
  4. Train the fixer on the newly-generated paired data in Step 1 and Step 3

These steps are also illustrated in the left panel of the figure below. We iterate over this cycle to improve the fixer and the breaker simultaneously until they have both converged. The intuition is that a better fixer and breaker will be able to generate more realistic paired data, which in turn helps to train a better fixer and breaker.
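
In pseudocode, one round of this cycle might look roughly like the sketch below; fixer, breaker, and critic are placeholders for the two sequence models and the syntax checker, and train() stands in for an ordinary supervised training step:

def bifi_round(fixer, breaker, critic, real_bad_code, good_code):
    # Step 1: run the fixer on real bad code written by humans and keep
    # only the (bad, fixed) pairs whose output the critic judges as good.
    fixed_pairs = [(bad, fixer(bad)) for bad in real_bad_code]
    fixed_pairs = [(bad, fixed) for bad, fixed in fixed_pairs if critic.is_good(fixed)]

    # Step 2: train the breaker on the verified (good -> bad) pairs from Step 1,
    # so it learns to produce realistic, human-like errors.
    breaker.train([(fixed, bad) for bad, fixed in fixed_pairs])

    # Step 3: run the breaker on good code and keep only outputs the critic judges as bad.
    broken_pairs = [(good, breaker(good)) for good in good_code]
    broken_pairs = [(good, broken) for good, broken in broken_pairs if not critic.is_good(broken)]

    # Step 4: train the fixer on paired data from both Step 1 and Step 3.
    fixer.train(fixed_pairs + [(broken, good) for good, broken in broken_pairs])
    return fixer, breaker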

BIFI is related to the backtranslation (cycle-consistency) method in unsupervised translation 6. If we apply backtranslation directly to the code repair task, we would do the following:

  1. Apply the fixer to the set of bad code and generate (noisy) good code
  2. Train the breaker to reconstruct the bad code
  3. Apply the breaker to the set of good code and generate (noisy) bad code
  4. Train the fixer to reconstruct the good code

as illustrated in the right panel of the figure. BIFI improves on backtranslation in two aspects. First, while backtranslation may include non-fixed code as good or non-broken code as bad in Step 1 or 3, BIFI uses the critic to verify if the generated code is actually fixed or broken in Step 1 and 3, as highlighted with pink in the left panel of the figure. This ensures the correctness of training data generated by the breaker and fixer. Second, while backtranslation only uses paired data generated in Step 3 to train the fixer in Step 4, BIFI uses paired data generated in both Step 3 and Step 1, as paired data from Step 1 contains real code errors made by humans. This improves the distributional match of generated training data.

Let’s use our code repair model!

We apply and evaluate our method, BIFI, on two code repair benchmarks:

  • GitHub-Python 7: Fix syntax errors in Python code. Critic is Python AST parser.
  • DeepFix 8: Fix compiler errors in C code. Critic is C compiler.

BIFI improves on existing unsupervised methods for code repair
Using the GitHub-Python dataset, we first compare BIFI with existing unsupervised methods for code repair: a synthetic baseline that uses synthetic paired data generated by randomly dropping, inserting or replacing tokens from good code, and a backtranslation baseline that directly applies backtranslation to code repair. The synthetic baseline serves as the initial fixer for our BIFI algorithm. We find that BIFI improves the repair accuracy by 28% (62%→90%) over the synthetic baseline and by 10% (80%→90%) over the backtranslation baseline, as shown in the left panel of the figure. This result suggests that while we started from a simple initial fixer trained with random perturbations, BIFI can automatically turn it into a usable fixer with high repair accuracy.

For the other dataset, DeepFix, there are several prior works that use heuristic ways to generate synthetic paired data for the task: Gupta+17 9, Hajipour+19 10, DrRepair 11. We take the existing best model, DrRepair, as our initial fixer and apply BIFI. We find that it improves the repair accuracy by 5% (66%→71%), as shown in the right panel of the figure. This result suggests that while the initial fixer DrRepair was already trained with manually designed heuristics, there is still room for improving the adaptation to a more realistic distribution of code errors. BIFI helps to achieve this without additional manual effort.

Examples of breaker outputs
Let’s look at several examples of code generated by the trained breaker. Given the good Python code shown on the left below, we show on the right outputs that the breaker places high probability on. In output 1, the breaker converts raise ValueError(...) into raise ValueError, ..., which is an obsolete usage of raise in Python. In output 2, the breaker drops a closing parenthesis in a nested context. These are both errors commonly seen in human written bad code.

Examples of fixer outputs
Let’s look at how our fixer performs through examples too. The left side of the figure shows human-written Python code with an indentation error—one needs to add an indent to the err = 0 line and remove the indent from the next line. The initial fixer, shown in the center, only inserts one indent token and fails to fix the error. This is most likely due to the mismatch between real errors and synthetic errors used in training: synthetic errors generated by random perturbations do not frequently contain this kind of indentation error, where multiple tokens need to be inserted or removed accordingly. The fixer trained by BIFI, shown on the right, fixes the indentation error by inserting and removing the correct pair of indent tokens. This is a representative example of a case where BIFI successfully fixes code errors while the initial fixer fails.

Finally, one limitation of this work is that we focus on fixing syntactic errors (we use critics such as AST parser and compiler), and we are not evaluating the semantic correctness of our outputs. Extending BIFI to fixing semantic errors is an exciting future research avenue.

Conclusion

Machine learning of source code repair is an important direction to enhance programming productivity, but collecting human-labeled data is costly. In this work, we studied how to learn source code repair in an unsupervised way and developed a new training method, BIFI. The key innovation of BIFI is that it creates realistic paired data for training fixers using only a critic (e.g. a compiler) and unlabeled data (e.g. code snippets on the web), both of which are cheaply available.

More broadly, the idea of learning fixers from critics + unlabeled data is applicable to various repair tasks beyond code repair, such as grammatical error correction 12 and molecule design, using domain-specific critics. Additionally, the idea of using a critic to improve the quality of paired data is applicable to various translation tasks by introducing a learned critic. We hope that BIFI can be an effective solution to unsupervised repair tasks and translation tasks.

You can check out our full paper here and our source code/data on GitHub.

Acknowledgments

This blog post is based on the paper: Break-It-Fix-It: Unsupervised Learning for Program Repair. Michihiro Yasunaga, Percy Liang. ICML 2021.

Many thanks to Percy Liang, as well as members of the Stanford P-Lambda group, SNAP group and NLP group for their valuable feedback. Many thanks to Jacob Schreiber and Sidd Karamcheti for edits on this blog post.

  1. Reversible Debugging Software. Tom Britton, Lisa Jeng, Graham Carver, Paul Cheak, Tomer Katzenellenbogen. 2013. Programmers’ Build Errors: A Case Study (at Google). Hyunmin Seo, Caitlin Sadowski, Sebastian Elbaum, Edward Aftandilian, Robert Bowdidge. 2014. 

  2. Improving programming productivity with machine learning is an extremely active area of research. A prominent example is the Copilot / Codex service recently released by OpenAI and GitHub, which translates natural language (e.g. English) descriptions into code. Automated code repair is another complementary technology to improve programming productivity. 

  3. SEQUENCER: Sequence-to-Sequence Learning for End-to-End Program Repair. Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, Martin Monperrus. 2019. DeepDelta: Learning to Repair Compilation Errors. Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, Eddie Aftandilian. 2019. Patching as Translation: the Data and the Metaphor. Yangruibo Ding, Baishakhi Ray, Premkumar Devanbu, Vincent J. Hellendoorn. 2020. 

  4. Break-It-Fix-It: Unsupervised Learning for Program Repair. Michihiro Yasunaga, Percy Liang. 2021. 

  5. DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. 2017. DeepBugs: A Learning Approach to Name-based Bug Detection. Michael Pradel, Koushik Sen. 2018. Neural program repair by jointly learning to localize and repair. Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, Rishabh Singh. 2019. Global relational models of source code. Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, David Bieber. 2020. 

  6. Improving Neural Machine Translation Models with Monolingual Data. Rico Sennrich, Barry Haddow, Alexandra Birch. 2016. Phrase-Based & Neural Unsupervised Machine Translation. Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc’Aurelio Ranzato. 2018. 

  7. https://github.com/michiyasunaga/BIFI 

  8. DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. 2017. 

  9. DeepFix: Fixing common C language errors by deep learning. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. 2017. 

  10. SampleFix: Learning to Correct Programs by Sampling Diverse Fixes. Hossein Hajipour, Apratim Bhattacharya, Mario Fritz. 2019. 

  11. Graph-based, Self-Supervised Program Repair from Diagnostic Feedback. Michihiro Yasunaga, Percy Liang. 2020. 

  12. LM-Critic: Language Models for Unsupervised Grammatical Error Correction. Michihiro Yasunaga, Jure Leskovec, Percy Liang. 2021. 

TensorFlow Hub’s Experience with Google Summer of Code 2021

Posted by Sayak Paul (MLE at Carted, and GDE) and Morgan Roff (Google)

We’re happy to share the work completed by Google Summer of Code students working with TensorFlow Hub this year. If you’re a student who is interested in writing open source code, then you’ll likely be interested in Google’s Summer of Code program.

Through this program, students propose project ideas to open source organizations, and if selected, receive a stipend to work with them to complete their projects over the summer. Students have the opportunity to learn directly from mentors within their selected organization, and organizations benefit from the students’ contributions. This year, 17 students successfully completed projects with the TensorFlow organization. In this article, we’ll focus on some of the work completed on TensorFlow Hub.

We’re Sayak and Morgan, two mentors for projects on TensorFlow Hub (TF Hub). Here we share what the students learned about building and publishing state-of-the-art models and training them on large-scale benchmark datasets, what we learned as mentors, and how rewarding Summer of Code was for each of us and for the community.

We had the opportunity to mentor two students – Aditya Kane and Vasudev Gupta. Aditya successfully implemented several variants of RegNets including one based on this paper, and trained them on the ImageNet-1k dataset. Vasudev ported the pre-trained wav2vec2 weights from this paper to TensorFlow, which required him to implement the model architecture from scratch. He then demonstrated fine-tuning these pre-trained checkpoints on the LibriSpeech dataset, making his work more customizable and relevant for the community.

With model training happening at such a large scale, it becomes especially important to follow good engineering practices during the implementation. These include code modularization, unit tests, good design patterns, optimizations, and so on. Models were trained on Cloud TPUs to accelerate training time, and as such, substantial effort was put into the data input pipelines to ensure maximum accelerator utilization.

All of these factors collectively contributed to the complexity of the projects. Thanks to the Summer of Code program, students have the opportunity to tackle these challenges with the help of experienced mentors. This also enables students to gain insight into their organizations, and interact with people with many skillsets who cooperate to make large projects possible. A big thank you here to our students, who gracefully handled this engineering work and listened to our feedback.

Vasudev and Aditya contributed significant pre-trained models to TensorFlow Hub, along with tutorials (Wav2Vec, RegNetY) on their use, and TensorFlow implementations for folks who want to dig deeper. In their own words:

The last 2-3 months were full of lots of learning and coding. GSoC helped me get into the speech domain and motivated me to explore more about the TensorFlow ecosystem. I am thankful to my mentors for their continuous & timely feedback. I am looking forward to contributing more to the TensorFlow community and other awesome open source projects out there. – Vasudev Gupta

More about RegNets and Wav2Vec2

Almost 6 years after they were first published, ResNets are still widely used as benchmark architectures across image understanding tasks. Many recent self-supervised and semi-supervised learning frameworks still leverage ResNet50 as their backbone architectures. However, ResNets often do not scale well under larger data regimes and suffer from large training and inference time latencies as they grow. In contrast, RegNets were developed specifically to be a scalable architecture framework that maintains low latency while demonstrating high performance on standard image recognition tasks. Aditya’s models are published on TF Hub, with code and tutorials on GitHub.

Self-supervised learning is an important area of machine learning research. Many recent success stories have been focused on NLP and Computer Vision, and for Vasudev’s project, we wanted to explore speech. Last year, a group of researchers released the wav2vec2 framework for learning representations from audio in a self-supervised manner, benefiting downstream tasks like speech-to-text.

Using wav2vec2, you can now pre-train speech models without labeled data, and fine-tune those models on downstream tasks like speaker recognition. Vasudev’s models are available on TF Hub, along with a new tutorial on fine-tuning, and code on GitHub.
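For a sense of how these checkpoints are consumed, here is a minimal sketch of loading the model from TF Hub and running it on raw audio. The handle and shapes are placeholders; the published tutorial covers the full fine-tuning flow.

    import tensorflow as tf
    import tensorflow_hub as hub

    # Placeholder handle -- check tfhub.dev for the exact wav2vec2 model path.
    WAV2VEC2_HANDLE = "https://tfhub.dev/<publisher>/wav2vec2/1"

    # trainable=True keeps the pre-trained weights updatable for fine-tuning.
    pretrained = hub.KerasLayer(WAV2VEC2_HANDLE, trainable=True)

    # wav2vec2 consumes raw 16 kHz waveforms; shapes here are illustrative.
    speech = tf.random.uniform([1, 16000 * 5])  # ~5 seconds of dummy audio
    representations = pretrained(speech)

    # For speech-to-text, a task-specific head is stacked on top of these
    # representations and the whole model is fine-tuned, as in the tutorial.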

Wrapping up

We’d like to say a heartfelt thank you to all the students, mentors, and organizers who made Summer of Code a success despite this year’s many challenges. We encourage you to check out these models and share what you’ve built with us by tagging #TFHub in your social media posts, or submit your work to the community spotlight program. If you have questions about these new models or want to learn more, you can ask on discuss.tensorflow.org.


How Waze Uses TFX to Scale Production-Ready ML


Posted by Gal Moran, Iris Shmuel, and Daniel Marcous (Data Scientists at Waze)

Waze

Waze is the world’s largest community-based traffic and navigation app. It uses real-time data to help users circumvent literal and figurative bumps in the road. On top of mobile navigation, Waze offers a web platform, a carpool app, partnership services, an advertisement platform and more. Such a broad portfolio brings along diverse technological challenges and many different use cases.

GIF of Waze logo

ML @Waze

Waze relies on many ML solutions, including:

  • Predicting ETA
  • Matching Riders & Drivers (Carpool)
  • Serving The Right Ads

But getting something like this right and “production grade” is not that easy. Projects like these typically require complex surrounding infrastructure to reach production, and therefore multiple engineers (data scientists, software engineers, and software reliability engineers) and a lot of time. Even more so when you mix in Waze-y requirements like large-scale data, low-latency (actually, real-time) inference, diverse use cases, and a whole lot of geospatial data.

This is a big part of why starting to do ML opportunistically created a chaotic state at Waze. For us it manifested as:

  • Multiple ML frameworks – you name it (sklearn, xgboost, TensorFlow, fbprophet, Java PMML, hand-made, etc.)
  • ML & Ops disconnect – models & feature engineering embedded in (Java) backend servers by engineers with limited monitoring and validation capabilities
  • Semi-manual operations for training, validation and deployment
  • A hideously long development cycle from idea to production

Overall, data scientists ended up spending a lot of their time on ops and monitoring instead of focusing on the actual modelling and data processing. At a certain level of growth, we decided to organize the chaos and invest in automation and processes so we could scale faster. We chose to invest heavily in dramatically increasing velocity and quality by adopting a full-cycle data science philosophy. This means that in the new world we wanted to build, a single data scientist is able to close the product cycle from research to a production-grade service.

Data scientists now contribute directly to production to maximize impact. They focus on modelling and data processing and get much of the infrastructure and ops work out of the box. While we are not yet at the end of this journey and haven’t fully realized the above vision, we feel the effort laid out here was crucial in putting us on the right track.

Waze’s ML Stack

Translating the above philosophy to a tech spec, we were set on creating an easy, stable, automated and uniform way of building ML pipelines at Waze.

Deep diving into tech requirements we came up with the below criteria:

  • Simple — to understand, use, operate
  • Managed — no servers, no hardware, just code
  • Customizable — get the simple stuff for free, yet flexible enough to go crazy for the 5% that would require going outside the lines
  • Scalable — auto scalable data processing, training, inference
  • Pythonic — we need something production-ready that works with most of today’s tools and code and fits the standard data scientist. There are practically no options other than Python these days.

For the above reasons we’ve landed on TFX and the power of its built-in components to deliver these capabilities mostly out of the box.

It’s worth saying – Waze runs its tech stack on Google Cloud Platform (GCP).

GCP happens to offer a suite of ML tools called Vertex AI, and that is the ML infrastructure platform Waze is building on top of. While we use many of Vertex AI’s managed services, here we will focus on Vertex Pipelines – a framework for ML pipelines that helps us encapsulate TFX (or any pipeline) complexity and setup.
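To give a feel for what that looks like in practice, here is a simplified sketch (not our production code) of compiling a TFX pipeline with the KubeflowV2DagRunner and submitting it as a Vertex Pipelines run. The project, bucket, and toy one-component pipeline are placeholders.

    from tfx import v1 as tfx
    from google.cloud import aiplatform

    PIPELINE_NAME = "demo-pipeline"                    # placeholder
    PIPELINE_ROOT = "gs://my-bucket/pipeline-root"     # placeholder bucket
    PIPELINE_JSON = "demo-pipeline.json"

    def create_pipeline(pipeline_name, pipeline_root):
        # A toy one-component pipeline, just to have something to compile.
        example_gen = tfx.components.CsvExampleGen(input_base="gs://my-bucket/data")
        return tfx.dsl.Pipeline(
            pipeline_name=pipeline_name,
            pipeline_root=pipeline_root,
            components=[example_gen])

    # 1. Compile the TFX pipeline into a Vertex-compatible job spec.
    runner = tfx.orchestration.experimental.KubeflowV2DagRunner(
        config=tfx.orchestration.experimental.KubeflowV2DagRunnerConfig(),
        output_filename=PIPELINE_JSON)
    runner.run(create_pipeline(PIPELINE_NAME, PIPELINE_ROOT))

    # 2. Submit the compiled spec as a Vertex Pipelines run.
    aiplatform.init(project="my-gcp-project", location="us-central1")
    aiplatform.PipelineJob(
        display_name=PIPELINE_NAME,
        template_path=PIPELINE_JSON,
        pipeline_root=PIPELINE_ROOT,
    ).submit()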

Together with our data tech stack, the overall ML architecture at Waze (all managed, scaled, pythonic etc.) is as follows:

graph of ML architecture at Waze

Careful readers will notice the alleged caveat here – we go all in on TensorFlow.

TFX means TensorFlow (even though that’s not exactly true anymore, let’s assume it is).

It might be a little scary at first when you have many different use cases.

Fortunately, the TF ecosystem is rich, and Waze has the benefit of data large enough that neural nets converge well.

Since starting this, we’ve yet to find a use case that TF doesn’t solve as well as or better than other frameworks (and we’re not talking about micro percentage points; we’re not trying to win a Kaggle competition here, just to get something to production).

Waze TFX

You might think that landing on TFX and Vertex Pipelines solved all our problems, but that’s not exactly true.

In order to make things truly simple, we had to write some “glue code” (integrating the various products in the architecture diagram above) and abstract away enough details so that the average data scientist could use it effectively and fast.

That resulted in:

  • Eliminating boilerplate
  • Hiding all common TFX components, so data scientists focus only on feature engineering and modelling and get the entire pipeline for free
  • Generating a BigQuery-based train/eval split
  • Providing pre-implemented, optional common feature transforms (e.g. scaling, normalization, imputation)
  • Providing pre-implemented Keras models (e.g. a DNN/RNN model, TF Estimator-like but written in Keras and speaking TFX)
  • Utility functions (e.g. TF feature column preparation)
  • A unit testing framework for tf.transform feature engineering code (a sketch of this kind of test follows this list)
  • Orchestrated and scheduled pipeline runs from Airflow, using a Cloud Run instance with all TFX packages installed (without installing them on the Airflow Composer environment)
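As referenced above, here is a minimal, self-contained sketch of what a unit test for tf.transform feature engineering code can look like. It uses a toy preprocessing_fn and the standard tensorflow_transform Beam APIs rather than the waze-data-tfx internals, and the feature names are illustrative.

    import tempfile
    import unittest

    import tensorflow as tf
    import tensorflow_transform as tft
    import tensorflow_transform.beam as tft_beam
    from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

    def toy_preprocessing_fn(inputs):
        # Toy transform: z-score one numeric feature, pass the label through.
        return {
            "numeric_feature1_scaled": tft.scale_to_z_score(inputs["numeric_feature1"]),
            "label": inputs["label"],
        }

    class PreprocessingFnTest(unittest.TestCase):
        def test_scaling_is_zero_mean(self):
            raw_data = [
                {"numeric_feature1": 1.0, "label": 0},
                {"numeric_feature1": 3.0, "label": 1},
            ]
            raw_metadata = dataset_metadata.DatasetMetadata(
                schema_utils.schema_from_feature_spec({
                    "numeric_feature1": tf.io.FixedLenFeature([], tf.float32),
                    "label": tf.io.FixedLenFeature([], tf.int64),
                }))
            with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
                (transformed, _), _ = (
                    (raw_data, raw_metadata)
                    | tft_beam.AnalyzeAndTransformDataset(toy_preprocessing_fn))
            scaled = [row["numeric_feature1_scaled"] for row in transformed]
            # z-scored values of [1.0, 3.0] should sum to (roughly) zero.
            self.assertAlmostEqual(sum(scaled), 0.0, places=3)

    if __name__ == "__main__":
        unittest.main()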

We’ve put it all in an easy-to-use Python package called “waze-data-tfx”.

Pyramid chart showing levels of Waze data tfx

On top of that, we provided our data scientists with a super detailed walkthrough, usage guides, and code templates, so the common DS workflow is: fork, change the config, tweak the code a little, deploy.

For reference, this is what a simple waze-data-tfx pipeline looks like:

  1. Configuration
    _DATASET_NAME = 'tfx_examples'
    _TABLE_NAME = 'simple_template_data'

    _LABEL_KEY = 'label'
    _CATEGORICAL_INT_FEATURES = {
        "categorical_calculated": 2,
    }
    _DENSE_FLOAT_FEATURE_KEYS = ["numeric_feature1", "numeric_feature2"]
    _BUCKET_FEATURES = {
        "numeric_feature1": 5,
    }
    _VOCAB_FEATURES = {
        "categorical_feature": {
            'top_k': 5,
            'num_oov_buckets': 3
        }
    }

    _TRAIN_BATCH_SIZE = 128
    _EVAL_BATCH_SIZE = 128
    _NUM_EPOCHS = 250

    _TRAINING_ARGS = {
        'dnn_hidden_units': [6, 3],
        'optimizer': tf.keras.optimizers.Adam,
        'optimizer_kwargs': {
            'learning_rate': 0.01
        },
        'layer_activation': None,
        'metrics': ["Accuracy"]
    }

    _EVAL_METRIC_SPEC = create_metric_spec([
        mse_metric(upper_bound=25, absolute_change=1),
        accuracy_metric()
    ])
  2. Feature Engineering
    def preprocessing_fn(inputs):
        """tf.transform's callback function for preprocessing inputs.

        Args:
            inputs: map from feature keys to raw, not-yet-transformed features.

        Returns:
            Map from string feature key to transformed feature operations.
        """
        outputs = features_transform(
            inputs=inputs,
            label_key=_LABEL_KEY,
            dense_features=_DENSE_FLOAT_FEATURE_KEYS,
            vocab_features=_VOCAB_FEATURES,
            bucket_features=_BUCKET_FEATURES,
        )
        return outputs
  3. Modelling
    def _build_keras_model(**training_args):
        """Build a Keras model.

        Args:
            training_args: keyword arguments such as `dnn_hidden_units`
                (the layer sizes of the DNN, input layer first).

        Returns:
            A Keras model.
        """
        feature_columns = prepare_feature_columns(
            dense_features=_DENSE_FLOAT_FEATURE_KEYS,
            vocab_features=_VOCAB_FEATURES,
            bucket_features=_BUCKET_FEATURES,
        )

        return _dnn_regressor(
            deep_columns=list(feature_columns.values()),
            dnn_hidden_units=training_args.get("dnn_hidden_units"),
            dense_features=_DENSE_FLOAT_FEATURE_KEYS,
            vocab_features=_VOCAB_FEATURES,
            bucket_features=_BUCKET_FEATURES,
        )
  4. Orchestration
    pipeline_run = WazeTFXPipelineOperator(
        dag=dag,
        task_id='pipeline_run',
        model_name='basic_pipeline_template',
        package=tfx_pipeline_basic,
        pipeline_project_id=EnvConfig.get_value('gcp-project-infra'),
        table_project_id=EnvConfig.get_value('gcp-project-infra'),
        project_utils_filename='utils.py',
        gcp_conn_id=gcp_conn_id,
        enable_pusher=True,
    )

Simple, right?

When you commit a configuration file to the code base, it gets deployed and sets up continuous training and a full-blown pipeline, including all the TFX and Vertex AI magic: data validation, transforms deployed to Dataflow, monitoring, and more.

Summary

We knew we were onto something good when one of our data scientists came back from a long leave and had to use this new framework for a use case. She said she was able to spin up a full production-ready pipeline in hours; before her leave, that would have taken her weeks.

Going forward, there is a lot we want to bake into `waze-data-tfx`. A key advantage of having this common infrastructure is that once a feature is added, everyone can enjoy it “for free”. For example, we plan on adding additional components to the pipeline, such as Infra Validator and Fairness Indicators. Once these are supported, every new or existing ML pipeline will get these components out of the box, with no extra code needed.

Additional improvements we are planning are around deployment. We wish to provide deployment quality assurance while automating as much as possible.

One way we are currently exploring doing so is canary deployments. A data scientist will simply need to configure an evaluation metric, and the framework (using Vertex Prediction traffic-splitting capabilities and other continuous evaluation magic) will test the new model in production and gradually roll it out or roll it back according to the evaluated metrics.
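As a rough sketch of the mechanics, using the Vertex AI Python SDK’s traffic-splitting options (assumed usage, not our actual implementation; the resource names are placeholders):

    from google.cloud import aiplatform

    aiplatform.init(project="my-gcp-project", location="us-central1")

    # Placeholder resource names.
    endpoint = aiplatform.Endpoint("projects/.../locations/.../endpoints/123")
    candidate = aiplatform.Model("projects/.../locations/.../models/456")

    # Canary step: the candidate model receives 10% of live traffic,
    # the models already on the endpoint keep the remaining 90%.
    endpoint.deploy(
        model=candidate,
        deployed_model_display_name="candidate-canary",
        machine_type="n1-standard-4",
        traffic_percentage=10,
    )

    # The framework would then watch the configured evaluation metric on live
    # traffic and either raise the candidate's share of the traffic split to
    # 100% or undeploy it (endpoint.undeploy) so traffic falls back to the
    # previous model.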
