5 steps to go from a notebook to a deployed model

Posted by Nikita Namjoshi, Google Cloud Developer Advocate

When you start working on a new machine learning problem, I’m guessing the first environment you use is a notebook. Maybe you like running Jupyter in a local environment, using a Kaggle Kernel, or my personal favorite, Colab. With tools like these, creating and experimenting with machine learning is becoming increasingly accessible. But while experimentation in notebooks is great, it’s easy to hit a wall when it comes time to elevate your experiments up to production scale. Suddenly, your concerns are more than just getting the highest accuracy score.

What if you have a long running job, want to do distributed training, or host a model for online predictions? Or maybe your use case requires more granular permissions around security and data privacy. What is your data going to look like at serving time, how will you handle code changes, or monitor the performance of your model overtime?

Making production applications or training large models requires additional tooling to help you scale beyond just code in a notebook, and using a cloud service provider can help. But that process can feel a bit daunting. Take a look at the full list of Google Cloud products, and you might be completely unsure where to start.

So to make your journey a little easier, I’ll show you a fast path from experimental notebook code to a deployed model in the cloud.

The code used in this sample can be found here. This notebook trains an image classification model on the TF Flowers dataset. You’ll see how to deploy this model in the cloud and get predictions on a new flower image via a REST endpoint.

Note that you’ll need a Google Cloud project with billing enabled to follow this tutorial. If you’ve never used Google Cloud before, you can follow these instructions to set up a project and get $300 in free credits to experiment with.

Here are the five steps you’ll take:

  1. Create a Vertex AI Workbench managed notebook
  2. Upload .ipynb file
  3. Launch notebook execution
  4. Deploy model
  5. Get predictions

Create a Vertex AI Workbench managed notebook

To train and deploy the model, you’ll use Vertex AI, which is Google Cloud’s managed machine learning platform. Vertex AI contains lots of different products that help you across the entire lifecycle of an ML workflow. You’ll use a few of these products today, starting with Workbench, which is the managed notebook offering.

Under the Vertex AI section of the cloud console, select “Workbench”. Note that if this is the first time you’re using Vertex AI in a project, you’ll be prompted to enable the Vertex API and the Notebooks API. So be sure to click the button in the UI to do so.

Next, select MANAGED NOTEBOOKS, and then NEW NOTEBOOK.

Under Advanced Settings you can customize your notebook by specifying the machine type and location, adding GPUs, providing custom containers, and enabling terminal access. For now, keep the default settings and just provide a name for your notebook. Then click CREATE.

You’ll know your notebook is ready when you see the OPEN JUPYTERLAB text turn blue. The first time you open the notebook, you’ll be prompted to authenticate and you can follow the steps in the UI to do so.

When you open the JupyterLab instance, you’ll see a few different notebook options. Vertex AI Workbench provides different kernels (TensorFlow, R, XGBoost, etc), which are managed environments preinstalled with common libraries for data science. If you need to add additional libraries to a kernel, you can use pip install from a notebook cell, just like you would in Colab.

Step one is complete! You’ve created your managed JupyterLab environment.

Upload .ipynb file

Now it’s time to get our TensorFlow code into Google Cloud. If you’ve been working in a different environment (Colab, local, etc), you can upload any code artifacts you need to your Vertex AI Workbench managed notebook, and you can even integrate with GitHub. In the future, you can do all of your development right in Workbench, but for now let’s assume you’ve been using Colab.

Colab notebooks can be exported as .ipynb files.

You can upload the file to Workbench by clicking the “upload files” icon.

When you open the notebook in Workbench, you’ll be prompted to select the kernel, which is the environment where your notebook is run. There are a few different kernels you can choose from, but since this code sample uses TensorFlow, you’ll want to select the TensorFlow 2 kernel.

After you select the kernel, any cells you execute in your notebook will run in this managed TensorFlow environment. For example, if you execute the import cell, you’ll see that you can import TensorFlow, TensorFlow Datasets, and NumPy. This is because all of these libraries are included in the Vertex AI Workbench TensorFlow 2 kernel. Unsurprisingly, if you try to execute that same notebook cell in the XGBoost kernel, you’ll see an error message since TensorFlow is not installed there.

Launch a notebook execution

While we could run the rest of the notebook cells manually, for models that take a long time to train, a notebook isn’t always the most convenient option. And if you’re building an application with ML, it’s unlikely that you’ll only need to train your model once. Over time, you’ll want to retrain your model to make sure it stays fresh and keeps producing valuable results.

Manually executing the cells of your notebook might be the right option when you’re getting started with a new machine learning problem. But when you want to automate experimentation at a large scale, or retrain models for a production application, a managed ML training option will make things much easier.

The quickest way to launch a training job is through the notebook execution feature, which will run the notebook cell by cell on the Vertex AI managed training service.

When you launch the training job, it’s going to run on a machine you won’t have access to after the job completes. So you don’t want to save the TensorFlow model artifacts to a local path. Instead, you’ll want to save to Cloud Storage, which is Google Cloud’s object storage, meaning you can store images, csv files, txt files, saved model artifacts. Just about anything.

Cloud storage has the concept of a “bucket” which is what holds your data. You can create them via the UI. Everything you store in Cloud Storage must be contained in a bucket. And within a bucket, you can create folders to organize your data.

Each file in Cloud Storage has a path, just like a file on your local filesystem. Except that Cloud Storage paths always start with gs://

You’ll want to update your training code so that you’re saving to a Cloud Storage bucket instead of a local path.

For example, here I’ve updated the last cell of the notebook from model.save('model_ouput").Instead of saving locally, I’m now saving the artifacts to a bucket called nikita-flower-demo-bucket that I’ve created in my project.

Now we’re ready to launch the execution.

Select the Execute button, give your execution a name, then add a GPU. Under Environment, select the TensorFlow 2.7 GPU image. This container comes preinstalled with TensorFlow and many other data science libraries.

Then click SUBMIT.

You can track the status of your training job in the EXECUTIONS tab. The notebook and the output of each cell will be visible under VIEW RESULT when the job finishes and is stored in a GCS bucket. This means you can always tie a model run back to the code that was executed.

When the training completes you’ll be able to see the TensorFlow saved model artifacts in your bucket.

Deploy to endpoint

Now you know how to quickly launch serverless training jobs on Google Cloud. But ML is not just about training. What’s the point of all this effort if we don’t actually use the model to do something, right?

Just like with training, we could execute predictions directly from our notebook by calling model.predict. But when we want to get predictions for lots of data, or get low latency predictions on the fly, we’re going to need something more powerful than a notebook.

Back in your Vertex AI Workbench managed notebook, you can paste the code below in a cell, which will use the Vertex AI Python SDK to deploy the model you just trained to the Vertex AI Prediction service. Deploying the model to an endpoint associates the saved model artifacts with physical resources for low latency predictions.

First, import the Vertex AI Python SDK.

from google.cloud import aiplatform

Then, upload your model to the Vertex AI Model Registry. You’ll need to give your model a name, and provide a serving container image, which is the environment where your predictions will run. Vertex AI provides pre-built containers for serving, and in this example we’re using the TensorFlow 2.8 image.

You’ll also need to replace artifact_uri with the path to the bucket where you stored your saved model artifacts. For me, that was “nikita-flower-demo-bucket”. You’ll also need to replace project with your project ID.

my_model = aiplatform.Model.upload(display_name='flower-model',
artifact_uri='gs://{YOUR_BUCKET}',
serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest',
project={YOUR_PROJECT})

Then deploy the model to an endpoint. I’m using default values for now, but if you’d like to learn more about traffic splitting, and autoscaling, be sure to check out the docs. Note that if your use case does not require low latency predictions, you don’t need to deploy the model to an endpoint and can use the batch prediction feature instead.

endpoint = my_model.deploy(
deployed_model_display_name='my-endpoint',
traffic_split={"0": 100},
machine_type="n1-standard-4",
accelerator_count=0,
min_replica_count=1,
max_replica_count=1,
)

Once the deployment has completed, you can see your model and endpoint in the console

Get predictions

Now that this model is deployed to an endpoint, you can hit it like any other REST endpoint. This means you can integrate your model and get predictions into a downstream application.

For now, let’s just test it out directly within Workbench.

First, open a new TensorFlow notebook.

In the notebook, import the Vertex AI Python SDK.

from google.cloud import aiplatform

Then, create your endpoint, replacing project_number and endpoint_id.

endpoint = aiplatform.Endpoint(
endpoint_name="projects/{project_number}/locations/us-central1/endpoints/{endpoint_id}")

You can find your endpoint_id in the Endpoints section of the cloud Console.

You can find your Project Number on the home page of the console. Note that this is different from the Project ID.

When you send a request to an online prediction server, the request is received by an HTTP server. The HTTP server extracts the prediction request from the HTTP request content body. The extracted prediction request is forwarded to the serving function. The basic format for online prediction is a list of data instances. These can be either plain lists of values or members of a JSON object, depending on how you configured your inputs in your training application.

To test the endpoint, I first uploaded an image of a flower to my workbench instance.

The code below opens and resizes the image with PIL, and converts it into a numpy array.

import numpy as np
from PIL import Image

IMAGE_PATH = 'test_image.jpg'

im = Image.open(IMAGE_PATH)
im = im.resize((150, 150))

Then, we convert our numpy data to type float32 and to a list. We convert to a list because numpy data is not JSON serializable so we can’t send it in the body of our request. Note that we don’t need to scale the data by 255 because that step was included as part of our model architecture using tf.keras.layers.Rescaling(1./255). To avoid having to resizing our image, we could have added tf.keras.layers.Resizing to our model, instead of making it part of the tf.data pipeline.

# convert to float32 list
x_test = [np.asarray(im).astype(np.float32).tolist()]

Then, we call call predict

endpoint.predict(instances=x_test).predictions

The result you get is the output of the model, which is a softmax layer with 5 units. Looks like class at index 2 (tulips) scored the highest.

[[0.0, 0.0, 1.0, 0.0, 0.0]]

Tip: to save costs, be sure to undeploy your endpoint if you’re not planning to use it! You can undeploy by going to the Endpoints section of the console, selecting the endpoint and then the Undeploy model form endpoint option. You can always redeploy in the future if needed.

For more realistic examples, you’ll probably want to directly send the image itself to the endpoint, instead of loading it in NumPy first. If you’d like to see an example, check out this notebook.

What’s Next

You now know how to get from notebook experimentation to deployment in the cloud. With this framework in mind, I hope you start thinking about how you can build new ML applications with notebooks and Vertex AI.

If you’re interested in learning even more about how to use Google Cloud to get your TensorFlow models into production, be sure to register for the upcoming Google Cloud Applied ML Summit. This virtual event is scheduled for 9th June and brings together the world’s leading professional machine learning engineers and data scientists. Connect with other ML engineers and data scientists and discover new ways to speed up experimentation, quickly get into production, scale and manage models, and automate pipelines to deliver impact. Reserve your seat today!

Read More

Early sound exposure in the womb shapes the auditory system

Inside the womb, fetuses can begin to hear some sounds around 20 weeks of gestation. However, the input they are exposed to is limited to low-frequency sounds because of the muffling effect of the amniotic fluid and surrounding tissues.

A new MIT-led study suggests that this degraded sensory input is beneficial, and perhaps necessary, for auditory development. Using simple computer models of the human auditory processing, the researchers showed that initially limiting input to low-frequency sounds as the models learned to perform certain tasks actually improved their performance.

Along with an earlier study from the same team, which showed that early exposure to blurry faces improves computer models’ subsequent generalization ability to recognize faces, the findings suggest that receiving low-quality sensory input may be key to some aspects of brain development.

“Instead of thinking of the poor quality of the input as a limitation that biology is imposing on us, this work takes the standpoint that perhaps nature is being clever and giving us the right kind of impetus to develop the mechanisms that later prove to be very beneficial when we are asked to deal with challenging recognition tasks,” says Pawan Sinha, a professor of vision and computational neuroscience in MIT’s Department of Brain and Cognitive Sciences, who led the research team.

In the new study, the researchers showed that exposing a computational model of the human auditory system to a full range of frequencies from the beginning led to worse generalization performance on tasks that require absorbing information over longer periods of time — for example, identifying emotions from a voice clip. From the applied perspective, the findings suggest that babies born prematurely may benefit from being exposed to lower-frequency sounds rather than the full spectrum of frequencies that they now hear in neonatal intensive care units, the researchers say.

Marin Vogelsang and Lukas Vogelsang, currently both students at EPFL Lausanne, are the lead authors of the study, which appears in the journal Developmental Science. Sidney Diamond, a retired neurologist and now an MIT research affiliate, is also an author of the paper.

Low-quality input

Several years ago, Sinha and his colleagues became interested in studying how low-quality sensory input affects the brain’s subsequent development. This question arose in part after the researchers had the opportunity to meet and study a young boy who had been born with cataracts that were not removed until he was four years old.

This boy, who was born in China, was later adopted by an American family and referred to Sinha’s lab at the age of 10. Studies revealed that his vision was nearly normal, with one notable exception: He performed very poorly in recognizing faces. Other studies of children born blind have also revealed deficits in face recognition after their sight was restored.

The researchers hypothesized that this impairment might be a result of missing out on some of the low-quality visual input that babies and young children normally receive. When babies are born, their visual acuity is very poor — around 20/800, 1/40 the strength of normal 20/20 vision. This is in part because of the lower packing density of photoreceptors in the newborn retina. As the baby grows, the receptors become more densely packed and visual acuity improves.

“The theory we proposed was that this initial period of blurry or degraded vision was very important. Because everything is so blurry, the brain needs to integrate over larger areas of the visual field,” Sinha says.

To explore this theory, the researchers used a type of computational model of vision known as a convolutional neural network. They trained the model to recognize faces, giving it either blurry input followed later by clear input, or clear input from the beginning. They found that the models that received fuzzy input early on showed superior generalization performance on facial recognition tasks. Additionally, the neural networks’ receptive fields — the size of the visual area that they cover — were larger than the receptive fields in models trained on the clear input from the beginning.

After that study was published in 2018, the researchers wanted to explore whether this phenomenon could also be seen in other types of sensory systems. For audition, the timeline of development is slightly different, as full-term babies are born with nearly normal hearing across the sound spectrum. However, during the prenatal period, while the auditory system is still developing, babies are exposed to degraded sound quality in the womb.

To examine the effects of that degraded input, the researchers trained a computational model of human audition to perform a task that requires integrating information over long time periods — identifying emotion from a voice clip. As the models learned the task, the researchers fed them one of four different types of auditory input: low frequency only, full frequency only, low frequency followed by full frequency, and full frequency followed by low frequency.

Low frequency followed by full frequency most closely mimics what developing infants are exposed to, and the researchers found that the computer models exposed to that scenario exhibited the most generalized performance profile on the emotion recognition task. Those models also generated larger temporal receptive fields, meaning that they were able to analyze sounds occurring over a longer time period.

This suggests, just like the vision study, that degraded input early in development actually promotes better sensory integration abilities later in life.

“It supports the idea that starting with very limited information, and then getting better and better over time might actually be a feature of the system rather than being a bug,” Lukas Vogelsang says.

Effects of premature birth

Previous research done by other labs has found that babies born prematurely do show impairments in processing low-frequency sounds. They perform worse than full-term babies on tests of emotion classification, later in life. The MIT team’s computational findings suggest that these impairments may be the result of missing out on some of the low-quality sensory input they would normally receive in the womb.

“If you provide full-frequency input right from the get-go, then you are taking away the impetus on the part of the brain to try to discover long range or extended temporal structure. It can get by with just local temporal structure,” Sinha says. “Presumably that is what immediate immersion in full-frequency soundscapes does to the brain of a prematurely born child.”

The researchers suggest that for babies born prematurely, it could be beneficial to expose them to primarily low-frequency sounds after birth, to mimic the womb-like conditions they’re missing out on.

The research team is now exploring other areas in which this kind of degraded input may be beneficial to brain development. These include aspects of vision, such as color perception, as well as qualitatively different domains such as linguistic development.

“We have been surprised by how consistent the narrative and the hypothesis of the experimental results are, to this idea of initial degradations being adaptive for developmental purposes,” Sinha says. “I feel that this work illustrates the gratifying surprises science offers us. We did not expect that the ideas which germinated from our work with congenitally blind children would have much bearing on our thinking about audition. But, in fact, there appears to be a beautiful conceptual commonality between the two domains. And, maybe that common thread goes even beyond these two sensory modalities. There are clearly a host of exciting research questions ahead of us.”

The research was funded by the National Institutes of Health.

Read More

Stanford AI Lab Papers and Talks at ACL 2022

The 60th Annual Meeting of the Association for Computational Linguistics (ACL) 2022 is taking place May 22nd – May 27th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!

List of Accepted Papers

LinkBERT: Pretraining Language Models with Document Links


Authors: Michihiro Yasunaga, Jure Leskovec*, Percy Liang*

Contact: myasu@cs.stanford.edu

Links: Paper | Website

Keywords: language model, pretraining, knowledge, hyperlink, bionlp

When classifying grammatical role, BERT doesn’t care about word order… except when it matters


Authors: Isabel Papadimitriou, Richard Futrell, Kyle Mahowald

Contact: isabelvp@stanford.edu

Links: Paper

Keywords: large language models, analysis, word order, order invariance, grammatical role, syntax, semantics

Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words


Authors: Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, Dan Jurafsky

Contact: katezhou@stanford.edu

Keywords: cosine similarity, training data frequency, model analysis

Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization


Authors: Faisal Ladhak, Esin Durmus, He He, Claire Cardie, Kathleen McKeown

Contact: esdurmus@stanford.edu

Links: Paper

Keywords: text summarization, text generation, evaluation, faithfulness

Spurious Correlations in Reference-Free Evaluation of Text Generation


Authors: Esin Durmus, Faisal Ladhak, Tatsunori Hashimoto

Contact: esdurmus@stanford.edu

Links: Paper

Keywords: text summarization, text generation, dialogue generation, evaluation, metrics,

TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval


Authors: Megan Leszczynski, Daniel Y. Fu, Mayee F. Chen, Christopher Ré

Contact: mleszczy@stanford.edu

Links: Paper | Blog Post | Website

Keywords: entity retrieval, contrastive learning, bi-encoders

A Few-Shot Semantic Parser for Wizard-of-Oz Dialogues with the Precise ThingTalk Representation


Authors: Giovanni Campagna, Sina J. Semnani, Ryan Kearns, Lucas Jun Koba Sato, Silei Xu, Monica S. Lam

Contact: gcampagn@cs.stanford.edu

Venue: Findings of ACL

Links: Paper | Website

Keywords: dialogue agents, task-oriented dialogues, data synthesis

Richer Countries and Richer Representations


Authors: Kaitlyn Zhou, Kawin Ethayarajh, Dan Jurafsky

Contact: katezhou@stanford.edu

Venue: Findings of ACL

Keywords: representational harms, model analysis, geographic entities

Modular Domain Adaptation


Authors: Junshen K. Chen, Dallas Card, Dan Jurafsky

Contact: dalc@umich.edu

Venue: Findings of ACL

Links: Paper | Blog Post | Website

Keywords: domain adaptation, computational social science, text classification, lexicons, sentiment

Shared Autonomy for Robotic Manipulation with Language Corrections


Authors: Siddharth Karamcheti*, Raj Palleti*, Yuchen Cui, Percy Liang, Dorsa Sadigh

Contact: skaramcheti@cs.stanford.edu

Venue: ACL LNLS workshop

Links: Paper

Keywords: human-robot interaction, online language corrections, language supervision


Read More

Image-Text Pre-training with Contrastive Captioners

Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large scale models.

In computer vision, pioneering work has shown the effectiveness of single-encoder models pre-trained for image classification to capture generic visual representations that are effective for other downstream tasks. More recently, contrastive dual-encoder (CLIP, ALIGN, Florence) and generative encoder-decoder (SimVLM) approaches trained using web-scale noisy image-text pairs have been explored. Dual-encoder models exhibit remarkable zero-shot image classification capabilities but are less effective for joint vision-language understanding. On the other hand, encoder-decoder methods are good at image captioning and visual question answering but cannot perform retrieval-style tasks.

In “CoCa: Contrastive Captioners are Image-Text Foundation Models”, we present a unified vision backbone model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder approach that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be directly applicable for all types of downstream tasks. Specifically, CoCa achieves state-of-the-art results on a series of vision and vision-language tasks spanning vision recognition, cross-modal alignment, and multimodal understanding. Furthermore, it learns highly generic representations so that it can perform as well or better than fully fine-tuned models with zero-shot learning or frozen encoders.

Overview of Contrastive Captioners (CoCa) compared to single-encoder, dual-encoder and encoder-decoder models.

Method
We propose CoCa, a unified training framework that combines contrastive loss and captioning loss on a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging single-encoder, dual-encoder and encoder-decoder paradigms.

To this end, we present a novel encoder-decoder architecture where the encoder is a vision transformer (ViT), and the text decoder transformer is decoupled into two parts, a unimodal text decoder and a multimodal text decoder. We skip cross-attention in unimodal decoder layers to encode text-only representations for contrastive loss, and cascade multimodal decoder layers with cross-attention to image encoder outputs to learn multimodal image-text representations for captioning loss. This design maximizes the model’s flexibility and universality in accommodating a wide spectrum of tasks, and at the same time, it can be efficiently trained with a single forward and backward propagation for both training objectives, resulting in minimal computational overhead. Thus, the model can be trained end-to-end from scratch with training costs comparable to a naïve encoder-decoder model.

Illustration of forward propagation used by CoCa for both contrastive and captioning losses.

Benchmark Results
The CoCa model can be directly fine-tuned on many tasks with minimal adaptation. By doing so, our model achieves a series of state-of-the-art results on popular vision and multimodal benchmarks, including (1) visual recognition: ImageNet, Kinetics-400/600/700, and MiT; (2) cross-modal alignment: MS-COCO, Flickr30K, and MSR-VTT; and (3) multimodal understanding: VQA, SNLI-VE, NLVR2, and NoCaps.

Comparison of CoCa with other image-text backbone models (without task-specific customization) and multiple state-of-the-art task-specialized models.

It is noteworthy that CoCa attains these results as a single model adapted for all tasks while often lighter than prior top-performing specialized models. For example, CoCa obtains 91.0% ImageNet top-1 accuracy while using less than half the parameters of prior state-of-the-art models. In addition, CoCa also obtains strong generative capability of high-quality image captions.

Image classification scaling performance comparing fine-tuned ImageNet top-1 accuracy versus model size.
Text captions generated by CoCa with NoCaps images as input.

Zero-Shot Performance
Besides achieving excellent performance with fine-tuning, CoCa also outperforms previous state-of-the-art models on zero-shot learning tasks, including image classification,and cross-modal retrieval. CoCa obtains 86.3% zero-shot accuracy on ImageNet while also robustly outperforming prior models on challenging variant benchmarks, such as ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. As shown in the figure below, CoCa obtains better zero-shot accuracy with smaller model sizes compared to prior methods.

Image classification scaling performance comparing zero-shot ImageNet top-1 accuracy versus model size.

Frozen Encoder Representation
One particularly exciting observation is that CoCa achieves results comparable to the best fine-tuned models using only a frozen visual encoder, in which features extracted after model training are used to train a classifier, rather than the more computationally intensive effort of fine-tuning a model. On ImageNet, a frozen CoCa encoder with a learned classification head obtains 90.6% top-1 accuracy, which is better than the fully fine-tuned performance of existing backbone models (90.1%). We also find this setup to work extremely well for video recognition. We feed sampled video frames into the CoCa frozen image encoder individually, and fuse output features by attentional pooling before applying a learned classifier. This simple approach using a CoCa frozen image encoder achieves video action recognition top-1 accuracy of 88.0% on Kinetics-400 dataset and demonstrates that CoCa learns a highly generic visual representation with the combined training objectives.

Comparison of Frozen CoCa visual encoder with (multiple) best-performing fine-tuned models.

Conclusion
We present Contrastive Captioner (CoCa), a novel pre-training paradigm for image-text backbone models. This simple method is widely applicable to many types of vision and vision-language downstream tasks, and obtains state-of-the-art performance with minimal or even no task-specific adaptations.

Acknowledgements
We would like to thank our co-authors Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu who have been involved in all aspects of the project. We also would like to thank Yi-Ting Chen, Kaifeng Chen, Ye Xia, Zhen Li, Chao Jia, Yinfei Yang, Zhengdong Zhang, Wei Han, Yuan Cao, Tao Zhu, Futang Peng, Soham Ghosh, Zihang Dai, Xin Li, Anelia Angelova, Jason Baldridge, Izhak Shafran, Shengyang Dai, Abhijit Ogale, Zhifeng Chen, Claire Cui, Paul Natsev, Tom Duerig for helpful discussions, Andrew Dai for help with contrastive models, Christopher Fifty and Bowen Zhang for help with video models, Yuanzhong Xu for help with model scaling, Lucas Beyer for help with data preparation, Andy Zeng for help with MSR-VTT evaluation, Hieu Pham and Simon Kornblith for help with zero-shot evaluations, Erica Moreira and Victor Gomes for help with resource coordination, Liangliang Cao for proofreading, Tom Small for creating the animations used in this blogpost, and others in the Google Brain team for support throughout this project.

Read More

Powering Next Generation Applications with OpenAI Codex


Powering Next Generation Applications with OpenAI Codex

OpenAI Codex, a natural language-to-code system based on GPT-3, helps turn simple English instructions into over a dozen popular coding languages. Codex was released last August through our API and is the principal building block of GitHub Copilot.

Our motivation behind Codex is to supplement developers’ work and increase productivity. Codex helps computers to better understand people’s intent, which enables everyone to do more with computers. This is an integral part of our mission to build general-purpose AI that benefits all of humanity.

For enterprise customers, Microsoft’s Azure OpenAI Service provides developers with access to Codex and our other models, like GPT-3 and embeddings, along with enterprise-grade capabilities that are built into Microsoft Azure. At its Build conference today, Microsoft announced that Azure OpenAI Service—previously available by invitation only—is now available in a limited access preview. We’re already seeing new applications of Azure OpenAI Service across many industry verticals, from healthcare to financial services.

Applications and Industries

Since its release via our API, we’ve been working closely with developers to build on top of Codex. These applications utilize the system’s capabilities in a variety of categories including creativity, learning, productivity and problem solving.

Applications using Codex:

Powering Next Generation Applications with OpenAI Codex

GitHub Copilot is an AI pair programmer that provides suggestions for whole lines or entire functions right inside the code editor.

Through tight integration with Codex, GitHub Copilot can convert comments to code, autofill repetitive code, suggest tests and show alternatives.

Available for Visual Studio and Visual Studio Code, among other environments, GitHub Copilot works with a broad set of frameworks and languages, and for some programming languages suggests approximately 35% of the code generated by tens of thousands of developers who use it today.

Microsoft announced at its Build developer conference that GitHub Copilot will move to general availability this summer.

Powering Next Generation Applications with OpenAI Codex

Pygma aims to turn Figma designs into high-quality code.

Pygma utilizes Codex to turn Figma designs into different frontend frameworks and match the coding style and preferences of the developer. Codex enables Pygma to help developers do tasks instantly that previously could have taken hours.

“Codex has allowed me to integrate innovative features into my app with very little coding. As someone without a strong machine learning background, certain features like flexible code-tweaking would be incredibly difficult to build in-house. With Codex, it works almost out of the box.”

—Emile Paffard-Wray, Founder, Pygma

Powering Next Generation Applications with OpenAI Codex

Replit is a programming platform for any programming language that lets users collaborate live on projects, learn about code and share work with a community of learners and builders.

Replit leverages Codex to describe what a selection of code is doing in simple language so everyone can get quality explanation and learning tools. Users can highlight selections of code and click “Explain Code” to use Codex to understand its functionality.

“Codex helps learning on Replit better understand code they encounter. We’ve only scratched the surface of what semantic code understanding can offer those who want to get from idea to working code quickly.”

—Amjad Masad, Founder, Replit

Powering Next Generation Applications with OpenAI Codex

Warp is a Rust-based terminal, reimagined from the ground up to help both individuals and teams be more productive in the command-line.

Terminal commands are typically difficult to remember, find and construct. Users often have to leave the terminal and search the web for answers and even then the results might not give them the right command to execute. Warp uses Codex to allow users to run a natural language command to search directly from within the terminal and get a result they can immediately use.

“Codex allows Warp to make the terminal more accessible and powerful. Developers search for entire commands using natural language rather than trying to remember them or assemble them piecemeal. Codex-powered command search has become one of our game changing features.

—Zach Lloyd, Founder, Warp

Powering Next Generation Applications with OpenAI Codex

Machinet helps professional Java developers write quality code by using Codex to generate intelligent unit test templates.

Machinet was able to accelerate their development several-fold by switching from building their own machine learning systems to using Codex. The flexibility of Codex allows for the ability to easily add new features and capabilities saving their users time and helping them be more productive.

“Codex is an amazing tool in our arsenal. Not only does it allow us to generate more meaningful code, but it has also helped us find a new design of product architecture and got us out of a local maximum.”

—Vladislav Yanchenko, Founder, Machinet


OpenAI