GPT-3 Powers the Next Generation of Apps

Nine months since the launch of our first commercial product, the OpenAI API, more than 300 applications are now using GPT-3, and tens of thousands of developers around the globe are building on our platform. We currently generate an average of 4.5 billion words per day, and continue to scale production traffic.

Given any text prompt, such as a phrase or a sentence, GPT-3 returns a text completion in natural language. Developers can “program” GPT-3 by showing it just a few examples, or “prompts.” We’ve designed the API to be both simple for anyone to use and flexible enough to make machine learning teams more productive.
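
To make the few-shot pattern concrete, here is a minimal sketch using the openai Python package's Completion endpoint, the interface current at the time of this post; the engine name, prompt, and sampling settings are illustrative only, not a recommended recipe.

# A minimal sketch of "programming" GPT-3 with a few examples via the
# Completion endpoint of the openai Python package. The engine name,
# prompt, and sampling settings are illustrative.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# A few-shot prompt: two worked examples, then the input to complete.
prompt = """Correct the grammar of each sentence.

Sentence: she no went to the market.
Correction: She didn't go to the market.

Sentence: we was happy to see them.
Correction: We were happy to see them.

Sentence: him and me goes to school together.
Correction:"""

response = openai.Completion.create(
    engine="davinci",   # base GPT-3 engine exposed by the API at the time
    prompt=prompt,
    max_tokens=30,
    temperature=0.2,
    stop=["\n"],        # stop at the end of the corrected sentence
)

print(response["choices"][0]["text"].strip())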

Applications and industries

To date, over 300 apps are using GPT-3 across varying categories and industries, from productivity and education to creativity and games. These applications utilize a suite of GPT-3’s diverse capabilities (and have helped us discover new ones!). A few of these include:

Viable helps companies better understand their customers by using GPT-3 to provide useful insights from customer feedback in easy-to-understand summaries.

Using GPT-3, Viable identifies themes, emotions, and sentiment from surveys, help desk tickets, live chat logs, reviews, and more. It then pulls insights from this aggregated feedback and provides a summary in seconds.

For example, if asked, “What’s frustrating our customers about the checkout experience?”, Viable might provide the insight: “Customers are frustrated with the checkout flow because it takes too long to load. They also want a way to edit their address in checkout and save multiple payment methods.”

“GPT-3’s ability to identify themes from natural language and generate summaries allows Viable to give product, customer experience, and marketing teams at companies across industries a better understanding of their customers’ wants and needs,” said Daniel Erickson, CEO of Viable.

Visit Viable

Fable Studio is creating a new genre of interactive stories and using GPT-3 to help power their story-driven “Virtual Beings.”

Lucy, the hero of Neil Gaiman and Dave McKean’s Wolves in the Walls, which was adapted by Fable into the Emmy Award-winning VR experience, can have natural conversations with people thanks to dialogue generated by GPT-3. Lucy appeared as a guest at Sundance Film Festival 2021 and presented her own movie, Dracula.

“GPT-3 has given us the ability to give our characters life,” said Fable Studio CEO Edward Saatchi, adding, “We’re excited to combine an artist’s vision, AI, and emotional intelligence to create powerful narratives, and believe that one day, everyone will know a Virtual Being.”

Visit Fable Studio

Algolia uses GPT-3 in their Algolia Answers product to offer relevant, lightning-fast semantic search for their customers.

When the OpenAI API launched, Algolia partnered with OpenAI to integrate GPT-3 with their advanced search technology to create their new Answers product, which better understands customers’ questions and connects them to the specific part of the content that answers their questions. Algolia Answers helps publishers and customer support help desks query in natural language and surface nontrivial answers. After running tests of GPT-3 on 2.1 million news articles, Algolia saw 91% precision or better, and was able to accurately answer complex natural-language questions four times more often than with BERT.

“We’ve seen great results from Algolia Answers on questions that are difficult to answer with textual search alone,” said Peter Buffington, Product Manager at ABC Australia. “It was able to return very relevant, evergreen content from our news archives for questions such as ‘Why does a volcano erupt?’”

“GPT-3 allows Algolia to answer more complex queries than ever before with our Algolia Answers product, identifying deeper contextual information to improve the quality of results and deliver them in seconds,” said Dustin Coates, Product and GTM Manager at Algolia.

Visit Algolia

Platform improvements

As we scale access, our team is continually improving the platform—from implementing a content filter to offering new features for developers including our recently launched:

  • Answers endpoint: Searches provided information (documents, knowledge bases, etc.) for relevant context to be added to the prompt before completing with GPT-3. Can be used to build applications like customer support bots with no fine-tuning (see the sketch below for the general retrieval-then-prompt pattern).
  • Classifications endpoint: Can leverage labeled training data without fine-tuning. By searching for the closest examples with respect to the input query and adding them to the prompt, it often matches the performance of state-of-the-art fine-tuned models, providing an AutoML solution that is easy to configure and adapt.
  • Enhanced search endpoint: Provides the backbone for the Answers and Classifications endpoints, and scales to a large number of documents while remaining cheap and fast.
  • Safety: Bias and misuse are important, industry-wide problems we take very seriously. We review all applications and approve only those for production that use GPT-3 in a responsible manner. We require developers to implement safety measures such as rate limits, user verification and testing, or human-in-the-loop requirements before they move into production. We also actively monitor for signs of misuse as well as “red team” applications for possible vulnerabilities. Additionally, we have developed and deployed a content filter that classifies text as safe, sensitive, or unsafe. We currently have it set to err on the side of caution, which results in a higher rate of false positives.
  • Prompt library: Provides starter prompt design examples for dozens of use cases that users can begin programming with directly in Playground, like a Spreadsheet Generator, Grammar Corrector, or Airport Code Extractor.

Prompt design examples that users can begin programming with directly.
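
To illustrate the retrieval-then-prompt pattern behind the Answers endpoint mentioned in the list above, here is a deliberately simplified sketch: it ranks the provided documents with a naive keyword scorer and then completes with GPT-3 via the Completion endpoint, rather than calling the Answers endpoint itself. The documents and question are made up.

# Simplified sketch of the retrieval-then-prompt pattern behind the Answers
# endpoint: rank the provided documents against the question, add the best
# match to the prompt, then complete with GPT-3. This uses only the
# Completion endpoint plus a naive keyword scorer, not the Answers endpoint
# itself; the documents and question are made up.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

documents = [
    "Our store hours are 9am to 6pm, Monday through Saturday.",
    "Refunds are available within 30 days of purchase with a receipt.",
    "We ship to the US and Canada; international shipping is not offered.",
]
question = "What are your store hours?"

def keyword_score(doc, query):
    # Very rough relevance score: number of query words found in the document.
    doc_words = set(doc.lower().split())
    return sum(1 for word in query.lower().split() if word.strip("?.,") in doc_words)

best_doc = max(documents, key=lambda d: keyword_score(d, question))

prompt = (
    "Answer the customer's question using the context.\n\n"
    "Context: " + best_doc + "\n"
    "Question: " + question + "\n"
    "Answer:"
)

response = openai.Completion.create(
    engine="davinci", prompt=prompt, max_tokens=40, temperature=0, stop=["\n"]
)
print(response["choices"][0]["text"].strip())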

Our growing developer community

We have a growing community of tens of thousands of developers around the world, with the majority across North America, Europe, Asia, and Australia. We’ve also found that many of our developers tend to be those without a traditional AI or software engineering background. It’s been encouraging to hear from several of our developers that their first experience with an API or programming has been with OpenAI’s interface.

“For myself, and other mission-driven innovators, OpenAI has given us the tool we finally need to make transformative change in the community with GPT-3. With natural language processing, technical experience is no longer a barrier, and we can truly keep our focus on solving real-world problems. In my work with a lot of first-time developers, those who are most successful at building with GPT-3 are great communicators, as they are able to unlock the nuances of prompt design.”

Abran Maldonado
Co-Founder of Create Labs

“Programming with GPT-3 can feel like a much more creative process compared to traditional coding because of the natural language prompts. I believe AI will be integrated into every product in the future, and it’s been a pleasure working with developers of all experience levels from across the world who are creating innovative apps through the API.”

Natalie Pistunovich
Lead Developer Advocate at Aerospike
Founder of Women Techmakers Berlin

Call for developers

We think there are still many new capabilities of GPT-3 yet to be discovered and we want you to help us uncover them! In a similar spirit to our previous Requests for Research and Y Combinator’s Requests for Startups, we’d love to see our current and future developers push the limits of what’s possible with GPT-3 and build new applications in the following areas:

Productivity Tools
Healthcare and Biotechnology
Climate Science and Energy
Educational Technology and Learning Tools

We are happy to support hackathons and provide API access for these events, especially if they include challenges in the above areas (we of course are open to other challenge areas as well!). Please email community@openai.com with details about the event. We’re excited to see what our developers build next.

If you are interested in joining our Applied AI team, which focuses on bringing OpenAI’s technology and products to the world, we’re hiring!


Acknowledgments

Hannah Wong, Justin Jay Wang, Steve Dowling, Greg Brockman, Mira Murati, Peter Welinder, Sam Altman, Luke Miller, Katie Mayer, Steven Adler, David Schnurr, Maddie Simens, Miles Brundage, Gretchen Krueger, Andrew Mayne & Raf Jakubanis.

Multimodal Neurons in Artificial Neural Networks

We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.

Read Paper | View Code | Browse Microscope

Fifteen years ago, Quiroga et al. discovered that the human brain possesses multimodal neurons. These neurons respond to clusters of abstract concepts centered around a common high-level theme, rather than any specific visual feature. The most famous of these was the “Halle Berry” neuron, a neuron featured in both Scientific American and The New York Times, that responds to photographs, sketches, and the text “Halle Berry” (but not other names).

Two months ago, OpenAI announced CLIP, a general-purpose vision system that matches the performance of a ResNet-50, but outperforms existing vision systems on some of the most challenging datasets. Each of these challenge datasets, ObjectNet, ImageNet Rendition, and ImageNet Sketch, stress tests the model’s robustness to recognizing objects not just under simple distortions or changes in lighting or pose, but also under complete abstraction and reconstruction: sketches, cartoons, and even statues of the objects.

Now, we’re releasing our discovery of the presence of multimodal neurons in CLIP. One such neuron, for example, is a “Spider-Man” neuron (bearing a remarkable resemblance to the “Halle Berry” neuron) that responds to an image of a spider, an image of the text “spider,” and the comic book character “Spider-Man” either in costume or illustrated.

Our discovery of multimodal neurons in CLIP gives us a clue as to what may be a common mechanism of both synthetic and natural vision systems—abstraction. We discover that the highest layers of CLIP organize images as a loose semantic collection of ideas, providing a simple explanation for both the model’s versatility and the representation’s compactness.

Biological neurons, such as the famed Halle Berry neuron, do not fire for visual clusters of ideas, but for semantic clusters. At the highest layers of CLIP, we find similar semantic invariance. Note that images are replaced by higher-resolution substitutes from Quiroga et al., and that the images from Quiroga et al. are themselves substitutes of the original stimuli.

Using the tools of interpretability, we give an unprecedented look into the rich visual concepts that exist within the weights of CLIP. Within CLIP, we discover high-level concepts that span a large subset of the human visual lexicon—geographical regions, facial expressions, religious iconography, famous people and more. By probing what each neuron affects downstream, we can get a glimpse into how CLIP performs its classification.

Multimodal neurons in CLIP

Our paper builds on nearly a decade of research into interpreting convolutional networks, beginning with the observation that many of these classical techniques are directly applicable to CLIP. We employ two tools to understand the activations of the model: feature visualization, which maximizes a neuron’s firing by doing gradient-based optimization on the input, and dataset examples, which look at the distribution of maximally activating images for a neuron from a dataset.
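
As a rough illustration of the first of these tools, the sketch below runs plain gradient ascent on an input image to maximize one channel of a pretrained torchvision ResNet-50; the layer and channel are arbitrary, and it omits the transformations and other regularizers that practical feature visualization relies on, so it should be read as a toy version of the idea rather than the procedure used for CLIP.

# Toy feature visualization: gradient ascent on the input image to maximize
# the mean activation of one channel in a pretrained torchvision ResNet-50.
# The layer and channel are arbitrary, and the transformations and other
# regularizers used in practice are omitted.
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
target_channel = 100  # arbitrary channel to visualize

activations = {}
def hook(module, inputs, output):
    activations["feat"] = output

handle = model.layer4.register_forward_hook(hook)  # layer choice is illustrative

# Start from random noise and optimize the pixels directly.
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    model(image)
    # Maximize the mean activation of the chosen channel (minimize its negative).
    loss = -activations["feat"][0, target_channel].mean()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: activation {-loss.item():.3f}")

handle.remove()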

Using these simple techniques, we’ve found the majority of the neurons in CLIP RN50x4 (a ResNet-50 scaled up 4x using the EfficientNet scaling rule) to be readily interpretable. Indeed, these neurons appear to be extreme examples of “multi-faceted neurons,” neurons that respond to multiple distinct cases, only at a higher level of abstraction.


[Interactive figure: feature visualizations of selected neurons, one per concept (summer, winter, shocked, mid-1900s, self + relief, Christmas, Roman art, child’s drawing, USA, India, heart, West Africa), each shown across the facets Any, Text, Face, Logo, Architecture, Indoor, Nature, and Pose.]

Selected neurons from the final layer of four CLIP models. Each neuron is represented by a feature visualization with a human-chosen concept label to help quickly provide a sense of each neuron. Labels were picked after looking at hundreds of stimuli that activate the neuron, in addition to feature visualizations. We chose to include some of the examples here to demonstrate the model’s proclivity towards stereotypical depictions of regions, emotions, and other concepts. We also see discrepancies in the level of neuronal resolution: while certain countries like the US and India were associated with well-defined neurons, the same was not true of countries in Africa, where neurons tended to fire for entire regions. We discuss some of these biases and their implications in later sections.

Indeed, we were surprised to find many of these categories appear to mirror neurons in the medial temporal lobe documented in epilepsy patients with intracranial depth electrodes. These include neurons that respond to emotions, animals, and famous people.

But our investigation into CLIP reveals many more such strange and wonderful abstractions, including neurons that appear to count [17, 202, 310], neurons responding to art styles [75, 587, 122], and even neurons responding to images with evidence of digital alteration [1640].

Absent concepts

While this analysis shows a great breadth of concepts, we note that a simple analysis on a neuron level cannot represent a complete documentation of the model’s behavior. The authors of CLIP have demonstrated, for example, that the model is capable of very precise geolocation (Appendix E.4, Figure 20), with a granularity that extends down to the level of a city and even a neighborhood. In fact, we offer an anecdote: we have noticed, by running our own personal photos through CLIP, that CLIP can often recognize if a photo was taken in San Francisco, and sometimes even the neighborhood (e.g., “Twin Peaks”).

Despite our best efforts, however, we have not found a “San Francisco” neuron, nor did it seem from attribution that San Francisco decomposes nicely into meaningful unit concepts like “California” and “city.” We believe this information to be encoded within the activations of the model somewhere, but in a more exotic way, either as a direction or as some other more complex manifold. We believe this to be a fruitful direction for further research.

How multimodal neurons compose

These multimodal neurons can give us insight into understanding how CLIP performs classification. With a sparse linear probe, we can easily inspect CLIP’s weights to see which concepts combine to achieve a final ImageNet classification:

piggy bank = 2.5 × finance + 1.1 × (dolls, toys) + ···

barn spider = 2.9 × Spider-Man + 1.5 × animal + ···

The piggy bank class appears to be a composition of a “finance” neuron along with a porcelain neuron. The Spider-Man neuron referenced in the first section of the paper is also a spider detector, and plays an important role in the classification of the class “barn spider.”
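
Since a linear probe's class score is just a weighted sum of neuron activations, reading off the top contributions is straightforward. The sketch below uses random stand-ins for the feature vector and probe weights; with a real fitted (sparse) probe over CLIP activations, the top entries would correspond to neurons like those shown above.

# Reading a linear probe's decision as a sum of per-neuron contributions.
# The feature vector and probe weights here are random stand-ins; with a
# fitted sparse probe over CLIP activations, the top entries would name
# neurons like the "finance" neuron above.
import numpy as np

num_neurons, num_classes = 2048, 1000        # illustrative sizes
rng = np.random.default_rng(0)
features = rng.random(num_neurons)           # stand-in for one image's activations
probe_weights = rng.normal(size=(num_classes, num_neurons))  # stand-in probe

class_idx = 123                              # some class, e.g. "piggy bank"
contributions = probe_weights[class_idx] * features   # per-neuron contribution
logit = contributions.sum()

top = np.argsort(contributions)[::-1][:5]
print(f"logit for class {class_idx}: {logit:.2f}")
for neuron in top:
    print(f"  neuron {neuron}: {contributions[neuron]:+.2f}")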

For text classification, a key observation is that these concepts are contained within neurons in a way that, similar to the word2vec objective, is almost linear. The concepts, therefore, form a simple algebra that behaves similarly to a linear probe. By linearizing the attention, we too can inspect any sentence, much like a linear probe, as shown below:

surprised = 1.0 × (celebration, hug) + 1.0 × shock + 0.17 × (smile, grin)

intimate = 1.0 × (soft smile) + 0.92 × heart − 0.8 × illness

Probing how CLIP understands words, it appears to the model that the word “surprised” implies not just some measure of shock, but a shock of a very specific kind, one combined perhaps with delight or wonder. “Intimate” consists of a soft smile and hearts, but not sickness. We note that this reveals a reductive understanding of the full human experience of intimacy: the subtraction of illness precludes, for example, intimate moments with loved ones who are sick. We find many such omissions when probing CLIP’s understanding of language.

Fallacies of abstraction

The degree of abstraction in CLIP surfaces a new vector of attack that we believe has not manifested in previous systems. Like many deep networks, the representations at the highest layers of the model are completely dominated by such high-level abstractions. What distinguishes CLIP, however, is a matter of degree—CLIP’s multimodal neurons generalize across the literal and the iconic, which may be a double-edged sword.

Through a series of carefully constructed experiments, we demonstrate that we can exploit this reductive behavior to fool the model into making absurd classifications. We have observed that the excitations of the neurons in CLIP are often controllable by its response to images of text, providing a simple vector for attacking the model.

The finance neuron [1330], for example, responds to images of piggy banks, but also responds to the string “$$$”. By forcing the finance neuron to fire, we can fool our model into classifying a dog as a piggy bank.
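
The sketch below shows how such an experiment might be set up with the released CLIP package, scoring the same pair of captions against a plain image and a copy with "$$$" written across it; the image files are placeholders, and whether the label actually flips depends on the particular images.

# Scoring the same captions against a plain photo and the same photo with
# "$$$" written on it, using the released CLIP package
# (https://github.com/openai/CLIP). The image files are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x4", device=device)

captions = ["a photo of a dog", "a photo of a piggy bank"]
text = clip.tokenize(captions).to(device)

for path in ["dog.png", "dog_with_dollar_signs.png"]:  # hypothetical files
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
    print(path, {c: round(p.item(), 3) for c, p in zip(captions, probs)})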

Attacks in the wild

We refer to these attacks as typographic attacks. We believe attacks such as those described above are far from simply an academic concern. By exploiting the model’s ability to read text robustly, we find that even photographs of hand-written text can often fool the model. Like the Adversarial Patch, this attack works in the wild; but unlike such attacks, it requires no more technology than pen and paper.

We believe that these attacks may also take a more subtle, less conspicuous form. An image, given to CLIP, is abstracted in many subtle and sophisticated ways, and these abstractions may over-abstract common patterns, oversimplifying and, by virtue of that, overgeneralizing.

Bias and overgeneralization

Our model, despite being trained on a curated subset of the internet, still inherits its many unchecked biases and associations. Many of the associations we have discovered appear to be benign, yet we have found several cases where CLIP holds associations that could result in representational harm, such as denigration of certain individuals or groups.

We have observed, for example, a “Middle East” neuron [1895] with an association with terrorism, and an “immigration” neuron [395] that responds to Latin America. We have even found a neuron that fires for both dark-skinned people and gorillas [1257], mirroring earlier photo tagging incidents in other models that we consider unacceptable.

These associations present obvious challenges to applications of such powerful visual systems.[1] Whether fine-tuned or used zero-shot, it is likely that these biases and associations will remain in the system, with their effects manifesting in both visible and nearly invisible ways during deployment. Many biased behaviors may be difficult to anticipate a priori, making their measurement and correction difficult. We believe that these tools of interpretability may give practitioners the ability to preempt potential problems, by discovering some of these associations and ambiguities ahead of time.

Our own understanding of CLIP is still evolving, and we are still determining if and how we would release large versions of CLIP. We hope that further community exploration of the released versions as well as the tools we are announcing today will help advance general understanding of multimodal systems, as well as inform our own decision-making.

Conclusion

Alongside the publication of “Multimodal Neurons in Artificial Neural Networks,” we are also releasing some of the tools we have ourselves used to understand CLIP—the OpenAI Microscope catalog has been updated with feature visualizations, dataset examples, and text feature visualizations for every neuron in CLIP RN50x4. We are also releasing the weights of CLIP RN50x4 and RN101 to further accommodate such research. We believe these investigations of CLIP only scratch the surface in understanding CLIP’s behavior, and we invite the research community to join us in improving our understanding of CLIP and models like it.

Scaling Kubernetes to 7,500 Nodes

We’ve scaled Kubernetes clusters to 7,500 nodes, producing a scalable infrastructure for large models like GPT-3, CLIP, and DALL·E, but also for rapid small-scale iterative research such as Scaling Laws for Neural Language Models. Scaling a single Kubernetes cluster to this size is rarely done and requires some special care, but the upside is a simple infrastructure that allows our machine learning research teams to move faster and scale up without changing their code.

Since our last post on Scaling to 2,500 Nodes we’ve continued to grow our infrastructure to meet researcher needs, in the process learning many additional lessons. This post summarizes those lessons so that others in the Kubernetes community can benefit from them, and ends with problems we still face that we’ll be tackling next.

Our workload

Before we get too far, it’s important to describe our workload. The applications and hardware we run with Kubernetes are pretty different from what you may encounter at a typical company. Our problems and corresponding solutions may, or may not, be a good match to your own setup!

A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node. NUMA, CPU, or PCIe resource contention isn’t a factor in scheduling. Bin-packing or fragmentation is not a common problem. Our current clusters have full bisection bandwidth, so we also don’t make any rack or network topology considerations. All of this means that, while we have many nodes, there’s relatively low strain on the scheduler.

That said, strain on the kube-scheduler is spiky. A new job may consist of many hundreds of pods all being created at once, then return to a relatively low rate of churn.

Our biggest jobs run MPI, and all pods within the job are participating in a single MPI communicator. If any of the participating pods die, the entire job halts and needs to be restarted. The job checkpoints regularly, and when restarted it resumes from the last checkpoint. Thus we consider the pods to be semi-stateful—killed pods can be replaced and work can continue, but doing so is disruptive and should be kept to a minimum.

We don’t rely on Kubernetes load balancing all that much. We have very little HTTPS traffic, with no need for A/B testing, blue/green, or canaries. Pods communicate directly with one another on their pod IP addresses with MPI via SSH, not service endpoints. Service “discovery” is limited; we just do a one-time lookup for which pods are participating in MPI at job startup time.

Most jobs interact with some form of blob storage. They usually either stream some shards of a dataset or checkpoint directly from blob storage, or cache it to a fast local ephemeral disk. We have a few PersistentVolumes for cases where POSIX semantics are useful, but blob storage is far more scalable and doesn’t require slow detach/attach operations.

Lastly, the nature of our work is fundamentally research, which means the workloads themselves are ever-changing. While the Supercomputing team strives to provide what we’d consider a “production” quality level of compute infrastructure, the applications that run on that cluster are short-lived and their developers iterate quickly. New usage patterns may emerge at any time that challenge our assumptions about trends and appropriate tradeoffs. We need a sustainable system that also allows us to respond quickly when things change.

Networking

As the number of nodes and pods within our clusters increased, we found that Flannel had difficulties scaling up the throughput required. We switched to using the native pod networking technologies for our respective cloud providers (Alias IPs for Managed Instance Groups in GCP, and IP Configurations for VMSSes in Azure) and the relevant CNI plugins. This allowed us to get host level network throughput on our pods.

Another reason we’ve switched to using alias-based IP addressing is that on our largest clusters, we could possibly have approximately 200,000 IP addresses in use at any one time. When we tested route-based pod networking, we found there were significant limitations in the number of routes we could effectively use.

Avoiding encapsulation increases the demands on the underlying SDN or routing engine, but it keeps our networking setup simple. Adding VPN or tunneling can be done without any additional adapters. We don’t need to worry about packet fragmentation due to some portion of the network having a lower MTU. Network policies and traffic monitoring are straightforward; there’s no ambiguity about the source and destination of packets.

We use iptables tagging on the host to track network resource usage per Namespace and pod. This lets researchers visualize their network usage patterns. In particular, since a lot of our experiments have distinct Internet and intra-pod communication patterns, it’s often useful to be able to investigate where any bottlenecks might be occurring.

iptables mangle rules can be used to arbitrarily mark packets that match particular criteria. Here are our rules to detect whether traffic is internal or internet-bound. The FORWARD rules cover traffic from pods, while INPUT and OUTPUT cover traffic from the host:

iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"

Once marked, iptables will start counters to track the number of bytes and packets that match this rule. You can eyeball these counters by using iptables itself:

% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
....
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */

We use an open-source Prometheus exporter called iptables-exporter to get these counters tracked in our monitoring system. This is a simple way to track packets matching a variety of different types of conditions.
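
For illustration only (this is not iptables-exporter itself), a minimal script in the same spirit might read the exact byte counters for rules carrying these comments and total them per traffic label; it assumes root privileges and the counter layout of `iptables -t mangle -L -v -x -n` output.

# Simplified stand-in for iptables-exporter: read the byte counters of rules
# carrying our "iptables-exporter ... traffic=..." comments and total them per
# label. Assumes root privileges; parses `iptables -t mangle -L -v -x -n`.
import re
import subprocess

output = subprocess.run(
    ["iptables", "-t", "mangle", "-L", "-v", "-x", "-n"],
    check=True, capture_output=True, text=True,
).stdout

counters = {}
for line in output.splitlines():
    match = re.search(r"/\* iptables-exporter \S+ (traffic=\S+) \*/", line)
    if match:
        fields = line.split()
        label = match.group(1)
        counters[label] = counters.get(label, 0) + int(fields[1])  # bytes column

for label, total_bytes in sorted(counters.items()):
    print(label, total_bytes, "bytes")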

One somewhat unique aspect of our network model is that we fully expose the node, pod, and service network CIDR ranges to our researchers. We have a hub and spoke network model, and use the native node and pod CIDR ranges to route that traffic. Researchers connect to the hub, and from there have access to any of the individual clusters (the spokes). But the clusters themselves cannot talk to one another. This ensures that clusters remain isolated with no cross-cluster dependencies that can break failure isolation.

We use a “NAT” host to translate the service network CIDR range for traffic coming from outside of the cluster. This setup gives our researchers significant flexibility in the kinds of network configurations they can choose for their experiments.

API Servers

Kubernetes API Servers and etcd are critical components to a healthy working cluster, so we pay special attention to the stress on these systems. We use the Grafana dashboards provided by kube-prometheus, as well as additional in-house dashboards. We’ve found it useful to alert on the rate of HTTP status 429 (Too Many Requests) and 5xx (Server Error) on the API Servers as a high-level signal of problems.

While some folks run API Servers within kube, we’ve always run them outside the cluster itself. Both etcd and API servers run on their own dedicated nodes. Our largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimize impact if one were to ever go down. We’ve had no notable trouble with etcd since splitting out Kubernetes Events into their own etcd cluster back in our last blog post. API Servers are stateless and generally easy to run in a self-healing instance group or scaleset. We haven’t yet tried to build any self-healing automation of etcd clusters because incidents have been extremely rare.

API Servers can take up a fair bit of memory, and that tends to scale linearly with the number of nodes in the cluster. For our cluster with 7,500 nodes we observe up to 70GB of heap being used per API Server, so fortunately this should continue to be well within hardware capabilities into the future.

One big strain on API Servers was WATCHes on Endpoints. There are a few services, such as ‘kubelet’ and ‘node-exporter’, of which every node in the cluster is a member. When a node was added to or removed from the cluster, this WATCH would fire. And because each node itself was typically watching the kubelet service via kube-proxy, the number of responses and the bandwidth required scaled as $N^2$ and could be enormous, occasionally 1GB/s or more. EndpointSlices, launched in Kubernetes 1.17, were a huge benefit that brought this load down 1000x.

In general we are very mindful of any API Server requests that scale with the size of the cluster. We try to avoid having any DaemonSets interact with the API Server. In cases where you do need each node to watch for changes, introducing an intermediary caching service, such as the Datadog Cluster Agent, seems to be a good pattern to avoid cluster-wide bottlenecks.

As our clusters have grown, we do less actual autoscaling of our clusters. But we have run into trouble occasionally when autoscaling too much at once. There are many requests generated when a new node joins a cluster, and adding hundreds of nodes at once can overload API server capacity. Smoothing this out, even just by a few seconds, has helped avoid outages.

Time-series metrics with Prometheus and Grafana

We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts. We started with a deployment of kube-prometheus that collects a wide variety of metrics and good dashboards for visualization. Over time we’ve added many of our own dashboards, metrics, and alerts.

As we added more and more nodes, we struggled with the sheer amount of metrics being collected by Prometheus. While kube-prometheus exposes a lot of useful data, some of it we weren’t actually ever looking at, and some was just too granular to collect, store, and query effectively. We use Prometheus rules to “drop” some of these metrics from being ingested.

For a while we struggled with a problem where Prometheus would consume more and more memory until eventually crashing the container with an Out-Of-Memory (OOM) error. This seemed to occur even after throwing enormous amounts of memory capacity at the application. What’s worse, when it did crash, it would take many hours on startup replaying write-ahead log files before it was usable again.

Eventually we tracked down the source of these OOMs to be an interaction between Grafana and Prometheus, where Grafana would use the /api/v1/series API on Prometheus with a query of {le!=""} (Basically, “give me all the histogram metrics”). The implementation of /api/v1/series was unbounded in both time and space—for a query with a lot of results, this would continue to consume ever-more memory and time. It also continues to grow even after the requester has given up and closed the connection. For us, there was never enough memory, and Prometheus would eventually crash. We patched Prometheus to contain this API within a Context to enforce a timeout, which fixed it entirely.

While Prometheus crashed far less often, in times when we did need to restart it, WAL replay remained an issue. It would often take many hours to replay through all WAL logs before Prometheus was up collecting new metrics and servicing queries. With help from Robust Perception, we found that setting GOMAXPROCS=24 was a big improvement. Prometheus tries to use all cores during WAL replay, and for servers with a large number of cores, the contention kills all performance.

We’re exploring new options to increase our monitoring capacity, described in the “Unsolved problems” section below.

Healthchecks

With a cluster this large, we of course rely on automation to detect and remove misbehaving nodes from the cluster. Over time we have built up a number of healthcheck systems.

Passive healthchecks

Some healthchecks are passive, always running on all nodes. These monitor basic system resources such as network reachability, bad or full disks, or GPU errors. GPUs exhibit problems in a number of different ways, but an easy common one is an “Uncorrectable ECC error.” Nvidia’s Data Center GPU Manager (DCGM) tools make it easy to query for this and a number of other “Xid” errors. One way we track these errors is via dcgm-exporter to ingest the metrics into Prometheus, our monitoring system. This appears as the DCGM_FI_DEV_XID_ERRORS metric, set to the error code that has most recently occurred. Additionally, the NVML Device Query API exposes more detailed information about the health and operation of a GPU.

Once we detect an error, it can often be fixed by resetting the GPU or the system, though in some cases the underlying GPU needs to be physically replaced.

Another form of healthcheck tracks maintenance events from the upstream cloud provider. Each of the major cloud providers exposes a way to know if the current VM is due for an upcoming maintenance event that will eventually cause a disruption. The VM may need to be rebooted so an underlying hypervisor patch can be applied, or the physical node swapped out for other hardware.

These passive healthchecks run constantly in the background on all nodes. If a healthcheck starts failing, the node is automatically cordoned so no new pods are scheduled on it. For more serious healthcheck failures, we will also attempt a pod eviction to request all currently-running pods to exit immediately. It’s still up to the pod itself, configurable via a Pod Disruption Budget, to decide if it wants to allow this eviction to occur. Eventually, either after all pods have terminated or 7 days have elapsed (part of our SLA), we will forcibly terminate the VM.
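
As a rough sketch of this detect-and-cordon flow (not our actual healthcheck system), one could query Prometheus for nodes reporting a DCGM XID error and mark them unschedulable through the Kubernetes API; the Prometheus address and the label carrying the node name are assumptions about a typical dcgm-exporter setup.

# Rough sketch of a detect-and-cordon loop, not our actual automation:
# ask Prometheus which nodes report a DCGM XID error, then cordon them.
# The Prometheus address and the label carrying the node name are assumptions
# about a typical dcgm-exporter deployment.
import requests
from kubernetes import client, config

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # assumed address

resp = requests.get(
    PROMETHEUS_URL + "/api/v1/query",
    params={"query": "DCGM_FI_DEV_XID_ERRORS > 0"},
    timeout=10,
)
resp.raise_for_status()
results = resp.json()["data"]["result"]

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for series in results:
    node = series["metric"].get("kubernetes_node")  # label name is an assumption
    if not node:
        continue
    print("cordoning", node, "xid =", series["value"][1])
    v1.patch_node(node, {"spec": {"unschedulable": True}})  # kubectl cordon equivalent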

Active GPU tests

Unfortunately, not all GPU problems manifest as error codes visible through DCGM. We’ve built up our own library of tests that exercise GPUs to catch additional problems and ensure that the hardware and driver are behaving as expected. These tests can’t be run in the background—they require exclusive use of a GPU for several seconds or minutes to run.

We first run these tests on nodes upon boot, in a system we call “preflight.” All nodes join the cluster with a “preflight” taint and label applied. This taint prevents normal pods from being scheduled on the node. A DaemonSet is configured to run preflight test pods on all nodes with this label. Upon successful completion of the test, the test itself removes the taint and label and the node is then available for general use.

We also then run these tests periodically during the lifetime of a node. We run this as a CronJob, allowing it to land on any available node in the cluster. This is admittedly a bit random and uncontrolled as to which nodes get tested, but we’ve found that over time it provides sufficient coverage with minimal coordination or disruption.

Quotas & resource usage

As we scaled up our clusters, researchers started to find themselves having difficulty getting all of the capacity that they were allocated. Traditional job scheduling systems have a lot of different features available to fairly run work between competing teams, which Kubernetes does not have. Over time, we took inspiration from those job scheduling systems and built several capabilities in a Kubernetes-native way.

Team taints

We have a service in each cluster, “team-resource-manager,” that has multiple functions. Its data source is a ConfigMap that specifies tuples of (node selector, team label to apply, allocation amount) for all of the research teams that have capacity in a given cluster. It reconciles this with the current nodes in the cluster, tainting the appropriate number of nodes with openai.com/team=teamname:NoSchedule.

“team-resource-manager” also has an admission webhook service, such that as each job is submitted, a corresponding toleration is applied based on the submitter’s team membership. Using taints allows us to constrain the Kubernetes pod scheduler flexibly, such as allowing an “any” toleration for lower-priority pods, which allows teams to borrow each other’s capacity without requiring heavyweight coordination.
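
A stripped-down sketch of the tainting half of that reconciliation is shown below, using the Kubernetes Python client; the selectors and allocation counts stand in for the ConfigMap contents, and the real service also reconciles continuously and runs the admission webhook that adds matching tolerations.

# Stripped-down sketch of the tainting half of "team-resource-manager":
# for each (node selector, team, allocation) entry, taint that many matching
# nodes with openai.com/team=<team>:NoSchedule. The selectors and counts
# stand in for the ConfigMap contents; the real service also reconciles
# continuously and runs an admission webhook that adds tolerations to jobs.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

allocations = [
    # (node label selector, team, number of nodes) -- invented values
    ("node.kubernetes.io/instance-type=gpu-8x", "team-alpha", 2),
    ("node.kubernetes.io/instance-type=cpu-big", "team-beta", 1),
]

for selector, team, count in allocations:
    nodes = v1.list_node(label_selector=selector).items
    for node in nodes[:count]:
        team_taint = {"key": "openai.com/team", "value": team, "effect": "NoSchedule"}
        # Preserve any unrelated taints already on the node.
        others = [
            {"key": t.key, "value": t.value, "effect": t.effect}
            for t in (node.spec.taints or [])
            if t.key != "openai.com/team"
        ]
        v1.patch_node(node.metadata.name, {"spec": {"taints": others + [team_taint]}})
        print("tainted", node.metadata.name, "for", team)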

CPU & GPU balloons

In addition to using cluster-autoscaler to dynamically scale our VM-backed clusters, we use it to remediate (remove & re-add) unhealthy members within the cluster. We do this by setting the “min size” of the cluster to zero, and the “max size” of the cluster to the capacity available. However, cluster-autoscaler, if it sees idle nodes, will attempt to scale down to only needed capacity. For multiple reasons (VM spin up latency, pre-allocated costs, the API server impacts mentioned above) this idle-scaling isn’t ideal.

So, we introduced a balloon Deployment for both our CPU-only and GPU hosts. This Deployment contains a ReplicaSet with “max size” number of low-priority pods. These pods occupy resources within a node, so the autoscaler doesn’t consider them as idle. However since they’re low priority, the scheduler can evict them immediately to make room for actual work. (We chose to use a Deployment instead of a DaemonSet, to avoid the DaemonSet being considered idle workload on a node.)
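
A minimal version of such a balloon Deployment, written with the Kubernetes Python client, might look like the sketch below; it assumes a pre-existing negative-priority PriorityClass named "balloon", uses pause containers as the placeholder workload, and leaves out the pod anti-affinity discussed next. The names, replica count, and resource requests are illustrative.

# Minimal sketch of a "balloon" Deployment: low-priority pause pods that hold
# capacity so the autoscaler doesn't see nodes as idle, yet can be evicted
# immediately for real work. Assumes a negative-priority PriorityClass named
# "balloon" already exists; names, replicas, and requests are illustrative,
# and the pod anti-affinity discussed below is omitted for brevity.
from kubernetes import client, config

config.load_kube_config()

balloon_pod = client.V1PodSpec(
    priority_class_name="balloon",         # assumed pre-created, negative priority
    containers=[
        client.V1Container(
            name="balloon",
            image="k8s.gcr.io/pause:3.2",  # does nothing, just holds the request
            resources=client.V1ResourceRequirements(
                requests={"cpu": "1", "memory": "1Gi"}
            ),
        )
    ],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="cpu-balloon", namespace="kube-system"),
    spec=client.V1DeploymentSpec(
        replicas=100,                      # roughly the cluster "max size"
        selector=client.V1LabelSelector(match_labels={"app": "cpu-balloon"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "cpu-balloon"}),
            spec=balloon_pod,
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="kube-system", body=deployment)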

One thing of note: we use pod anti-affinity to ensure the pods distribute evenly across the nodes. Earlier versions of the Kubernetes scheduler had an $O(N^2)$ performance issue with pod anti-affinity. This has been corrected since Kubernetes 1.18.

Gang scheduling

Our experiments often involve one or more StatefulSets, each operating a different portion of the training effort. For optimizers, researchers need all members of the StatefulSet to be scheduled before any training can be done (as we often use MPI to coordinate between optimizer members, and MPI is sensitive to group membership changes).

However, Kubernetes by default won’t necessarily prioritize fulfilling all requests from one StatefulSet over another. For example if two experiments each requested 100% of the cluster’s capacity, instead of scheduling all of one experiment or the other, Kubernetes might schedule only half of each experiment’s pods, leading to a deadlock where neither experiment can make progress.

We tried a few approaches that needed a custom scheduler, but ran into edge cases that caused conflicts with how normal pods were scheduled. Kubernetes 1.18 introduced a plugin architecture for the core Kubernetes scheduler, making it much easier to add features like this natively. We recently landed on the Coscheduling plugin as a good way to solve this problem.

Unsolved problems

There are many problems still to address as we scale up our Kubernetes clusters. A few of them include:

Metrics

At our scale we’ve had many difficulties with Prometheus’s built-in TSDB storage engine being slow to compact and needing a long time to replay the WAL (write-ahead log) any time it restarts. Queries also tend to result in “query processing would load too many samples” errors. We’re in the process of migrating to a different Prometheus-compatible storage and query engine. Look forward to a future blog post about how it goes!

Pod network traffic shaping

As we scale up our clusters, each pod is calculated to have a certain amount of Internet bandwidth available. The aggregate Internet bandwidth requirements per person have become substantial, and our researchers now have the ability to unintentionally put a significant resource strain on other locations on the Internet, such as sites hosting datasets for download and software package mirrors.

Conclusions

We’ve found Kubernetes to be an exceptionally flexible platform for our research needs. It has the ability to scale up to meet the most demanding workloads we’ve put on it. There are still many areas where it needs improvement, though, and the Supercomputing team at OpenAI will continue to explore how Kubernetes can scale. If this kind of work seems interesting, you should consider applying at OpenAI!

CLIP: Connecting Text and Images

We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP (Contrastive Language–Image Pre-training) can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.

Read paper | View code

Although deep learning has revolutionized computer vision, current approaches have several major problems: typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts; standard vision models are good at one task and one task only, and require significant effort to adapt to a new task; and models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision.

We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that’s abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks, without directly optimizing for the benchmark’s performance, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This is a key change: by not directly optimizing for the benchmark, we show that it becomes much more representative: our system closes this “robustness gap” by up to 75% while matching the performance of the original ResNet50 on ImageNet zero-shot without using any of the original 1.28M labeled examples.

Although both models have the same accuracy on the ImageNet test set, CLIP’s performance is much more representative of how it will fare on datasets that measure accuracy in different, non-ImageNet settings. For instance, ObjectNet checks a model’s ability to recognize objects in many different poses and with many different backgrounds inside homes while ImageNet Rendition and ImageNet Sketch check a model’s ability to recognize more abstract depictions of objects.

Background and related work

CLIP builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade but until recently was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richard Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word vector embedding space and showed this model could predict two unseen classes. The same year DeVISE scaled this approach and demonstrated that it was possible to fine-tune an ImageNet model so that it could generalize to correctly predicting objects outside the original 1000-class training set.

Most inspirational for CLIP is the work of Ang Li and his co-authors at FAIR who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset. They achieved this by fine-tuning an ImageNet CNN to predict a much wider set of visual concepts (visual n-grams) from the text of titles, descriptions, and tags of 30 million Flickr photos and were able to reach 11.5% accuracy on ImageNet zero-shot.

Finally, CLIP is part of a group of papers revisiting learning visual representations from natural language supervision in the past year. This line of work uses more modern architectures like the Transformer and includes VirTex, which explored autoregressive language modeling, ICMLM, which investigated masked language modeling, and ConVIRT, which studied the same contrastive objective we use for CLIP but in the field of medical imaging.

Approach

We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. Our method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which of 32,768 randomly sampled text snippets was actually paired with it in our dataset.
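To make the proxy task concrete, here is a minimal sketch of a symmetric contrastive objective of the kind described above: given a batch of image–text pairs, both encoders are trained so that the true pairings score higher than all mismatched ones. The encoders, batch size, and temperature below are illustrative placeholders, not CLIP’s actual training configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # image_features, text_features: [N, d] embeddings for N paired examples,
    # produced by an image encoder and a text encoder (not shown here).
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Cosine-similarity logits for every possible (image, text) pairing in the batch.
    logits = image_features @ text_features.t() / temperature  # [N, N]

    # The correct text for image i is entry i, and vice versa.
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_images = F.cross_entropy(logits, targets)      # pick the right text for each image
    loss_texts = F.cross_entropy(logits.t(), targets)   # pick the right image for each text
    return (loss_images + loss_texts) / 2
```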

In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs. cats, we check for each image whether a CLIP model predicts the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.


CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
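As a concrete illustration of the zero-shot classifier described above, the sketch below builds one caption per class, embeds the captions and the image, and picks the best-matching caption. The encode_image and encode_text functions are stand-ins for the pre-trained encoders, not a specific API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    # One caption per class; the caption embeddings act as the weights of a
    # linear classifier over the image embedding.
    captions = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        text_features = F.normalize(encode_text(captions), dim=-1)   # [C, d]
        image_features = F.normalize(encode_image(image), dim=-1)    # [1, d]
    similarities = image_features @ text_features.t()                # [1, C]
    return class_names[similarities.argmax(dim=-1).item()]
```

Swapping in a different list of class names is all it takes to target a different dataset, which is what makes this kind of classifier zero-shot.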

CLIP was designed to mitigate a number of major problems in the standard deep learning approach to computer vision:

Costly datasets: Deep learning needs a lot of data, and vision models have traditionally been trained on manually labeled datasets that are expensive to construct and only provide supervision for a limited number of predetermined visual concepts. The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text–image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior work, notably self-supervised learning, contrastive methods, self-training approaches, and generative modeling.

Narrow: An ImageNet model is good at predicting the 1000 ImageNet categories, but that’s all it can do “out of the box.” If we wish to perform any other task, an ML practitioner needs to build a new dataset, add an output head, and fine-tune the model. In contrast, CLIP can be adapted to perform a wide variety of visual classification tasks without needing additional training examples. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text-encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations. The accuracy of this classifier is often competitive with fully supervised models.

We show random, non-cherry picked, predictions of zero-shot CLIP classifiers on examples from various datasets below.


Poor real-world performance: Deep learning systems are often reported to achieve human or even superhuman performance[1] on vision benchmarks, yet when deployed in the wild, their performance can be far below the expectation set by the benchmark. In other words, there is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild. To verify the “cheating hypothesis”, we also measure how CLIP’s performance changes when it is able to “study” for ImageNet. When a linear classifier is fitted on top of CLIP’s features, it improves CLIP’s accuracy on the ImageNet test set by almost 10%. However, this classifier does no better on average across an evaluation suite of 7 other datasets measuring “robust” performance.
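For reference, the “linear classifier fitted on top of CLIP’s features” step mentioned above can be sketched as ordinary logistic regression over frozen embeddings. The encode_image function and the hyperparameters here are illustrative assumptions, not the exact evaluation protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(encode_image, train_images, train_labels, test_images, test_labels):
    # Extract frozen features once; only the linear head is trained.
    train_feats = np.stack([encode_image(x) for x in train_images])
    test_feats = np.stack([encode_image(x) for x in test_images])
    clf = LogisticRegression(max_iter=1000)  # hyperparameters are illustrative
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```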

Key takeaways

1. CLIP is highly efficient

CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used in a zero-shot manner. We know from GPT-2 and GPT-3 that models trained on such data can achieve compelling zero-shot performance; however, such models require significant training compute. To reduce the needed compute, we focused on algorithmic ways to improve the training efficiency of our approach.

We report two algorithmic choices that led to significant compute savings. The first choice is the adoption of a contrastive objective for connecting text with images. We originally explored an image-to-text approach, similar to VirTex, but encountered difficulties scaling this to achieve state-of-the-art performance. In small- to medium-scale experiments, we found that the contrastive objective used by CLIP is 4x to 10x more efficient at zero-shot ImageNet classification. The second choice was the adoption of the Vision Transformer, which gave us a further 3x gain in compute efficiency over a standard ResNet. In the end, our best-performing CLIP model trains on 256 GPUs for 2 weeks, which is similar to existing large-scale image models.


We originally explored training image-to-caption language models but found this approach struggled at zero-shot transfer. In this 16 GPU day experiment, a language model only achieves 16% accuracy on ImageNet after training for 400 million images. CLIP is much more efficient and achieves the same accuracy roughly 10x faster.

2. CLIP is flexible and general

Because they learn a wide range of visual concepts directly from natural language, CLIP models are significantly more flexible and general than existing ImageNet models. We find they are able to perform many different tasks zero-shot. To validate this we have measured CLIP’s zero-shot performance on over 30 different datasets including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.[2] In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models. Above, we visualize a random non-cherry picked prediction from each zero-shot classifier.

This finding is also reflected on a standard representation learning evaluation using linear probes. The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student EfficientNet-L2, on 20 out of 26 different transfer datasets we tested.


Across a suite of 27 datasets measuring tasks such as fine-grained object classification, OCR, activity recognition in videos, and geo-localization, we find that CLIP models learn more widely useful image representations. CLIP models are also more compute efficient than the models from 10 prior approaches that we compare with.

Limitations

While CLIP usually performs well on recognizing common objects, it struggles on more abstract or systematic tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo. On these two datasets, zero-shot CLIP is only slightly better than random guessing. Zero-shot CLIP also struggles compared to task specific models on very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.

CLIP also still has poor generalization to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on handwritten digits from the MNIST dataset, zero-shot CLIP only achieves 88% accuracy, well below the 99.75% of humans on the dataset. Finally, we’ve observed that CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial and error “prompt engineering” to perform well.

Broader impacts

CLIP allows people to design their own classifiers and removes the need for task-specific training data. The manner in which these classes are designed can heavily influence both model performance and model biases. For example, we find that when given a set of labels including Fairface race labels[3] and a handful of egregious terms such as “criminal”, “animal” etc. the model tends to classify images of people aged 0–20 in the egregious category at a rate of ~32.3%. However, when we add the class “child” to the list of possible classes, this behaviour drops to ~8.7%.

Additionally, given that CLIP does not need task-specific training data it can unlock certain niche tasks with greater ease. Some of these tasks may raise privacy or surveillance related risks and we explore this concern by studying the performance of CLIP on celebrity identification. CLIP has a top-1 accuracy of 59.2% for “in the wild” celebrity image classification when choosing from 100 candidates and a top-1 accuracy of 43.3% when choosing from 1000 possible choices. Although it’s noteworthy to achieve these results with task agnostic pre-training, this performance is not competitive when compared to widely available production level models. We further explore challenges that CLIP poses in our paper and we hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models. We are excited to engage with the research community on such questions.

Conclusion

With CLIP, we’ve tested whether task agnostic pre-training on internet scale natural language, which has powered a recent breakthrough in NLP, can also be leveraged to improve the performance of deep learning for other fields. We are excited by the results we’ve seen so far applying this approach to computer vision. Like the GPT family, CLIP learns a wide variety of tasks during pre-training which we demonstrate via zero-shot transfer. We are also encouraged by our findings on ImageNet that suggest zero-shot evaluation is a more representative measure of a model’s capability.


DALL·E: Creating Images from Text


DALL·E[1] is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog
[AI-generated images]

Text prompt: a store front that has the word ‘openai’ written on it […]
[AI-generated images]

Text prompt: an armchair in the shape of an avocado […]
[AI-generated images]

Text and image prompt: the exact same cat on the top as a sketch on the bottom
[AI-generated images]

GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

Overview

Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another.[2] This training procedure allows DALL·E to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.
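A simplified sketch of this single-stream setup, assuming the text and image have already been converted to discrete tokens (256 text tokens and 1024 image tokens per example), is shown below; the transformer and tokenizers are placeholders, and details such as token weighting are omitted.

```python
import torch
import torch.nn.functional as F

def training_step(transformer, text_tokens, image_tokens):
    # text_tokens: [B, 256], image_tokens: [B, 1024] discrete token ids.
    stream = torch.cat([text_tokens, image_tokens], dim=1)  # [B, 1280] single stream
    logits = transformer(stream[:, :-1])                    # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        stream[:, 1:].reshape(-1),
    )
```

Because the image tokens come after the text, sampling image tokens left to right given a text prompt (and, optionally, the top portion of an existing image) produces exactly the text-conditional generations and partial-image completions described here.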

We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer term ethical challenges implied by this technology.

Capabilities

We find that DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP, but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside of the interactive visuals.[3]

Controlling attributes

We test DALL·E’s ability to modify several of an object’s attributes, as well as the number of times that it appears.

Text prompt: a pentagonal green clock. a green clock in the shape of a pentagon.
[AI-generated images]
We find that DALL·E can render familiar objects in polygonal shapes that are sometimes unlikely to occur in the real world. For some objects, such as “picture frame” and “plate,” DALL·E can reliably draw the object in any of the polygonal shapes except heptagon. For other objects, such as “manhole cover” and “stop sign,” DALL·E’s success rate for more unusual shapes, such as “pentagon,” is considerably lower.

For several of the visuals in this post, we find that repeating the caption, sometimes with alternative phrasings, improves the consistency of the results.

Text prompt: a cube made of porcupine. a cube with the texture of a porcupine.
[AI-generated images]
We find that DALL·E can map the textures of various plants, animals, and other objects onto three dimensional solids. As in the preceding visual, we find that repeating the caption with alternative phrasing improves the consistency of the results.

Text prompt: a collection of glasses is sitting on a table
[AI-generated images]
We find that DALL·E is able to draw multiple copies of an object when prompted to do so, but is unable to reliably count past three. When prompted to draw nouns for which there are multiple meanings, such as “glasses,” “chips,” and “cups,” it sometimes draws both interpretations, depending on the plural form that is used.


Drawing multiple objects

Simultaneously controlling multiple objects, their attributes, and their spatial relationships presents a new challenge. For example, consider the phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and green pants”. To correctly interpret this sentence, DALL·E must not only correctly compose each piece of apparel with the animal, but also form the associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, green) without mixing them up.[4] We test DALL·E’s ability to do this for relative positioning, stacking objects, and controlling multiple attributes.

Text prompt: a small red block sitting on a large green block
[AI-generated images]
We find that DALL·E correctly responds to some types of relative positions, but not others. The choices “sitting on” and “standing in front of” sometimes appear to work, while “sitting below,” “standing behind,” “standing left of,” and “standing right of” do not. DALL·E also has a lower success rate when asked to draw a large object sitting on top of a smaller one, when compared to the other way around.

Text prompt: a stack of 3 cubes. a red cube is on the top, sitting on a green cube. the green cube is in the middle, sitting on a blue cube. the blue cube is on the bottom.
[AI-generated images]
We find that DALL·E typically generates an image with one or two of the objects having the correct colors. However, only a few samples for each setting tend to have exactly three objects colored precisely as specified.

Text prompt: an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants
[AI-generated images]
We find that DALL·E typically generates an image with two or three articles of clothing having the correct colors. However, only a few of the samples for each setting tend to have all four articles of clothing with the specified colors.


While DALL·E does offer some level of controllability over the attributes and positions of a small number of objects, the success rate can depend on how the caption is phrased. As more objects are introduced, DALL·E is prone to confusing the associations between the objects and their colors, and the success rate decreases sharply. We also note that DALL·E is brittle with respect to rephrasing of the caption in these scenarios: alternative, semantically equivalent captions often yield no correct interpretations.

Visualizing perspective and three-dimensionality

We find that DALL·E also allows for control over the viewpoint of a scene and the 3D style in which a scene is rendered.

Text prompt: an extreme close-up view of a capybara sitting in a field
[AI-generated images]
We find that DALL·E can draw each of the animals in a variety of different views. Some of these views, such as “aerial view” and “rear view,” require knowledge of the animal’s appearance from unusual angles. Others, such as “extreme close-up view,” require knowledge of the fine-grained details of the animal’s skin or fur.

Text prompt: a capybara made of voxels sitting in a field
[AI-generated images]
We find that DALL·E is often able to modify the surface of each of the animals according to the chosen 3D style, such as “claymation” and “made of voxels,” and render the scene with plausible shading depending on the location of the sun. The “x-ray” style does not always work reliably, but it shows that DALL·E can sometimes orient the bones within the animal in plausible (though not anatomically correct) configurations.


To push this further, we test DALL·E’s ability to repeatedly draw the head of a well-known figure at each angle from a sequence of equally spaced angles, and find that we can recover a smooth animation of the rotating head.

Text and image prompt: a photograph of a bust of homer
[AI-generated images]
We prompt DALL·E with both a caption describing a well-known figure and the top region of an image showing a hat drawn at a particular angle. Then, we ask DALL·E to complete the remaining part of the image given this contextual information. We do this repeatedly, each time rotating the hat a few more degrees, and find that we are able to recover smooth animations of several well-known figures, with each frame respecting the precise specification of angle and ambient lighting.


DALL·E appears to be able to apply some types of optical distortions to scenes, as we see with the options “fisheye lens view” and “a spherical panorama.” This motivated us to explore its ability to generate reflections.

Text and image prompt: a plain white cube looking at its own reflection in a mirror. a plain white cube gazing at itself in a mirror.
[AI-generated images]
Similar to what was done before, we prompt DALL·E to complete the bottom-right corners of a sequence of frames, each of which contains a mirror and reflective floor. While the reflection in the mirror usually resembles the object outside of it, it often does not render the reflection in a physically correct way. By contrast, the reflection of an object drawn on a reflective floor is typically more plausible.


Visualizing internal and external structure

The samples from the “extreme close-up view” and “x-ray” style led us to further explore DALL·E’s ability to render internal structure with cross-sectional views, and external structure with macro photographs.

Text prompt: a cross-section view of a walnut
[AI-generated images]
We find that DALL·E is able to draw the interiors of several different kinds of objects.

Text prompt: a macro photograph of brain coral
[AI-generated images]
We find that DALL·E is able to draw the fine-grained external details of several different kinds of objects. These details are only apparent when the object is viewed up close.


Inferring contextual details

The task of translating text to images is underspecified: a single caption generally corresponds to an infinitude of plausible images, so the image is not uniquely determined. For instance, consider the caption “a painting of a capybara sitting on a field at sunrise.” Depending on the orientation of the capybara, it may be necessary to draw a shadow, though this detail is never mentioned explicitly. We explore DALL·E’s ability to resolve underspecification in three cases: changing style, setting, and time; drawing the same object in a variety of different situations; and generating an image of an object with specific text written on it.

Text prompt: a painting of a capybara sitting in a field at sunrise
[AI-generated images]
We find that DALL·E is able to render the same scene in a variety of different styles, and can adapt the lighting, shadows, and environment based on the time of day or season.

Text prompt: a stained glass window with an image of a blue strawberry
[AI-generated images]
We find that DALL·E is able to flexibly adapt the representation of the object based on the medium on which it is being drawn. For “a mural,” “a soda can,” and “a teacup,” DALL·E must change how it draws the object based on the angle and curvature of the drawing surface. For “a stained glass window” and “a neon sign,” it must alter the appearance of the object from how it usually appears.

Text prompt: a store front that has the word ‘openai’ written on it. a store front that has the word ‘openai’ written on it. a store front that has the word ‘openai’ written on it. ‘openai’ store front.
[AI-generated images]
We find that DALL·E is sometimes able to render text and adapt the writing style to the context in which it appears. For example, “a bag of chips” and “a license plate” each requires different types of fonts, and “a neon sign” and “written in the sky” require the appearance of the letters to be changed.

Generally, the longer the string that DALL·E is prompted to write, the lower the success rate. We find that the success rate improves when parts of the caption are repeated. Additionally, the success rate sometimes improves as the sampling temperature for the image is decreased, although the samples become simpler and less realistic.


With varying degrees of reliability, DALL·E provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.

Unlike a 3D rendering engine, whose inputs must be specified unambiguously and in complete detail, DALL·E is often able to “fill in the blanks” when the caption implies that the image must contain a certain detail that is not explicitly stated.

Applications of preceding capabilities

Next, we explore the use of the preceding capabilities for fashion and interior design.

Text and image prompt: a male mannequin dressed in an orange and black flannel shirt
[AI-generated images]
We explore DALL·E’s ability to render male mannequins in a variety of different outfits. When prompted with two colors, e.g., “an orange and white bomber jacket” and “an orange and black turtleneck sweater,” DALL·E often exhibits a range of possibilities for how both colors can be used for the same article of clothing.

DALL·E also seems to occasionally confuse less common colors with other neighboring shades. For example, when prompted to draw clothes in “navy,” DALL·E sometimes uses lighter shades of blue, or shades very close to black. Similarly, DALL·E sometimes confuses “olive” with shades of brown or brighter shades of green.

Text and image prompt: a female mannequin dressed in a black leather jacket and gold pleated skirt
[AI-generated images]
We explore DALL·E’s ability to render female mannequins in a variety of different outfits. We find that DALL·E is able to portray unique textures such as the sheen of a “black leather jacket” and “gold” skirts and leggings. As before, we see that DALL·E occasionally confuses less common colors, such as “navy” and “olive,” with other neighboring shades.

Text and image prompt: a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace.
[AI-generated images]
We explore DALL·E’s ability to generate images of rooms with several details specified. We find that it can generate paintings of a wide range of different subjects, including real-world locations such as “the colosseum” and fictional characters like “yoda.” For each subject, DALL·E exhibits a variety of interpretations. While the painting is almost always present in the scene, DALL·E sometimes fails to draw the fireplace or the correct number of armchairs.

Text and image prompt: a loft bedroom with a white bed next to a nightstand. there is a fish tank beside the bed.
[AI-generated images]
We explore DALL·E’s ability to generate bedrooms with several details specified. Despite the fact that we do not tell DALL·E what should go on top of the nightstand or shelf beside the bed, we find that it sometimes decides to place the other specified object on top. As before, we see that it often fails to draw one or more of the specified objects.


Combining unrelated concepts

The compositional nature of language allows us to put together concepts to describe both real and imaginary things. We find that DALL·E also has the ability to combine disparate ideas to synthesize objects, some of which are unlikely to exist in the real world. We explore this ability in two instances: transferring qualities from various concepts to animals, and designing products by taking inspiration from unrelated concepts.

Text prompt: a snail made of harp. a snail with the texture of a harp.
[AI-generated images]
We find that DALL·E can generate animals synthesized from a variety of concepts, including musical instruments, foods, and household items. While not always successful, we find that DALL·E sometimes takes the forms of the two objects into consideration when determining how to combine them. For example, when prompted to draw “a snail made of harp,” it sometimes relates the pillar of the harp to the spiral of the snail’s shell.

In a previous section, we saw that as more objects are introduced into the scene, DALL·E is liable to confuse the associations between the objects and their specified attributes. Here, we see a different sort of failure mode: sometimes, rather than binding some attribute of the specified concept (say, “a faucet”) to the animal (say, “a snail”), DALL·E just draws the two as separate items.

Text prompt: an armchair in the shape of an avocado. an armchair imitating an avocado.
[AI-generated images]
In the preceding visual, we explored DALL·E’s ability to generate fantastical objects by combining two unrelated ideas. Here, we explore its ability to take inspiration from an unrelated idea while respecting the form of the thing being designed, ideally producing an object that appears to be practically functional. We found that prompting DALL·E with the phrases “in the shape of,” “in the form of,” and “in the style of” gives it the ability to do this.

When generating some of these objects, such as “an armchair in the shape of an avocado”, DALL·E appears to relate the shape of a half avocado to the back of the chair, and the pit of the avocado to the cushion. We find that DALL·E is susceptible to the same kinds of mistakes mentioned in the previous visual.


Animal illustrations

In the previous section, we explored DALL·E’s ability to combine unrelated concepts when generating images of real-world objects. Here, we explore this ability in the context of art, for three kinds of illustrations: anthropomorphized versions of animals and objects, animal chimeras, and emojis.

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog
[AI-generated images]
We find that DALL·E is sometimes able to transfer some human activities and articles of clothing to animals and inanimate objects, such as food items. We include “pikachu” and “wielding a blue lightsaber” to explore DALL·E’s ability to incorporate popular media.

We find it interesting how DALL·E adapts human body parts onto animals. For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL·E often draws the kerchief, hands, and feet in plausible locations.

Text prompt: a professional high quality illustration of a giraffe turtle chimera. a giraffe imitating a turtle. a giraffe made of turtle.
[AI-generated images]
We find that DALL·E is sometimes able to combine distinct animals in plausible ways. We include “pikachu” to explore DALL·E’s ability to incorporate knowledge of popular media, and “robot” to explore its ability to generate animal cyborgs. Generally, the features of the second animal mentioned in the caption tend to be dominant.

We also find that inserting the phrase “professional high quality” before “illustration” and “emoji” sometimes improves the quality and consistency of the results.

Text prompt: a professional high quality emoji of a lovestruck cup of boba
[AI-generated images]
We find that DALL·E is sometimes able to transfer some emojis to animals and inanimate objects, such as food items. As in the preceding visual, we find that inserting the phrase “professional high quality” before “emoji” sometimes improves the quality and consistency of the results.


Zero-shot visual reasoning

GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way.

Text and image prompt: the exact same cat on the top as a sketch on the bottom
[AI-generated images]
We find that DALL·E is able to apply several kinds of image transformations to photos of animals, with varying degrees of reliability. The most straightforward ones, such as “photo colored pink” and “photo reflected upside-down,” also tend to be the most reliable, although the photo is often not copied or reflected exactly. The transformation “animal in extreme close-up view” requires DALL·E to recognize the breed of the animal in the photo, and render it up close with the appropriate details. This works less reliably, and for several of the photos, DALL·E only generates plausible completions in one or two instances.

Other transformations, such as “animal with sunglasses” and “animal wearing a bow tie,” require placing the accessory on the correct part of the animal’s body. Those that only change the color of the animal, such as “animal colored pink,” are less reliable, but show that DALL·E is sometimes capable of segmenting the animal from the background. Finally, the transformations “a sketch of the animal” and “a cell phone case with the animal” explore the use of this capability for illustrations and product design.

Text and image prompt: the exact same teapot on the top with ’gpt’ written on it on the bottom
[AI-generated images]
We find that DALL·E is able to apply several different kinds of image transformations to photos of teapots, with varying degrees of reliability. Aside from being able to modify the color of the teapot (e.g., “colored blue”) or its pattern (e.g., “with stripes”), DALL·E can also render text (e.g., “with ‘gpt’ written on it”) and map the letters onto the curved surface of the teapot in a plausible way. With much less reliability, it can also draw the teapot in a smaller size (for the “tiny” option) and in a broken state (for the “broken” option).


We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E’s aptitude for analogical reasoning problems by testing it on Raven’s progressive matrices, a visual IQ test that saw widespread use in the 20th century.

Text and example image prompt: a sequence of geometric shapes.
[AI-generated images]
Rather than treating the IQ test as a multiple-choice problem, as originally intended, we ask DALL·E to complete the bottom-right corner of each image using argmax sampling, and consider its completion to be correct if it is a close visual match to the original.

DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning, such as those in sets B and C. It is sometimes able to solve matrices that involve recognizing permutations and applying boolean operations, such as those in set D. The instances in set E tend to be the most difficult, and DALL·E gets almost none of them correct.

For each of the sets, we measure DALL·E’s performance on both the original images, and the images with the colors inverted. The inversion of colors should pose no additional difficulty for a human, yet does generally impair DALL·E’s performance, suggesting its capabilities may be brittle in unexpected ways.


Geographic knowledge

We find that DALL·E has learned about geographic facts, landmarks, and neighborhoods. Its knowledge of these concepts is surprisingly precise in some ways and flawed in others.

Text prompt: a photo of the food of china
[AI-generated images]
We test DALL·E’s understanding of simple geographical facts, such as country flags, cuisines, and local wildlife. While DALL·E successfully answers many of these queries, such as those involving national flags, it often reflects superficial stereotypes for choices like “food” and “wildlife,” as opposed to representing the full diversity encountered in the real world.

Text prompt: a photo of alamo square, san francisco, from a street at night
[AI-generated images]
We find that DALL·E is sometimes capable of rendering semblances of certain locations in San Francisco. For locations familiar to the authors, such as San Francisco, the generated images evoke a sense of déjà vu—eerie simulacra of streets, sidewalks, and cafes that remind us of very specific locations that do not exist.

Text and image prompts: a photo of san francisco’s golden gate bridge
[AI-generated images]
We can also prompt DALL·E to draw famous landmarks. In fact, we can even dictate when the photo was taken by specifying the first few rows of the sky. When the sky is dark, for example, DALL·E recognizes it is night, and turns on the lights in the buildings.


Temporal knowledge

In addition to exploring DALL·E’s knowledge of concepts that vary over space, we also explore its knowledge of concepts that vary over time.

Text and image prompt: a photo of a phone from the 20s
[AI-generated images]
We find that DALL·E has learned about basic stereotypical trends in design and technology over the decades. Technological artifacts appear to go through periods of explosion of change, dramatically shifting for a decade or two, then changing more incrementally, becoming refined and streamlined.


Summary of approach and prior work

DALL·E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. We plan to provide more details about the architecture and training procedure in an upcoming paper.

Text-to-image synthesis has been an active area of research since the pioneering work of Reed et al., whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN and StackGAN++ use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective. This is interesting to compare to our reranking with CLIP, which is done offline. Other work incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et al. and Cho et al. explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.

Similar to the rejection sampling used in VQ-VAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search, and can have a dramatic impact on sample quality.
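A minimal sketch of this reranking step, with generate_image and clip_score standing in for the DALL·E sampler and a CLIP-style scoring model (both are assumptions for illustration):

```python
def rerank_samples(caption, generate_image, clip_score, n_samples=512, n_keep=32):
    # Draw many candidate images, score each (caption, image) pair, keep the best.
    candidates = [generate_image(caption) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda image: clip_score(caption, image), reverse=True)
    return ranked[:n_keep]
```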

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog
[AI-generated images]
Reranking the samples from DALL·E using CLIP can dramatically improve consistency and quality of the samples.


Organizational Update from OpenAI

It’s been a year of dramatic change and growth at OpenAI. In May, we introduced GPT-3—the most powerful language model to date—and soon afterward launched our first commercial product, an API to safely access artificial intelligence models using simple, natural-language prompts. We’re proud of these and other research breakthroughs by our team, all made as part of our mission to achieve general-purpose AI that is safe and reliable, and which benefits all humanity.

Today we’re announcing that Dario Amodei, VP of Research, is leaving OpenAI after nearly five years with the company. Dario has made tremendous contributions to our research in that time, collaborating with the team to build GPT-2 and GPT-3, and working with Ilya Sutskever as co-leader in setting the direction for our research.

Dario has always shared our goal of responsible AI. He and a handful of OpenAI colleagues are planning a new project, which they tell us will probably focus less on product development and more on research. We support their move and we’re grateful for the time we’ve spent working together.

“We are incredibly thankful to Dario for his contributions over the past four and a half years. We wish him and his co-founders all the best in their new project, and we look forward to a collaborative relationship with them for years to come,” said OpenAI chief executive Sam Altman.

When his departure was announced at an employee meeting earlier this month, Dario told coworkers, “I want to thank Sam and thank everyone. I’m really proud of the work we’ve done together. I want to wish everyone the best, and I know that OpenAI will do really great things in the years ahead. We share the same goal of safe artificial general intelligence to benefit humanity, so it’s incumbent on all of us in this space to work together to make sure things go well.”

OpenAI is also making a few organizational changes to put greater focus on the integration of research, product, and safety. Mira Murati is taking on new responsibilities as senior vice president of Research, Product, and Partnerships, reflecting her strong leadership during our API rollout and across the company.

Sam added, “OpenAI’s mission is to thoughtfully and responsibly develop general-purpose artificial intelligence, and as we enter the new year our focus on research—especially in the area of safety—has never been stronger. Making AI safer is a company-wide priority, and a key part of Mira’s new role.”

“While GPT-3 and the other AI models we’ve developed are still nascent, we’re beginning to get a better understanding of their behavior, how to make them safer and how to align them with human preferences,” said Mira of her new role. “Our approach is to carefully use these models to improve products that people already use in their everyday lives, as well as create new products—all of which allows us to gain experience with safe deployment. At the same time, we will continue to conduct and publish research on the impact and challenges of AI, independent of our immediate product goals.”

Our team will be back in January with new research and updates on our API’s progress.

Happy New Year from all of us at OpenAI!


OpenAI Licenses GPT-3 Technology to Microsoft


OpenAI released its first commercial product back in June: an API for developers to access advanced technologies for building new applications and services. The API features a powerful general purpose language model, GPT-3, and has received tens of thousands of applications to date.

In addition to offering GPT-3 and future models via the OpenAI API, and as part of a multiyear partnership announced last year, OpenAI has agreed to license GPT-3 to Microsoft for their own products and services. The deal has no impact on continued access to the GPT-3 model through OpenAI’s API, and existing and future users of it will continue building applications with our API as usual.

Unlike most AI systems which are designed for one use-case, OpenAI’s API today provides a general-purpose “text in, text out” interface, allowing users to try it on virtually any English language task. GPT-3 is the most powerful model behind the API today, with 175 billion parameters. There are several other models available via the API today, as well as other technologies and filters that allow developers to customize GPT-3 and other language models for their own use.

Today, the API remains in a limited beta as OpenAI and academic partners test and assess the capabilities and limitations of these powerful language models. To learn more about the API or to sign up for the beta, please visit beta.openai.com.


Learning to Summarize with Human Feedback


We’ve applied reinforcement learning from human feedback to train language models that are better at summarization. Our models generate summaries that are better than summaries from 10x larger models trained only with supervised learning. Even though we train our models on the Reddit TL;DR dataset, the same models transfer to generate good summaries of CNN/DailyMail news articles without any further fine-tuning. Our techniques are not specific to summarization; in the long run, our goal is to make aligning AI systems with human preferences a central component of AI research and deployment in many domains.
Read paper


View code


View samples
Human feedback models outperform much larger supervised models and reference summaries on TL;DR
Figure 1: The performance of various training procedures for different model sizes. Model performance is measured by how often summaries from that model are preferred to the human-written reference summaries. Our pre-trained models are early versions of GPT-3, our supervised baselines were fine-tuned to predict 117K human-written TL;DRs, and our human feedback models are additionally fine-tuned on a dataset of about 65K summary comparisons.



Large-scale language models are becoming increasingly capable on NLP tasks. These models are usually trained with the objective of next word prediction on a dataset of human-written text. But this objective doesn’t capture exactly what we want; usually, we don’t want our models to imitate humans, we want them to give high-quality answers. This mismatch is clear when a model is trained to imitate low-quality human-written text, but it can also happen in more subtle ways. For example, a model trained to predict what a human would say might make up facts when it is unsure, or generate sentences reflecting harmful social bias, both of which are failure modes that have been well documented.

As part of our work on safety, we want to develop techniques that align our models’ objectives with the end behavior we really care about. As our models become more powerful, we believe aligning them with our goals will be very important to ensure they are beneficial for humans. In the short term, we wanted to test if human feedback techniques could help our models improve performance on useful tasks.

We focused on English text summarization, as it’s a challenging problem where the notion of what makes a “good summary” is difficult to capture without human input. We apply our method primarily to an existing dataset of posts submitted to the social network Reddit[1] together with human-written “TL;DRs,” which are short summaries written by the original poster.

We first train a reward model via supervised learning to predict which summaries humans will prefer.[2] We then fine-tune a language model with reinforcement learning (RL) to produce summaries that score highly according to that reward model. We find that this significantly improves the quality of the summaries, as evaluated by humans, even on datasets very different from the one used for fine-tuning.

Our approach follows directly from our previous work on learning from human feedback. There has also been other work on using human feedback to train summarization models. We push the technique further by scaling to larger models, collecting more feedback data, closely monitoring researcher-labeler agreement, and providing frequent feedback to labelers. Human feedback has also been used to train models in several other domains, such as dialogue, semantic parsing, translation, story and review generation, evidence extraction, and more traditional RL tasks.

Results


[Interactive example: a post from Reddit with its human-written reference summary and summaries from the human feedback 6B, supervised 6B, and pre-trained 6B models]

We evaluated several different summarization models—some pre-trained on a broad distribution of text from the internet, some fine-tuned via supervised learning to predict TL;DRs, and some fine-tuned using human feedback.[3] To evaluate each model, we had it summarize posts from the validation set and asked humans to compare their summaries to the human-written TL;DR. The results are shown in Figure 1.

We found that RL fine-tuning with human feedback had a very large effect on quality compared to both supervised fine-tuning and scaling up model size. In particular, our 1.3 billion parameter (1.3B) model trained with human feedback outperforms our 12B model trained only with supervised learning. Summaries from both our 1.3B and 6.7B human feedback models are preferred by our labelers to the original human-written TL;DRs in the dataset.[4]

People make different trade-offs when writing summaries, including between conciseness and coverage of the original text; depending on the purpose of the summary, different summary lengths might be preferred. Our labelers tended to prefer longer summaries, so our models adapted to that preference and converged to the longest allowable length. Controlling for length reduced human preferences for our 6.7B model’s summaries from 70% to 65%, explaining a minority of our gains.[5]

Transfer results

Human feedback models trained on Reddit transfer to generate excellent summaries of CNN/DM news articles without further training


The performance (human-rated summary quality on a 1–7 scale) of various training procedures and model sizes.[6] Note that our human feedback models generate summaries that are significantly shorter than summaries from models trained on CNN/DM.

At a given summary length, our 6.7B human feedback model trained on Reddit performs almost as well as a fine-tuned 11B T5 model, despite not being re-trained on CNN/DM.


[Interactive example: an article from CNN/DM with its human-written reference summary and summaries from the human feedback 6B (transfer), supervised 6B (transfer), pre-trained 6B, T5 11B (fine-tuned on CNN/DM), and supervised 6B (fine-tuned on CNN/DM) models]

To test our models’ generalization, we also applied them directly to the popular CNN/DM news dataset. These articles are more than twice as long as Reddit posts and are written in a very different style. Our models have seen news articles during pre-training, but all of our human data and RL fine-tuning was on the Reddit TL;DR dataset.

This time we evaluated our models by asking our labelers to rate them on a scale from 1–7.[7] We discovered that our human feedback models transfer to generate excellent short summaries of news articles without any training. When controlling for summary length, our 6.7B human feedback model generates summaries that are rated higher than the CNN/DM reference summaries written by humans. This suggests that our human feedback models have learned something more general about how to summarize text, and are not specific to Reddit posts.

Approach

A diagram of our method, which is similar to the one used in our previous work.

Our core method consists of four steps: training an initial summarization model, assembling a dataset of human comparisons between summaries, training a reward model to predict the human-preferred summary, and then fine-tuning our summarization models with RL to get a high reward.

We trained several supervised baselines by starting from GPT-style transformer models trained on text from the Internet, and fine-tuning them to predict the human-written TL;DR via supervised learning. We mainly use models with 1.3 and 6.7 billion parameters. As a sanity check, we confirmed that this training procedure led to competitive results[8] on the CNN/DM dataset.
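A minimal sketch of this kind of supervised fine-tuning step, assuming the post and its TL;DR have already been tokenized and concatenated; the language model is a placeholder, and restricting the loss to the summary tokens is an assumption of this sketch rather than a confirmed detail.

```python
import torch
import torch.nn.functional as F

def supervised_finetune_loss(lm, post_tokens, summary_tokens):
    # post_tokens: [B, P], summary_tokens: [B, S] token ids.
    stream = torch.cat([post_tokens, summary_tokens], dim=1)
    logits = lm(stream[:, :-1])                      # next-token logits
    targets = stream[:, 1:].clone()
    targets[:, : post_tokens.shape[1] - 1] = -100    # ignore loss on the post itself
    return F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),
        targets.reshape(-1),
        ignore_index=-100,
    )
```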

We then collected a dataset of human quality judgments. For each judgment, a human compares two summaries of a given post and picks the one they think is better.[9] We use this data to train a reward model that maps a (post, summary) pair to a reward r. The reward model is trained to predict which summary a human will prefer, using the rewards as logits.
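The comparison-based reward model objective described above can be sketched as follows, with reward_model standing in for the learned (post, summary) → scalar reward function:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, post, preferred_summary, other_summary):
    r_preferred = reward_model(post, preferred_summary)  # scalar reward
    r_other = reward_model(post, other_summary)
    # Treating the rewards as logits, the probability that the preferred
    # summary wins is sigmoid(r_preferred - r_other); minimize its negative log.
    return -F.logsigmoid(r_preferred - r_other).mean()
```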

Finally, we optimize the policy against the reward model using RL. We use PPO with 1 million episodes in total, where each episode consists of the policy summarizing a single article and then receiving a reward r. We include a KL penalty that incentivizes the policy to remain close to the supervised initialization.
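A rough sketch of the per-episode reward used during RL fine-tuning, combining the reward model’s score with a penalty for drifting from the supervised initialization; the coefficient and the single-sample KL estimate are simplifications, not the exact implementation.

```python
def rl_episode_reward(reward_model, post, summary,
                      policy_logprob, supervised_logprob, kl_coef=0.05):
    # policy_logprob / supervised_logprob: log-probabilities of the sampled
    # summary under the current policy and the supervised initialization.
    kl_estimate = policy_logprob - supervised_logprob   # single-sample KL estimate
    return reward_model(post, summary) - kl_coef * kl_estimate  # kl_coef is illustrative
```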

Collecting data from humans

Any training procedure that uses human feedback is directly influenced by the actual humans labeling the data. In our previous work on fine-tuning language models from human preferences, our labelers often gave high ratings to summaries we thought were average, which was reflected in the quality of our trained models.

In response, in this project we invested heavily in ensuring high data quality. We hired about 80 contractors using third-party vendor sites,[10] and paid them an hourly wage regardless of the number of summaries evaluated.[11] Hiring contractors rather than relying on crowdsourcing websites allowed us to maintain a hands-on relationship with labelers: we created an onboarding process, developed a website with a customizable labeler interface, answered questions in a shared chat room, and had one-on-one video calls with labelers. We also made sure to clearly communicate our definition of summary quality, after spending significant time reading summaries ourselves, and we carefully monitored agreement rates between us and labelers throughout the project.

Optimizing the reward model

Optimizing our reward model eventually leads to sample quality degradation

Starting from the 1.3B supervised baseline (point 0 on the x-axis), we use RL to optimize the policy against the reward model, which results in policies with different “distances” from the baseline (x-axis, measured using the KL divergence from the supervised baseline). Optimizing against the reward model initially improves summaries according to humans, but eventually overfits, giving worse summaries. This graph uses an older version of our reward model, which is why the peak of the curve is below 0.5.


Post from Reddit (r/AskReddit)
I’m a 28yo man, and I would like to get into gymnastics for the first time.

Title said just about all of it. I’m 28, very athletic (bike/ surf/ snowboard) and I have always wanted to do gymnastics.

I like to do flips and spins off bridges and on my snowboard, and it seems to me gymnastics would be a great way to do those movements I like, in a controlled environment. The end goal of this is that it would be fun, and make me better at these movements in real life.

But is it too late for me? Should 28 year old guys such as myself be content with just watching those parkour guys on youtube? Or can I learn the ways of the gymnastic jedi? BTW, I live in San Jose CA.


KL = 0
I want to do gymnastics, but I’m 28 yrs old. Is it too late for me to be a gymnaste?!

KL = 9
28yo guy would like to get into gymnastics for the first time. Is it too late for me given I live in San Jose CA?

KL = 260
28yo dude stubbornly postponees start pursuing gymnastics hobby citing logistics reasons despite obvious interest??? negatively effecting long term fitness progress both personally and academically thoght wise? want change this dumbass shitty ass policy pls

Optimizing against our reward model is supposed to make our policy align with human preferences. But the reward model is only a proxy for human preferences, as it only sees a small amount of comparison data from a narrow distribution of summaries. While the reward model performs well on the kinds of summaries it was trained on, we wanted to know how much we could optimize against it until it started giving useless evaluations.

We trained policies at different “optimization strengths” against the reward model, and asked our labelers to evaluate the summaries from these models. We did this by varying the KL coefficient, which trades off the incentive to get a higher reward against the incentive to remain close to the initial supervised policy. We found that the best samples had roughly the same predicted reward as the 99th percentile of reference summaries from the dataset. Eventually, optimizing against the reward model makes things worse.

Limitations

If we have a well-defined notion of the desired behavior for a model, our method of training from human feedback allows us to optimize for this behavior. However, this is not a method for determining what the desired model behavior should be. Deciding what makes a good summary is fairly straightforward, but doing this for tasks with more complex objectives, where different humans might disagree on the correct model behavior, will require significant care. In these cases, it is likely not appropriate to use researcher labels as the “gold standard”; rather, individuals from groups that will be impacted by the technology should be included in the process to define “good” behavior, and hired as labelers to reinforce this behavior in the model.

We trained on the Reddit TL;DR dataset because the summarization task is significantly more challenging than on CNN/DM. However, since the dataset consists of user-submitted posts with minimal moderation, the posts sometimes contain content that is offensive or reflects harmful social biases. This means our models can generate biased or offensive summaries, as they have been trained to summarize such content.

Part of our success involves scaling up our reward model and policy size. This requires a large amount of compute, which is not available to all researchers: notably, fine-tuning our 6.7B model with RL required about 320 GPU-days. However, since smaller models trained with human feedback can exceed the performance of much larger models, our procedure is more cost-effective than simply scaling up for training high-quality models on specific tasks.

Though we outperform the human-written reference summaries on TL;DR, our models have likely not reached human-level performance, as the reference summary baselines for TL;DR and CNN/DM are not the highest possible quality. When evaluating our model’s TL;DR summaries on a 7-point scale along several axes of quality (accuracy, coverage, coherence, and overall), labelers find our models can still generate inaccurate summaries, and give a perfect overall score 45% of the time.[12] For cost reasons, we also do not directly compare to using a similar budget to collect high-quality demonstrations, and training on those using standard supervised fine-tuning.

Future directions

We’re interested in scaling human feedback to tasks where humans can’t easily evaluate the quality of model outputs. For example, we might want our models to answer questions that would take humans a lot of research to verify; getting enough human evaluations to train our models this way would take a long time. One approach to tackle this problem is to give humans tools to help them evaluate more quickly and accurately. If these tools use ML, we can also improve them with human feedback, which could allow humans to accurately evaluate model outputs for increasingly complicated tasks.

In addition to tackling harder problems, we’re also exploring different types of feedback beyond binary comparisons: we can ask humans to provide demonstrations, edit model outputs to make them better, or give explanations as to why one model output is better than another. We’d like to figure out which kinds of feedback are most effective for training models that are aligned with human preferences.

If you are interested in working on these research questions, we’re hiring!

OpenAI Scholars Spring 2020: Final Projects

Our third class of OpenAI Scholars presented their final projects at a virtual Demo Day, showcasing the research they completed over the past five months. These projects investigated problems such as analyzing how GPT-2 represents grammar, measuring the interpretability of models trained on Coinrun, and predicting epileptic seizures using brain recordings. More information about the next class of Scholars and how to apply will be announced this fall.

The OpenAI Scholars program provides stipends and mentorship to individuals from underrepresented groups to study deep learning and open-source a project.

Our Scholars have demonstrated core technical skills across various expert domains, along with the self-motivation that is critical for a self-directed program like this one. They each entered the field of machine learning as relative newcomers, and we hope their progress shows how accessible machine learning is.

Demo Day introductions by Sam Altman and Greg Brockman

Learn more about our Scholars program.

Alethea Power

Looking for Grammar in All The Right Places

I’m fascinated by neural network interpretability. Understanding how networks of various architectures represent information can help us build simpler and more efficient networks, as well as predict how the networks we’ve built will behave, and perhaps even give us some insight into how human beings think. Along these lines, I analyzed how GPT-2 represents English grammar, and found smaller sub-networks that seem to correspond to various grammatical structures. I will present my methodology and results.

Next, I want to work on understanding how neural networks represent information, and use that understanding to better predict how deep learning systems behave. I believe this work will make such systems safer and more beneficial to humanity, as well as making them simpler, faster, and more computationally efficient.

Blog

Andre Carerra

Semantic Parsing English to GraphQL

My scholars program project is semantic parsing English-to-GraphQL. Given an English prompt such as “How many employees do we have?”, find a corresponding GraphQL query to return the information. The project involved creating a dataset, training models, and creating an interaction tool to see results.

I wanted to have a say in how AI is shaped—the Scholars program has been a great opportunity to learn and participate.

Blog

Cathy Yeh

Long Term Credit Assignment with Temporal Reward Transport

Standard reinforcement learning algorithms struggle with poor sample efficiency in the presence of sparse rewards with long temporal delays between action and effect. To address the long term credit assignment problem, we use “temporal reward transport” (TRT) to augment the immediate rewards of significant state-action pairs with rewards from the distant future, using an attention mechanism to identify candidates for TRT. A series of gridworld experiments show clear improvements in learning when TRT is used in conjunction with a standard advantage actor critic algorithm.
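As a purely hypothetical sketch of the general idea (not this project's actual implementation), one could transport reward from the distant future back to the most-attended state-action pairs roughly like this:

```python
# Hypothetical illustration of "temporal reward transport": augment the immediate rewards
# of a few highly attended timesteps with reward collected far in the future. Shapes,
# the number of transported steps, and the attention scores are all placeholders.
import numpy as np

def temporal_reward_transport(rewards: np.ndarray, attention: np.ndarray,
                              top_k: int = 3, horizon: int = 10) -> np.ndarray:
    augmented = rewards.astype(float)
    significant = np.argsort(attention)[-top_k:]        # most-attended state-action pairs
    for t in significant:
        augmented[t] += rewards[t + horizon:].sum()     # add reward from the distant future
    return augmented
```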

I appreciate that this program gave me the freedom to learn deeply and flex my creativity.

Blog

Jorge Orbay

Quantifying Interpretability of Models Trained on Coinrun

This project’s purpose is to create a scalar that measures the interpretability of an A2C model trained on Procgen’s Coinrun. The scalar is generated using a combination of attribution on the model and masks of Coinrun’s assets. The scalar is used to test the validity of the diversity hypothesis.

This program, and specifically my mentor, has fostered a self-confidence in me to dive into a field I don’t understand and break down problems until I can solve them. I’m hoping to take the self-confidence I’ve learned from this program to continue breaking down problems in and with AI.

Blog

Kamal Ndousse

Social Learning in Independent Multi-Agent Reinforcement Learning

My project has explored the social transfer of expertise among completely independent RL agents trained in shared environments. The motivating question is whether novice agents can learn to mimic expert behavior to solve hard-exploration tasks that they couldn’t master in isolation. I’ll discuss my observations as well as the environments I developed to experiment with social skill transfer.

I joined the Scholars program in order to learn from the brilliant folks at OpenAI and to immerse myself in AI research. I’m grateful to have had the opportunity to explore state-of-the-art research with the support of such talented researchers (special thanks to my mentor Natasha Jaques!).

Blog


Kata Slama

Towards Epileptic Seizure Prediction with Deep Network

I have been working on a project to predict epileptic seizures using brain recordings. I framed it as an image classification problem based on the spectrogram representation of the brain data. My most successful model so far has been a ResNet18. In my post-Scholars life, I plan to continue working on this project, and make my way to interpretability of spectrogram classification networks.

I wanted to learn how to apply deep learning for solving scientific and real-world problems. The OpenAI Scholars program was this magical opportunity to get started by learning from the very best minds in the field.

Blog


Pamela Mishkin

Universal Adversarial Perturbations and Language Models

Adversarial perturbations are well-understood for images but less so for language. My presentation will review the literature on how universal adversarial examples can inform understanding of generative models, replicating results generating universal adversarial triggers for GPT-2 and for attacking NLI models.

This program strengthened my technical basis in machine learning and helped me understand how AI researchers understand policy implications of their work.

Blog

Diversity is core to AI having a positive effect on the world—it’s necessary to ensure the advanced AI systems in the future are built to benefit everyone.

If you’re excited to begin your own journey into ML, check out some of our educational materials. More information about the next class of scholars and how to apply will be announced this fall. Stay tuned!

Huge thanks to Microsoft for providing Azure compute credits to scholars, to our mentors for their time and commitment, and to all the supporters that made this program possible.


Image GPT

We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.

Code · ICML 2020 Paper (v1) · Paper (v2)

Introduction

Unsupervised and self-supervised learning, or learning without human-labeled data, is a longstanding challenge of machine learning. Recently, it has seen incredible success in language, as transformer models like BERT, GPT-2, RoBERTa, T5, and other variants have achieved top performance on a wide array of language tasks. However, the same broad class of models has not been successful in producing strong features for image classification. Our work aims to understand and bridge this gap.

Transformer models like BERT and GPT-2 are domain agnostic, meaning that they can be directly applied to 1-D sequences of any form. When we train GPT-2 on images unrolled into long sequences of pixels, which we call iGPT, we find that the model appears to understand 2-D image characteristics such as object appearance and category. This is evidenced by the diverse range of coherent image samples it generates, even without the guidance of human-provided labels. As further proof, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy[1] on ImageNet.

Evaluation | Dataset | Our Result | Best non-iGPT Result
Linear probe (logistic regression on learned features) | CIFAR-10 | 96.3 (iGPT-L 32×32 w/ 1536 features) | 95.3 (SimCLR w/ 8192 features)
Linear probe | CIFAR-100 | 82.8 (iGPT-L 32×32 w/ 1536 features) | 80.2 (SimCLR w/ 8192 features)
Linear probe | STL-10 | 95.5 (iGPT-L 32×32 w/ 1536 features) | 94.2 (AMDIM w/ 8192 features)
Linear probe | ImageNet | 72.0 (iGPT-XL[a] 64×64 w/ 15360 features) | 76.5 (SimCLR w/ 8192 features)
Full fine-tune | CIFAR-10 | 99.0 (iGPT-L 32×32, trained on ImageNet) | 99.0[b] (GPipe, trained on ImageNet)
Full fine-tune | ImageNet 32×32 | 66.3 (iGPT-L 32×32) | 70.2 (Isometric Nets)

  a. We only show ImageNet linear probe accuracy for iGPT-XL since other experiments did not finish before we needed to transition to different supercomputing facilities.
  b. BiT-L, trained on JFT (300M images with 18K classes), achieved a result of 99.3.

To highlight the potential of generative sequence modeling as a general purpose unsupervised learning algorithm, we deliberately use the same transformer architecture as GPT-2 in language. As a consequence, we require significantly more compute in order to produce features competitive with those from top unsupervised convolutional nets. However, our results suggest that when faced with a new domain where the correct model priors are unknown, a large GPT-2 can learn excellent features without the need for domain-specific architectural design choices.

Completions

[Image completions shown in the following categories: Favorites, Animals, Painted Landscapes, Sports, Architecture, ImageNet-R, Movie Posters, Famous Artworks, Popular Memes, Landscapes, Album Covers, Common English Words, US & State Flags, OpenAI Research Covers, OpenAI Pets, and OpenAI Cooking. Each panel shows the model input (a half-image), several completions, and the original image.]

Model-generated completions of human-provided half-images. We sample the remaining halves with temperature 1 and without tricks like beam search or nucleus sampling. While we showcase our favorite completions in the first panel, we do not cherry-pick images or completions in all following panels.

Samples


Model-generated image samples. We sample these images with temperature 1 and without tricks like beam search or nucleus sampling. All of our samples are shown, with no cherry-picking. Nearly all generated images contain clearly recognizable objects.




From language GPT to image GPT

In language, unsupervised learning algorithms that rely on word prediction (like GPT-2 and BERT) have been extremely successful, achieving top performance on a wide array of language tasks. One possible reason for this success is that instances of downstream language tasks appear naturally in text: questions are often followed by answers (which could help with question-answering) and passages are often followed by summaries (which could help with summarization). In contrast, sequences of pixels do not clearly contain labels for the images they belong to.

Even without this explicit supervision, there is still a reason why GPT-2 on images might work: a sufficiently large transformer trained on next pixel prediction might eventually learn to generate diverse[2] samples with clearly recognizable objects. Once it learns to do so, an idea known as “Analysis by Synthesis”[3] suggests that the model will also know about object categories. Many early generative models were motivated by this idea, and more recently, BigBiGAN was an example which produced encouraging samples and features. In our work, we first show that better generative models achieve stronger classification performance. Then, through optimizing GPT-2 for generative capabilities, we achieve top-level classification performance in many settings, providing further evidence for analysis by synthesis.

Towards general unsupervised learning

Generative sequence modeling is a universal unsupervised learning algorithm: since all data types can be represented as sequences of bytes, a transformer can be directly applied to any data type without additional engineering. Our work tests the power of this generality by directly applying the architecture used to train GPT-2 on natural language to image generation. We deliberately chose to forgo hand coding any image specific knowledge in the form of convolutions or techniques like relative attention, sparse attention, and 2-D position embeddings.
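As a minimal illustration of this generality, an image becomes a transformer-ready token sequence simply by reading its (quantized) pixels in raster order; the shapes below are illustrative.

```python
# Minimal sketch: flatten a 2-D grid of discrete pixel tokens into a 1-D sequence that a
# standard autoregressive transformer can model with next-token prediction.
import numpy as np

def image_to_sequence(palette_indices: np.ndarray) -> np.ndarray:
    # palette_indices: (H, W) array of discrete color tokens -> (H*W,) raster-order sequence
    return palette_indices.reshape(-1)

tokens = image_to_sequence(np.random.randint(0, 512, size=(32, 32)))  # illustrative 32x32 image
assert tokens.shape == (1024,)
```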

As a consequence of its generality, our method requires significantly more compute to achieve competitive performance in the unsupervised setting. Indeed, contrastive methods are still the most computationally efficient methods for producing high quality features from images. However, in showing that an unsupervised transformer model is competitive with the best unsupervised convolutional nets, we provide evidence that it is possible to trade off hand coded domain knowledge for compute. In new domains, where there isn’t much knowledge to hand code, scaling compute seems an appropriate technique to test.

Approach

We train iGPT-S, iGPT-M, and iGPT-L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT-XL[4], a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web. Due to the large computational cost of modeling long sequences with dense attention, we train at the low resolutions of 32×32, 48×48, and 64×64.

While it is tempting to work at even lower resolutions to further reduce compute cost, prior work has demonstrated that human performance on image classification begins to drop rapidly below these sizes. Instead, motivated by early color display palettes, we create our own 9-bit color palette to represent pixels. Using this palette yields an input sequence length 3 times shorter than the standard (R, G, B) palette, while still encoding color faithfully.
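The sketch below illustrates this kind of palette quantization; the palette here is random purely to keep the example runnable (in practice it is derived from the training data), but it shows why one token per pixel gives a sequence 3 times shorter than one token per (R, G, B) channel.

```python
# Hedged sketch of mapping each RGB pixel to the nearest entry of a 512-color (9-bit)
# palette, so every pixel becomes a single token instead of three channel values.
import numpy as np

def quantize_to_palette(image_rgb: np.ndarray, palette: np.ndarray) -> np.ndarray:
    # image_rgb: (H, W, 3) uint8; palette: (512, 3). Returns (H, W) palette indices.
    pixels = image_rgb.reshape(-1, 3).astype(np.float32)
    dists = ((pixels[:, None, :] - palette[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(image_rgb.shape[:2])

palette = np.random.randint(0, 256, size=(512, 3)).astype(np.float32)  # placeholder palette
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
indices = quantize_to_palette(image, palette)  # 1,024 tokens rather than 3,072 channel values
```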

Experimental results

There are two methods we use to assess model performance, both of which involve a downstream classification task. The first, which we refer to as a linear probe, uses the trained model to extract features[5] from the images in the downstream dataset, and then fits a logistic regression to the labels. The second method fine-tunes[6] the entire model on the downstream dataset.
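A minimal sketch of the linear-probe evaluation is below; random arrays stand in for the frozen features that the pre-trained model would produce.

```python
# Sketch of a linear probe: fit a logistic regression on frozen features extracted from a
# pre-trained model and report downstream classification accuracy. Features are random here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels) -> float:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

rng = np.random.default_rng(0)
accuracy = linear_probe(rng.normal(size=(200, 64)), rng.integers(0, 10, size=200),
                        rng.normal(size=(50, 64)), rng.integers(0, 10, size=50))
```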

Since next pixel prediction is not obviously relevant to image classification, features from the final layer may not be the most predictive of the object category. Our first result shows that feature quality is a sharply increasing, then mildly decreasing function of depth. This behavior suggests that a transformer generative model operates in two phases: in the first phase, each position gathers information from its surrounding context in order to build a contextualized image feature; in the second phase, this contextualized feature is used to solve the conditional next pixel prediction task. The observed two-stage performance of our linear probes is reminiscent of another unsupervised neural net, the bottleneck autoencoder, which is manually designed so that features in the middle are used.

Feature quality depends heavily on the layer we choose to evaluate. In contrast with supervised models, the best features for these generative models lie in the middle of the network.

Our next result establishes the link between generative performance and feature quality. We find that both increasing the scale of our models and training for more iterations result in better generative performance, which directly translates into better feature quality.

Each line tracks a model throughout generative pre-training: the dotted markers denote checkpoints at steps 131K, 262K, 524K, and 1000K. The positive slopes suggest a link between improved generative performance and improved feature quality. Larger models also produce better features than smaller models. iGPT-XL is not included because it was trained on a different dataset.

When we evaluate our features using linear probes on CIFAR-10, CIFAR-100, and STL-10, we outperform features from all supervised and unsupervised transfer algorithms. Our results are also compelling in the full fine-tuning setting.

Pre-trained on ImageNet
Evaluation | Model | Accuracy | ImageNet pre-training
CIFAR-10, linear probe | ResNet-152 | 94.0 | w/ labels
CIFAR-10, linear probe | SimCLR | 95.3 | w/o labels
CIFAR-10, linear probe | iGPT-L 32×32 | 96.3 | w/o labels
CIFAR-100, linear probe | ResNet-152 | 78.0 | w/ labels
CIFAR-100, linear probe | SimCLR | 80.2 | w/o labels
CIFAR-100, linear probe | iGPT-L 32×32 | 82.8 | w/o labels
STL-10, linear probe | AMDIM-L | 94.2 | w/o labels
STL-10, linear probe | iGPT-L 32×32 | 95.5 | w/o labels
CIFAR-10, fine-tune | AutoAugment | 98.5 | none (trained end-to-end on CIFAR)
CIFAR-10, fine-tune | SimCLR | 98.6 | w/o labels
CIFAR-10, fine-tune | GPipe | 99.0 | w/ labels
CIFAR-10, fine-tune | iGPT-L | 99.0 | w/o labels
CIFAR-100, fine-tune | iGPT-L | 88.5 | w/o labels
CIFAR-100, fine-tune | SimCLR | 89.0 | w/o labels
CIFAR-100, fine-tune | AutoAugment | 89.3 | none (trained end-to-end on CIFAR)
CIFAR-100, fine-tune | EfficientNet | 91.7 | w/ labels
A comparison of linear probe and fine-tune accuracies between our models and top performing models which utilize either unsupervised or supervised ImageNet transfer. We also include AutoAugment, the best performing model trained end-to-end on CIFAR.

Given the resurgence of interest in unsupervised and self-supervised learning on ImageNet, we also evaluate the performance of our models using linear probes on ImageNet. This is an especially difficult setting, as we do not train at the standard ImageNet input resolution. Nevertheless, a linear probe on the 1536 features from the best layer of iGPT-L trained on 48×48 images yields 65.2% top-1 accuracy, outperforming AlexNet.

Contrastive methods typically report their best results on 8192 features, so we would ideally evaluate iGPT with an embedding dimension of 8192 for comparison. However, training such a model is prohibitively expensive, so we instead concatenate features from multiple layers as an approximation. Unfortunately, our features tend to be correlated across layers, so we need more of them to be competitive. Taking 15360 features from 5 layers in iGPT-XL yields 72.0% top-1 accuracy, outperforming AMDIM, MoCo, and CPC v2, but still underperforming SimCLR by a decent margin.
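A minimal sketch of that concatenation step, with illustrative shapes:

```python
# Approximate a wider embedding by concatenating probe features from several layers.
import torch

layer_features = [torch.randn(8, 3072) for _ in range(5)]  # illustrative: 5 layers, batch of 8
combined = torch.cat(layer_features, dim=-1)               # (8, 15360) concatenated features
assert combined.shape == (8, 5 * 3072)
```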

Method | Input Resolution | Features | Parameters | Accuracy
Rotation | original | 8192 | 86M | 55.4
iGPT-L | 32×32 | 1536 | 1362M | 60.3
BigBiGAN | original | 16384 | 86M | 61.3
iGPT-L | 48×48 | 1536 | 1362M | 65.2
AMDIM | original | 8192 | 626M | 68.1
MoCo | original | 8192 | 375M | 68.6
iGPT-XL | 64×64 | 3072 | 6801M | 68.7
SimCLR | original | 2048 | 24M | 69.3
CPC v2 | original | 4096 | 303M | 71.5
iGPT-XL | 64×64 | 3072 × 5 | 6801M | 72.0
SimCLR | original | 8192 | 375M | 76.5
A comparison of linear probe accuracies between our models and state-of-the-art self-supervised models. We achieve competitive performance while training at much lower input resolutions, though our method requires more parameters and compute.

Because masked language models like BERT have outperformed generative models on most language tasks, we also evaluate the performance of BERT on our image models. Instead of training our model to predict the next pixel given all preceding pixels, we mask out 15% of the pixels and train our model to predict them from the unmasked ones. We find that though linear probe performance on BERT models is significantly worse, they excel during fine-tuning:

[Figure panels: CIFAR-10 and ImageNet.]
Comparison of generative pre-training with BERT pre-training using iGPT-L at an input resolution of 32² × 3. Bold colors show the performance boost from ensembling BERT masks. We see that generative models produce much better features than BERT models after pre-training, but BERT models catch up after fine-tuning.
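As an illustration of the masking objective described above, the sketch below hides 15% of the pixel tokens behind a mask-token id (the id itself is hypothetical) and returns the positions over which the prediction loss would be computed.

```python
# Hedged sketch of BERT-style pre-training on pixel sequences: mask 15% of positions and
# train the model to reconstruct them from the unmasked context. Token ids are illustrative.
import torch

def make_bert_batch(tokens: torch.Tensor, mask_token: int = 512, mask_prob: float = 0.15):
    # tokens: (batch, seq_len) palette indices. Returns (inputs, targets, mask).
    mask = torch.rand(tokens.shape) < mask_prob
    inputs = tokens.clone()
    inputs[mask] = mask_token          # replace masked pixels with a special mask token
    return inputs, tokens, mask        # loss is cross-entropy at the masked positions only

batch = torch.randint(0, 512, (4, 1024))   # illustrative batch of 32x32 images
inputs, targets, mask = make_bert_batch(batch)
```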

While unsupervised learning promises excellent features without the need for human-labeled data, significant recent progress has been made under the more forgiving framework of semi-supervised learning, which allows for limited amounts of human-labeled data. Successful semi-supervised methods often rely on clever techniques such as consistency regularization, data augmentation, or pseudo-labeling, and purely generative approaches have not been competitive for years. We evaluate iGPT-L[7] on a competitive benchmark for this sub-field and find that a simple linear probe on features from non-augmented images outperforms Mean Teacher and MixMatch, though it underperforms FixMatch.

Model | 40 labels | 250 labels | 4000 labels
Improved GAN | – | – | 81.4 ± 2.3
Mean Teacher | – | 67.7 ± 2.3 | 90.8 ± 0.2
MixMatch | 52.5 ± 11.5 | 89.0 ± 0.9 | 93.6 ± 0.1
iGPT-L | 73.2 ± 1.5 | 87.6 ± 0.6 | 94.3 ± 0.1
UDA | 71.0 ± 5.9 | 91.2 ± 1.1 | 95.1 ± 0.2
FixMatch RA | 86.2 ± 3.4 | 94.9 ± 0.7 | 95.7 ± 0.1
FixMatch CTA | 88.6 ± 3.4 | 94.9 ± 0.3 | 95.7 ± 0.2

A comparison of performance on low-data CIFAR-10. By leveraging many unlabeled ImageNet images, iGPT-L is able to outperform methods such as Mean Teacher and MixMatch but still underperforms state-of-the-art methods. Our approach to semi-supervised learning is very simple since we only fit a logistic regression classifier on iGPT-L’s features without any data augmentation or fine-tuning—a significant difference from specially designed semi-supervised approaches.

Limitations

While we have shown that iGPT is capable of learning powerful image features, there are still significant limitations to our approach. Because we use the generic sequence transformer used for GPT-2 in language, our method requires large amounts of compute: iGPT-L was trained for roughly 2500 V100-days while a similarly performing MoCo model can be trained in roughly 70 V100-days.

Relatedly, we model low-resolution inputs using a transformer, while most self-supervised results use convolution-based encoders, which can easily consume inputs at high resolution. A new architecture, such as a domain-agnostic multiscale transformer, might be needed to scale further. Given these limitations, our work primarily serves as a proof-of-concept demonstration of the ability of large transformer-based language models to learn excellent unsupervised representations in novel domains without the need for hardcoded domain knowledge. However, the significant resource cost of training these models and the greater accuracy of convolutional neural network-based methods preclude these representations from practical real-world applications in the vision domain.

Finally, generative models can exhibit biases that are a consequence of the data they’ve been trained on. Many of these biases are useful, like assuming that a combination of brown and green pixels represents a branch covered in leaves, then using this bias to continue the image. But some of these biases will be harmful, when considered through a lens of fairness and representation. For instance, if the model develops a visual notion of a scientist that skews male, then it might consistently complete images of scientists with male-presenting people, rather than a mix of genders. We expect that developers will need to pay increasing attention to the data that they feed into their systems and to better understand how it relates to biases in trained models.

Conclusion

We have shown that by trading off 2-D knowledge for scale and by choosing predictive features from the middle of the network, a sequence transformer can be competitive with top convolutional nets for unsupervised image classification. Notably, we achieved our results by directly applying the GPT-2 language model to image generation. Our results suggest that due to its simplicity and generality, a sequence transformer given sufficient compute might ultimately be an effective way to learn excellent features in many domains.

If you’re excited to work with us on this area of research, we’re hiring!
