DALL·E 2 Pre-Training Mitigations

DALL·E 2 Pre-Training Mitigations

DALL·E 2 Pre-Training Mitigations

In order to share the magic of DALL·E 2 with a broad audience, we needed to reduce the risks associated with powerful image generation models. To this end, we put various guardrails in place to prevent generated images from violating our content policy. This post focuses on pre-training mitigations, a subset of these guardrails which directly modify the data that DALL·E 2 learns from. In particular, DALL·E 2 is trained on hundreds of millions of captioned images from the internet, and we remove and reweight some of these images to change what the model learns.

This post is organized in three sections, each describing a different pre-training mitigation:

  • In the first section, we describe how we filtered out violent and sexual images from DALL·E 2’s training dataset. Without this mitigation, the model would learn to produce graphic or explicit images when prompted for them, and might even return such images unintentionally in response to seemingly innocuous prompts.
  • In the second section, we find that filtering training data can amplify biases, and describe our technique to mitigate this effect. For example, without this mitigation, we noticed that models trained on filtered data sometimes generated more images depicting men and fewer images depicting women compared to models trained on the original dataset.
  • In the final section, we turn to the issue of memorization, finding that models like DALL·E 2 can sometimes reproduce images they were trained on rather than creating novel images. In practice, we found that this image regurgitation is caused by images that are replicated many times in the dataset, and mitigate the issue by removing images that are visually similar to other images in the dataset.

Reducing Graphic and Explicit Training Data

Since training data shapes the capabilities of any learned model, data filtering is a powerful tool for limiting undesirable model capabilities. We applied this approach to two categories—images depicting graphic violence and sexual content—by using classifiers to filter images in these categories out of the dataset before training DALL·E 2. We trained these image classifiers in-house and are continuing to study the effects of dataset filtering on our trained model.

To train our image classifiers, we reused an approach that we had previously employed to filter training data for GLIDE. The basic steps to this approach are as follows: first, we create a specification for the image categories we would like to label; second, we gather a few hundred positive and negative examples for each category; third, we use an active learning procedure to gather more data and improve the precision/recall trade-off; and finally, we run the resulting classifier on the entire dataset with a conservative classification threshold to favor recall over precision. To set these thresholds, we prioritized filtering out all of the bad data over leaving in all of the good data. This is because we can always fine-tune our model with more data later to teach it new things, but it’s much harder to make the model forget something that it has already learned.

DALL·E 2 Pre-Training Mitigations
DALL·E 2 Pre-Training Mitigations
We start with a small dataset of labeled images (top of figure). We then train a classifier on this data. The active learning process then uses the current classifier to select a handful of unlabeled images that are likely to improve classifier performance. Finally, humans produce labels for these images, adding them to the labeled dataset. The process can be repeated to iteratively improve the classifier’s performance.

During the active learning phase, we iteratively improved our classifiers by gathering human labels for potentially difficult or misclassified images. Notably, we used two active learning techniques to choose images from our dataset (which contains hundreds of millions of unlabeled images) to present to humans for labeling. First, to reduce our classifier’s false positive rate (i.e., the frequency with which it misclassifies a benign image as violent or sexual), we assigned human labels to images that the current model classified as positive. For this step to work well, we tuned our classification threshold for nearly 100% recall but a high false-positive rate; this way, our labelers were mostly labeling truly negative cases. While this technique helps to reduce false positives and reduces the need for labelers to look at potentially harmful images, it does not help find more positive cases that the model is currently missing.

To reduce our classifier’s false negative rate, we employed a second active learning technique: nearest neighbor search. In particular, we ran many-fold cross-validation to find positive samples in our current labeled dataset which the model tended to misclassify as negative (to do this, we literally trained hundreds of versions of the classifier with different train-validation splits). We then scanned our large collection of unlabeled images for nearest neighbors of these samples in a perceptual feature space, and assigned human labels to the discovered images. Thanks to our compute infrastructure, it was trivial to scale up both classifier training and nearest neighbor search to many GPUs, allowing the active learning step to take place over a number of minutes rather than hours or days.

To verify the effectiveness of our data filters, we trained two GLIDE models with the same hyperparameters: one on unfiltered data, and one on the dataset after filtering. We refer to the former model as the unfiltered model, and the latter as the filtered model. As expected, we found that the unfiltered model generally produced less explicit or graphic content in response to requests for this kind of content. However, we also found an unexpected side-effect of data filtering: it created or amplified the model’s biases towards certain demographics.

DALL·E 2 Pre-Training Mitigations

DALL·E 2 Pre-Training Mitigations

Generations for the prompt “military protest” from our unfiltered model (left) and filtered model (right). Notably, the filtered model almost never produces images of guns.

Fixing Bias Introduced by Data Filters

Generative models attempt to match the distribution of their training data, including any biases therein. As a result, filtering the training data has the potential to create or amplify biases in downstream models. In general, fixing biases in the original dataset is a difficult sociotechnical task that we continue to study, and is beyond the scope of this post. The problem we address here is the amplification of biases caused specifically by data filtering itself. With our approach, we aim to prevent the filtered model from being more biased than the unfiltered model, essentially reducing the distribution shift caused by data filtering.

As a concrete example of bias amplification due to filtering, consider the prompt “a ceo”. When our unfiltered model generated images for this prompt, it tended to produce more images of men than women, and we expect that most of this bias is a reflection of our current training data. However, when we ran the same prompt through our filtered model, the bias appeared to be amplified; the generations were almost exclusively images of men.

We hypothesize that this particular case of bias amplification comes from two places: first, even if women and men have roughly equal representation in the original dataset, the dataset may be biased toward presenting women in more sexualized contexts; and second, our classifiers themselves may be biased either due to implementation or class definition, despite our efforts to ensure that this was not the case during the data collection and validation phases. Due to both of these effects, our filter may remove more images of women than men, which changes the gender ratio that the model observes in training.

To investigate filter-induced bias more thoroughly, we wanted a way to measure how much our data filters were affecting the bias towards various concepts. Notably, our violence and sexual content filters are purely image-based, but the multimodal nature of our dataset allows us to directly measure the effects of these filters on text. Since every image is accompanied by a text caption, we were able to look at the relative frequency of hand-selected keywords across the filtered and unfiltered dataset to estimate how much the filters were affecting any given concept.

To put this into practice, we used Apache Spark to compute the frequencies of a handful of keywords (e.g., “parent”, “woman”, “kid”) over all of the captions in both our filtered and unfiltered datasets. Even though our dataset contains hundreds of millions of text-image pairs, computing these keyword frequencies only took a few minutes using our compute cluster.

After computing keyword frequencies, we were able to confirm that our dataset filters had indeed skewed the frequencies of certain keywords more than others. For example, the filters reduced the frequency of the word “woman” by 14%, while the frequency of the word “man” was only reduced by 6%. This confirmed, on a large scale, what we had already observed anecdotally by sampling from GLIDE models trained on both datasets.

DALL·E 2 Pre-Training Mitigations
DALL·E 2 Pre-Training Mitigations
An illustration of dataset reweighting. We start with a balanced dataset (left). If our filter affects one category more than another, it can create a biased dataset (middle). Using reweighting, we effectively “repeat” some data more than others, allowing us to rebalance the bias caused by the filters (right).

Now that we had a proxy for measuring filter-induced bias, we needed a way to mitigate it. To tackle this problem, we aimed to re-weight the filtered dataset so that its distribution better matched the distribution of unfiltered images. As a toy example to illustrate this idea, suppose our dataset consists of 50% cat photos and 50% dog photos, but our data filters remove 75% of dogs but only 50% of cats. The final dataset would be ⅔ cats and ⅓ dogs, and a likelihood-based generative model trained on this dataset would likely generate more images of cats than dogs. We can fix this imbalance by multiplying the training loss of every image of a dog by 2, emulating the effect of repeating every dog image twice. It turns out that we can scale this approach to our real datasets and models in a way that is largely automatic–that is, we needn’t hand-select the features that we want to reweight.

We compute weights for images in the filtered dataset using probabilities from a special classifier, similar to the approach used by Choi et al. (2019). To train this classifier, we uniformly sample images from both datasets and predict which dataset the image came from. In particular, this model predicts P(unfiltered|image), given a prior P(unfiltered) = 0.5. In practice, we don’t want this model to be too powerful, or else it might learn the exact function implemented by our filters in the first place. Instead, we want the model to be smoother than our original data filters, capturing broad categories that are affected by the filters while still being unsure about whether a particular image would be filtered or not. To this end, we trained a linear probe on top of a small CLIP model.

Once we have a classifier which predicts the probability that an image is from the unfiltered dataset, we still need to convert this prediction into a weight for the image. For example, suppose that P(unfiltered|image) = 0.8. This means that the sample is 4 times more likely to be found in the unfiltered data than the filtered data, and a weight of 4 should correct the imbalance. More generally, we can use the weight P(unfiltered|image)/P(filtered|image).[1]

How well does this reweighting scheme actually mitigate the amplified bias? When we fine-tuned our previous filtered model with the new weighting scheme, the fine-tuned model’s behavior much more closely matched the unfiltered model on the biased examples we had previously found. While this was encouraging, we also wanted to evaluate this mitigation more thoroughly using our keyword-based bias heuristic. To measure keyword frequencies while taking our new weighting scheme into account, we can simply weight every instance of a keyword in the filtered dataset by the weight of the sample that contains it. Doing this, we get a new set of keyword frequencies that reflect the sample weights in the filtered dataset.

Across most of the keywords we checked, the reweighting scheme reduced the frequency change induced by filtering. For our previous examples of “man” and “woman”, the relative frequency reductions became 1% and –1%, whereas their previous values were 14% and 6%, respectively. While this metric is just a proxy for actual filtering bias, it is reassuring that our image-based reweighting scheme actually improves a text-based metric so significantly.

We are continuing to investigate remaining biases in DALL·E 2, in part through larger evaluations of the model’s behavior and investigations of how filtering impacted bias and capability development.

Preventing Image Regurgitation

We observed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not just “stitch together” pieces of existing images. Additionally, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if people’s photos were present in training data).

To better understand the issue of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset, and sorted the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50k total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.

When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o’clock—but then we would discover a training sample containing the same clock showing 2 o’clock, and then 3 o’clock, etc. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.

The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.[2] However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost.

Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every single pair of images.[3] When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs when using K=1024 clusters.

To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters. In practice, this found 97% of all duplicate pairs on a subset of our data.

Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might have hurt the model’s performance.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.

Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50k prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for the image from the training dataset. To take this test another step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50k generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.

Next Steps

While all of the mitigations discussed above represent significant progress towards our goal of reducing the risks associated with DALL·E 2, each mitigation still has room to improve:

  • Better pre-training filters could allow us to train DALL·E 2 on more data and potentially further reduce bias in the model. Our current filters are tuned for a low miss-rate at the cost of many false positives. As a result, we filtered out roughly 5% of our entire dataset even though most of these filtered images do not violate our content policy at all. Improving our filters could allow us to reclaim some of this training data.
  • Bias is introduced and potentially amplified at many stages of system development and deployment. Evaluating and mitigating the bias in systems like DALL·E 2 and the harm induced by this bias is an important interdisciplinary problem that we continue to study at OpenAI as part of our broader mission. Our work on this includes building evaluations to better understand the problem, curating new datasets, and applying techniques like human feedback and fine-tuning to build more robust and representative technologies.
  • It is also crucial that we continue to study memorization and generalization in deep learning systems. While deduplication is a good first step towards preventing memorization, it does not tell us everything there is to learn about why or how models like DALL·E 2 memorize training data.


Alex Nichol, Aditya Ramesh, Pamela Mishkin, Prafulla Dariwal, Joanne Jang, Mark Chen

Writing contributions from
Greg Brockman, Aditya Ramesh, Pamela Mishkin, Mark Chen, Pranav Shyam, Casey Chu, Che Chang, Miles Brundage


  1. When we parametrize P(unfiltered|image) as sigmoid(f(x)), the weight is then exp(f(x)). This can be derived using the definition of the sigmoid:
    $ 1/(1+e^{-f(x)}) / (1-1/(1+e^{-f(x)}))$
    $= 1/(1+e^{-f(x)}) / ((1+e^{-f(x)} – 1)/(1+e^{-f(x)}))$
    $= 1/(1+e^{-f(x)}) / ((e^{-f(x)})/(1+e^{-f(x)}))$
    $= (1+e^{-f(x)})/(1+e^{-f(x)}) / (e^{-f(x)})$
    $= 1 / (e^{-f(x)}) = e^{f(x)}$ ↩︎

  2. To achieve this, we can compute a feature vector $v_i$ for every training image $i$, and then remove all images $j$ such that there exists an $i < j$ where $||v_i – v_j|| < $threshold. To solve this problem naively, we would need to compute every pairwise distance $||v_i – v_j||$, a task that scales quadratically with the size of our dataset. ↩︎

  3. Letting $K$ represent the number of clusters and $N$ the dataset size, this approach only requires $O(K*(N/K)^2) = O(N^2/K)$ pairwise distance calculations, rather than the full $O(N^2)$. Meanwhile, we are still guaranteed that no image will have more than $K$ near-duplicates in the worst possible case ↩︎


Learning to Play Minecraft with Video PreTraining (VPT)

Learning to Play Minecraft with Video PreTraining (VPT)

Learning to Play Minecraft with Video PreTraining (VPT)

We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Our model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.

Read Paper

View Code and model weights

MineRL Competition

The internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened but not precisely how it was achieved, i.e. you will not know the exact sequence of mouse movements and keys pressed. If we would like to build large-scale foundation models in these domains as we’ve done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where “action labels” are simply the next words in a sentence.

In order to utilize the wealth of unlabeled video data available on the internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past and future information to guess the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.

Learning to Play Minecraft with Video PreTraining (VPT)
Learning to Play Minecraft with Video PreTraining (VPT)
VPT method overview

VPT Zero-Shot Results

We chose to validate our method in Minecraft because it (1) is one of the most actively played video games in the world and thus has a wealth of freely available video data and (2) is open-ended with a wide variety of things to do, similar to real-world applications such as computer usage. Unlike prior works in Minecraft that use simplified action spaces aimed at easing exploration, our AI uses the much more generally applicable, though also much more difficult, native human interface: 20Hz framerate with the mouse and keyboard.

Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the “VPT foundation model”) accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds or 1,000 consecutive game actions.

Learning to Play Minecraft with Video PreTraining (VPT)
Learning to Play Minecraft with Video PreTraining (VPT)
Sequence of items required to craft a crafting table, labeled with the median time it takes proficient humans to reach each step
Crafting of a crafting table “zero shot” (i.e. after pre-training only without additional fine-tuning)

Additionally, the model performs other complex skills humans often do in the game, such as swimming, hunting animals for food, and eating that food. It also learned the skill of “pillar jumping”, a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself.

Swimming (zero-shot)

Hunting animals (zero-shot)

Eating food (zero-shot)

Pillar jumping (zero-shot)

Fine-tuning with Behavioral Cloning

Foundation models are designed to have a broad behavior profile and be generally capable across a wide variety of tasks. To incorporate new knowledge or allow them to specialize on a narrower task distribution, it is common practice to fine-tune these models to smaller, more specific datasets. As a case study into how well the VPT foundation model can be fine-tuned to downstream datasets, we asked our contractors to play for 10 minutes in brand new Minecraft worlds and build a house from basic Minecraft materials. We hoped that this would amplify the foundation model’s ability to reliably perform “early game” skills such as building crafting tables. When fine-tuning to this dataset, not only do we see a massive improvement in reliably performing the early game skills already present in the foundation model, but the fine-tuned model also learns to go even deeper into the technology tree by crafting both wooden and stone tools. Sometimes we even see some rudimentary shelter construction and the agent searching through villages, including raiding chests.

Learning to Play Minecraft with Video PreTraining (VPT)
Learning to Play Minecraft with Video PreTraining (VPT)
Sequence of items required to craft a stone pickaxe, labeled with the median time it takes proficient humans to reach each step
Improved early game behavior from BC fine-tuning

Crafting a stone pickaxe

Constructing a rudimentary wooden shelter

Searching through a village

Data Scaling

Perhaps the most important hypothesis of our work is that it is far more effective to use labeled contractor data to train an IDM (as part of the VPT pipeline) than it is to directly train a BC foundation model from that same small contractor dataset. To validate this hypothesis we train foundation models on increasing amounts of data from 1 to 70,000 hours. Those trained on under 2,000 hours of data are trained on the contractor data with ground-truth labels that were originally collected to train the IDM, and those trained on over 2,000 hours are trained on internet data labeled with our IDM. We then take each foundation model and fine-tune it to the house building dataset described in the previous section.

Effect of foundation model training data on fine-tuning

As foundation model data increases, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.

Fine-Tuning with Reinforcement Learning

When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with random exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL because emulating human behavior is likely much more helpful than taking random actions. We set our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft made all the more difficult when using the native human interface.

Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. To make this task tractable, we reward agents for each item in the sequence.

Learning to Play Minecraft with Video PreTraining (VPT)
Learning to Play Minecraft with Video PreTraining (VPT)
RL fine-tuned VPT model crafting a diamond pickaxe

We found that an RL policy trained from a random initialization (the standard RL method) barely achieves any reward, never learning to collect logs and only rarely collecting sticks. In stark contrast, fine-tuning from a VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, which takes humans over 20 minutes (24,000 actions) on average.

Reward over episodes


VPT paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet. Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.

For more information, please see our paper. We are also open sourcing our contractor data, Minecraft environment, model code, and model weights, which we hope will aid future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS competition this year. Contestants can use and fine-tune our models to try to solve many difficult tasks in Minecraft. Those interested can check out the competition webpage and compete for a blue-sky prize of $100,000 in addition to a regular prize pool of $20,000. Grants are available to self-identified underrepresented groups and individuals.

This was a large effort by a dedicated team. Each author made huge contributions on many fronts over long time periods. All members were full time on the project for over six months. BB, IA, PZ, and JC were on the original VPT project team, and thus were involved for even longer (over a year). Aside from those original team members, author order is random. It was also randomized between IA and PZ.


AI-Written Critiques Help Humans Notice Flaws

AI-Written Critiques Help Humans Notice Flaws

AI-Written Critiques Help Humans Notice Flaws

We trained “critique-writing” models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-critiquing, with scale improving critique-writing more than summary-writing. This shows promise for using AI systems to assist human supervision of AI systems on difficult tasks.

Read paperView dataset

We want to ensure that future AI systems performing very difficult tasks remain aligned with human intent. Many previous works on aligning language models rely on human evaluations as a training signal. However, humans struggle at evaluating very difficult tasks—for example, it is hard to spot every bug in a codebase or every factual error in a long essay. Models may then learn to give outputs that look good to humans but have errors we systematically fail to notice.

To mitigate this problem, we want to train AI assistants that help humans provide feedback on hard tasks. These assistants should point out flaws, help humans understand what’s going on, and answer their questions. An example of this is our past work on book summarization: reading the entire book is a lot of work, but humans assisted with chapter summaries have a much easier time evaluating a book summary.

As a proof of concept, we used supervised learning to train language models to write critiques of topic-based summaries of short stories, Wikipedia articles, and other texts from the internet. We use these models to assist human evaluators and study scaling properties of critique writing.

Experiments with AI assistance

We compare human ratings of AI-written summaries between a control group receiving no assistance and an assisted group who get to see 8 AI-written critiques. Summaries are picked from 3 different sources. Assisted humans find about 50% more flaws in summaries than unassisted raters, using model critiques directly for most of the critiques they find.

To see how useful our models are for evaluation assistance, we show labelers 8 model-written critiques of each summary, with a control group that receives no assistance. We use topic-based summaries from three sources: written by our models, written by humans, and written by humans deliberately to have important yet subtle flaws.

New Jersey is in the crosshairs of a major winter storm that could paralyze parts of New England and dump in excess of a foot of snow on the Garden State by Saturday. The forecast remains highly volatile and may change dramatically in the coming 24 hours.

Throughout the day, The Star-Ledger will provide updates here (newest on top) as new information comes in, watches and warnings are issued and the forecast changes.

10:30 P.M. Weather forecasters tonight reiterated warnings for drivers and residents that a potentially dangerous portion of the storm will be hitting much of central and northern New Jersey during Friday’s evening rush-hour. Major travel delays are expected late Friday and Friday night as rain turns into snow, the National Weather Service forecast said.


• Friday, Feb. 8: N.J. snowstorm: Live updates on blizzard, traffic, flooding and more

• Saturday, Feb. 9: N.J. snowstorm update: Power outages, snow totals and other storm news

After periods of rain, heavy snow is expected to be falling in many places by late Friday afternoon , the forecast said. In some places north of Interstate 78, snow is expected to come down between 1 and 2 inches per hour. In counties like Sussex, Morris and Warren, expected snow accumulations range from 6 to 16 inches.

For many towns from Jackson in Ocean County to Somerville in Somerset County and out east to Long Beach Island, snow accumulation is expected to range from 4 to 10 inches. High winds are expected throughout the region, topping out in Monmouth County, with gusts up to 45 mph possible.

By daybreak Saturday, flurries will taper off, giving way to a sunny, blustery day, the latest forecast said.

9:12 P.M. With forecasters still predicting a major winter storm to hit New Jersey, many schools throughout the state are preemptively canceling or delaying classes Friday.

8:45 P.M. In advance of the storm, NJ Transit has announced it will be offering full systemwide cross-honoring all day Friday and all day Saturday, enabling customers to use their ticket or pass on an alternate travel mode — rail, bus or light rail.

5 P.M. The signatures of thunder-snow (which is just what it sounds like — thunder and lightning during heavy snow) are showing up on several models, according to NY NJ PA Weather meteorologistSteven DiMartino.

This indicates the potential for extremely heavy snow to fall in eastern New Jersey tomorrow night, and adds to the unpredictability to totals.

”Where you get some of this convective snow, when it comes down, it’s going to come down very, very hard,” he said. “It’s difficult to pinpoint just where these bands are going to occur. You could end up with a situation where one town has 18 inches of snow and the next town over has three.”

DiMartino stressed the volatility that remains in the forecast, and urged state residents to pay close attention to changing conditions. Many of the details of what ultimately will happen in local areas will not be determined until the storm beings to come together tomorrow.

He said the potential for these heavier snow bands to develop may be why some forecast models (like the NAM, above), are predicting much heavier snowfall totals than the National Weather Service.


The North American Model (NAM), released this afternoon, showed well over a foot of snow falling over many areas in New Jersey.

4:13 P.M. The National Weather Service has issued a blizzard warning for parts of northeastern New Jersey, including Newark and Jersey City, and the five boroughs of New York, where upwards of 14 inches of snow are expected along with howling winds and severely reduced visibility.

The blizzard warnings are in effect from 6 a.m. Friday until 1 p.m. Saturday and warn of 10 to 14 inches of snow, with locally higher amounts and white-out conditions with wind gusts of up to 45 miles per hour. Blizzard conditions are expected in coastal northeastern New Jersey, in southern Bergen and Passaic Counties and Eastern Hudson, Essex and Union counties.

Further north and west, 10 to 14 inches of snow are also expected, but winds are not expected to reach blizzard criteria. Winter storm warnings are in effect there.

3:24 P.M. The National Weather Service at Mount Holly has issued Winter Storm warnings for several counties in northern and central New Jersey and extended further them further south than the areas the previously issued watches covered.

The winter storm warnings have been issued for Sussex, Warren, Morris, Hunterdon, Middlesex, Monmouth, Ocean and northwest Burlington counties. In Sussex, Warren and Morris counties, the National Weather Service is expecting between ten to 16 inches of snow to fall, while other counties in the warning areacould receive six to ten inches. The warnings are in effect from 6 a.m. Friday to 6 a.m. Saturday.

Expect the National Weather Service’s Upton, N.Y. office, which covers northeastern N.J., to follow suit shortly.

Further south, winter weather advisories have been issued for the rest of the state, where between two and five inches of snow is anticipated.

3:07 P.M.The private and public sectors in New Jersey are now bracing for major storm impacts.

More than 350 United Airlines flights, many based out of Newark-Liberty International Airport, have already been canceled, according to flight tracking website FlightAware. NJ Transit announced they will cross-honor tickets across its entire system. Utilities like Jersey Central Power & Light and PSE&G say they will have extra crews on hand to deal with potential power issues caused by heavy snow and wind.

Additionally, several events are being postponed across the state, such as two sectional high school track championships. The state Office of Emergency Management has not yet opened its operations center in Trenton, but it remains a possibility. Mary Goepfert, a spokeswoman for OEM, said the state is monitoring the storm closely and has been in contact with local emergency managers in preparation.

2:07 P.M. The European model is in and it looks snowy, much like many of the other models that ran earlier. Were this to verify, a six to 12-inch plus snowfall is definitely in the cards for north and central New Jersey, particularly north of Interstate-195.

Freehold-based meteorologist and owner of NY NJ PA Weather Steven DiMartino said he likes the European solution best, so far, and agrees with totals.

What does the NAM look like, you ask? Well the snowfall printout is posted below, but Eric Holthaus tweeted a picture of the simulated radar produced by the NAM model for tomorrow night. An absolute monster.

1:50 P.M. The most-affected regions of Hurricane Sandy along the New Jersey coast are about to take another hit. With defenses already weakened, coastal communities could see major impacts from coastal flooding, with the worst coming Saturday morning, according to the National Weather Service.

”I’m really worried about the areas worst hit by Sandy,” said NWS meteorologist Gary Szatkowski. “Time is starting to work against us…We could see substantial beach erosion. I know people have been working hard, but there’s less to erode. We could easily see waves and water coming into areas you typically wouldn’t.”

Szatkowski said he is concerned about the Raritan Bay shore in particular, where a three foot storm surge is possible at high tide Saturday morning, with five to seven foot waves breaking over top of it.

1:22 P.M. Tomorrow night’s commute could be awful in northern New Jersey. By 7 p.m., there is a threat that snowfall rates could reach two inches per hour across large swaths of northern and central New Jersey. Snowfall rates of this magnitude could reduce visibility substantially, wreak havoc on roads and make travel dangerous, if not nearly impossible.

Gary Szatkowski, meteorologist in charge at the National Weather Service’s Mount Holly office, said he is going “very worried” about deteoriorating conditions in the afternoon, and posted a map on Twitter showing where the threat of intense snowfall will be at 7 p.m.

12:34 P.M. An important thing to remember about this storm is the volatility in the forecast remains high, even though models have been trending snowier. State Climatologist David Robinson said the bust potential for this forecast is “tremendous” and the slightest shift in the forecast track could mean the difference between a major snowstorm, and a primarily rain event for much of the state.

Eric Holthaus, of the Wall Street Journal, points out that how much warm air enters region prior to storm will be crucial

12:04 P.M. The National Weather Service at Mount Hollyand Upton, N.Y. both issued briefing packages on the coming storm this morning. Each warned that blizzard conditions may occur Friday night in northern New Jersey. Mount Holly suggested blizzard warnings may be necessary as the storm unfolds.

Blizzard warnings are issued during very specific situations by the National Weather Service. Anticipated winds of at least 35 miles per hour and visibility reduced below a quarter of a mile for a period of three hours is necessary before the agency pulls the trigger on such a warning. Travel would become all but impossible.

11:53 A.M. David Robinson, the state climatologist at Rutgers University, said he does not envy forecasters today, calling this type of storm “the most difficult forecast a New Jersey meteorologist will have to make.” The forecast is complicated for a number of reasons, from New Jersey’s geography to the thermal profile of the atmosphere. More on why New Jersey winter storms are so hard to pin down later.

11:35 A.M. Forecast model guidance on the storm continues to vary but appears to be focusing in on a snowier solution for northern and central New Jersey. Overnight, several reliable models (The European, GFS and NAM) showed very different solutions to the storm, showing everything from minor event to a major winter storm that would have serious impacts on travel in northern sections of the state.

This morning, the GFS and NAM both showed the bulk of New Jersey north of I-195 receiving several inches of snow, perhaps exceeding a foot in some areas. The latest run of the European model, considered one of the most reliable, will be released at approximately 1:30 p.m.


The North American Model (NAM) shows an even snowier solution for New Jersey, with parts of the state easily exceeding a foot of snow.

Keep in mind, each model run is just one of scores of pieces of data the National Weather Service uses to make forecasts and no single model should be viewed as a complete representation of what will happen.

11:30 A.M. A winter storm watch remains in effectfor the vast majority of northern and central New Jersey. Current forecasts call for six to 12 inches of snow, with higher amounts possible in the northern most sections of New Jersey.

Because the storm is highly complex and much remains uncertain, particularly where the rain/snow line will fall, the National Weather Service is holding off on issuing any warnings until this afternoon.

_The Associated Press contributed to this report._

Follow @SStirling

Note to readers: if you purchase something through one of our affiliate links we may earn a commission.

What does the article say about the storm’s effects on daily life?

Daily events are expected to be heavily disrupted during the storm as the conditions for a blizzard are ripe. Travel is expected to be heavily affected, and NJ Transit announced that commuters will be able to use their tickets for any transit option like buses, ferries, etc. will be able to use it for another method. Visibility is expected to be extremely reduced as well. Furthermore, due to the coastal areas having already been affected by the hurricane Sandy, this storm may push waves and water inland to a degree much further than previously seen. The storm can also end up destroying the roads, making travel impossible. Finally, many schools preemptively canceled schools in addition to several events.
Daily events are expected to be heavily disrupted during the storm as the conditions for a blizzard are ripe. Travel is expected to be heavily affected, and NJ Transit announced that commuters will be able to use their tickets for any transit option like buses, ferries, etc. will be able to use it for another method. Visibility is expected to be extremely reduced as well. Furthermore, this storm may push waves and water inland to a degree much further than previously seen as an aftereffect of Hurricane Sandy. The storm can also end up destroying the roads, making travel impossible. Finally, many schools preemptively canceled schools in addition to several events.
The storm could paralyze parts of New England, and dump a foot of snow on New Jersey. Travel will be impacted. Many flights have already been canceled. Schools are preemptively canceling or delaying classes. Events are being postponed. There is a threat that snowfall rates could reach two inches per hour across large swaths of northern and central New Jersey. Snowfall rates of this magnitude could reduce visibility substantially, wreak havoc on roads and make travel dangerous, if not nearly impossible.

The response inaccurately mentions that the ferry is alternative transit option during the storm.

The response contains inaccurate information.
The answer should mention the potential for power outages.
The response should include the potential for power outages.
The answer is missing the part about the potential for power outages.
“etc. will be able to use it for another method” is incoherent
“wil have to” should be “will be able to” because “ferries, etc.” is unnecessary.

The answer is missing the part about the potential for power outages.
The last sentence doesn’t make sense
The answer should make some mention of the expected snow

The response contains inaccurate information.
The answer should mention the potential for power outages.
The last sentence is redundant and should be removed.
“etc. will be able to use it for another method” is unclear and unnecessary.
The response should include that the storm will weaken the defenses of the coast that were weakened by Hurricane Sandy.
The sentence about NJ Transit is poorly constructed.

The answer should mention the potential for power outages.
“etc. will be able to use it for another method” is unclear and unnecessary.
The response should include that the storm will weaken the defenses of the coast that were weakened by Hurricane Sandy.
This should point out that the storm will cause major travel delays as rain turns into snow.

The answer should mention the potential for power outages.
The answer should mention the blizzard warning.
The answer omits the potential for power outages.
The answer omits the expected power outages.
The answer should mention the power outages.
The answer should mention that the Utilities say they will have extra crews on hand to deal with potential power issues caused by heavy snow and wind.
The second and third sentences should be combined to reduce redundancy. Also, the preposition “of” at the end of the third sentence can be removed to reduce redundancy again.

The answer should mention the potential for power outages.

Even though summarization isn’t actually a difficult task for humans and our models aren’t more capable than humans, they already provide meaningful assistance: when asked to evaluate model-written summaries, the assisted group finds 50% more flaws than the control group. For deliberately misleading summaries, assistance increases how often humans spot the intended flaw from 27% to 45%.

Scaling properties of critiques

Assistance on model-written summaries only works if they are able to critique themselves. We ask humans to rate the helpfulness of model-written self-critiques, and find larger models are better at self-critiquing.

Larger models are better at self-critiquing in our topic-based summarization domain: Even though larger models have answers that are more difficult to critique, they generate more helpful critiques of their own outputs. In this plot, model scale is measured in log loss (nats) after fine-tuning. Helpfulness is determined by a human judging whether the model-generated critique of the model-generated answer is valid and useful for understanding summary quality. We filter for summaries that humans found a critique for.

We also find that large models are able to directly improve their outputs, using their self-critiques, which small models are unable to do. Using better critiques helps models make better improvements than they do with worse critiques, or with no critiques.

Do models tell us everything they know?

To provide the best evaluation assistance on difficult tasks, we would like models to communicate all problems that they “know about.” Whenever a model correctly predicts that an answer is flawed, can the model also produce a concrete critique that humans understand?

This is particularly important for supervising models that could attempt to mislead human supervisors or hide information. We would like to train equally smart assistance models to point out what humans don’t notice.

Unfortunately, we found that models are better at discriminating than at critiquing their own answers, indicating they know about some problems that they can’t or don’t articulate. Furthermore, the gap between discrimination and critique ability did not appear to decrease for larger models. Reducing this gap is an important priority for our alignment research.

Next steps

An important limitation of this work is that topic-based summarization is not actually a difficult task: humans understand it quite well and it takes them only about 10 minutes to evaluate a summary. To understand the limits of AI-assisted evaluation better, we need to work with tasks that are much more difficult for humans to evaluate.

Nevertheless, these results make us optimistic that we can train models to provide humans with meaningful feedback assistance. This is an important pillar of our alignment strategy, starting with the work on debate and recursive reward modeling. In the long run, we want to build assistants that can be trusted to take on all of the cognitive labor needed for evaluation, so humans can focus on communicating their preferences.

If you’re interested in this line of research, we’re hiring! Apply for roles under “All teams” and mention your interest in scalable alignment!


Techniques for Training Large Neural Networks

Techniques for Training Large Neural Networks

Techniques for Training Large Neural Networks

Large neural networks are at the core of many recent advances in AI, but training them is a difficult engineering and research challenge which requires orchestrating a cluster of GPUs to perform a single synchronized calculation. As cluster and model sizes have grown, machine learning practitioners have developed an increasing variety of techniques to parallelize model training over many GPUs. At first glance, understanding these parallelism techniques may seem daunting, but with only a few assumptions about the structure of the computation these techniques become much more clear—at that point, you’re just shuttling around opaque bits from A to B like a network switch shuttles around packets.

Data Parallelism

Techniques for Training Large Neural Networks

Pipeline Parallelism

Techniques for Training Large Neural Networks

Tensor Parallelism

Techniques for Training Large Neural Networks

Expert Parallelism

Techniques for Training Large Neural Networks

Data Parallelism

Techniques for Training Large Neural Networks

Pipeline Parallelism

Techniques for Training Large Neural Networks

Tensor Parallelism

Techniques for Training Large Neural Networks

Expert Parallelism

Techniques for Training Large Neural Networks

An illustration of various parallelism strategies on a three-layer model. Each color refers to one layer and dashed lines separate different GPUs.

No Parallelism

Training a neural network is an iterative process. In every iteration, we do a pass forward through a model’s layers to compute an output for each training example in a batch of data. Then another pass proceeds backward through the layers, propagating how much each parameter affects the final output by computing a gradient with respect to each parameter. The average gradient for the batch, the parameters, and some per-parameter optimization state is passed to an optimization algorithm, such as Adam, which computes the next iteration’s parameters (which should have slightly better performance on your data) and new per-parameter optimization state. As the training iterates over batches of data, the model evolves to produce increasingly accurate outputs.

Various parallelism techniques slice this training process across different dimensions, including:

  • Data parallelism—run different subsets of the batch on different GPUs;
  • Pipeline parallelism—run different layers of the model on different GPUs;
  • Tensor parallelism—break up the math for a single operation such as a matrix multiplication to be split across GPUs;
  • Mixture-of-Experts—process each example by only a fraction of each layer.

(In this post, we’ll assume that you are using GPUs to train your neural networks, but the same ideas apply to those using any other neural network accelerator.)

Data Parallelism

Data Parallel training means copying the same parameters to multiple GPUs (often called “workers”) and assigning different examples to each to be processed simultaneously. Data parallelism alone still requires that your model fits into a single GPU’s memory, but lets you utilize the compute of many GPUs at the cost of storing many duplicate copies of your parameters. That being said, there are strategies to increase the effective RAM available to your GPU, such as temporarily offloading parameters to CPU memory between usages.

As each data parallel worker updates its copy of the parameters, they need to coordinate to ensure that each worker continues to have similar parameters. The simplest approach is to introduce blocking communication between workers: (1) independently compute the gradient on each worker; (2) average the gradients across workers; and (3) independently compute the same new parameters on each worker. Step (2) is a blocking average which requires transferring quite a lot of data (proportional to the number of workers times the size of your parameters), which can hurt your training throughput. There are various asynchronous synchronization schemes to remove this overhead, but they hurt learning efficiency; in practice, people generally stick with the synchronous approach.

Pipeline Parallelism

With Pipeline Parallel training, we partition sequential chunks of the model across GPUs. Each GPU holds only a fraction of parameters, and thus the same model consumes proportionally less memory per GPU.

It’s straightforward to split a large model into chunks of consecutive layers. However, there’s a sequential dependency between inputs and outputs of layers, so a naive implementation can lead to a large amount of idle time while a worker waits for outputs from the previous machine to be used as its inputs. These waiting time chunks are known as “bubbles,” wasting the computation that could be done by the idling machines.

Techniques for Training Large Neural Networks
Techniques for Training Large Neural Networks
Techniques for Training Large Neural Networks
Gradient update
Techniques for Training Large Neural Networks
Techniques for Training Large Neural Networks

Illustration of a naive pipeline parallelism setup where the model is vertically split into 4 partitions by layer. Worker 1 hosts model parameters of the first layer of the network (closest to the input), while worker 4 hosts layer 4 (which is closest to the output). “F”, “B”, and “U” represent forward, backward and update operations, respectively. The subscripts indicate on which worker an operation runs. Data is processed by one worker at a time due to the sequential dependency, leading to large “bubbles” of idle time.

We can reuse the ideas from data parallelism to reduce the cost of the bubble by having each worker only process a subset of data elements at one time, allowing us to cleverly overlap new computation with wait time. The core idea is to split one batch into multiple microbatches; each microbatch should be proportionally faster to process and each worker begins working on the next microbatch as soon as it’s available, thus expediting the pipeline execution. With enough microbatches the workers can be utilized most of the time with a minimal bubble at the beginning and end of the step. Gradients are averaged across microbatches, and updates to the parameters happen only once all microbatches have been completed.

The number of workers that the model is split over is commonly known as pipeline depth.

During the forward pass, workers only need to send the output (called activations) of its chunk of layers to the next worker; during the backward pass, it only sends the gradients on those activations to the previous worker. There’s a big design space of how to schedule these passes and how to aggregate the gradients across microbatches. GPipe has each worker process forward and backward passes consecutively and then aggregates gradients from multiple microbatches synchronously at the end. PipeDream instead schedules each worker to alternatively process forward and backward passes.

Techniques for Training Large Neural Networks
Techniques for Training Large Neural Networks
Techniques for Training Large Neural Networks
Techniques for Training Large Neural Networks

Techniques for Training Large Neural Networks


Techniques for Training Large Neural Networks

Comparison of GPipe and PipeDream pipelining schemes, using 4 microbatches per batch. Microbatches 1-8 correspond to two consecutive data batches. In the image, “(number)” indicates on which microbatch an operation is performed and the subscript marks the worker ID. Note that PipeDream gets more efficiency by performing some computations with stale parameters.

Tensor Parallelism

Pipeline parallelism splits a model “vertically” by layer. It’s also possible to “horizontally” split certain operations within a layer, which is usually called Tensor Parallel training. For many modern models (such as the Transformer), the computation bottleneck is multiplying an activation batch matrix with a large weight matrix. Matrix multiplication can be thought of as dot products between pairs of rows and columns; it’s possible to compute independent dot products on different GPUs, or to compute parts of each dot product on different GPUs and sum up the results. With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results.

One example is Megatron-LM, which parallelizes matrix multiplications within the Transformer’s self-attention and MLP layers. PTD-P uses tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.

Sometimes the input to the network can be parallelized across a dimension with a high degree of parallel computation relative to cross-communication. Sequence parallelism is one such idea, where an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed with more granularly-sized examples.

Mixture-of-Experts (MoE)

With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input. One example approach is to have many sets of weights and the network can choose which set to use via a gating mechanism at inference time. This enables many more parameters without increased computation cost. Each set of weights is referred to as “experts,” in the hope that the network will learn to assign specialized computation and skills to each expert. Different experts can be hosted on different GPUs, providing a clear way to scale up the number of GPUs used for a model.

Techniques for Training Large Neural Networks

Illustration of a mixture-of-experts (MoE) layer. Only 2 out of the n experts are selected by the gating network. (Image adapted from: Shazeer et al., 2017)

GShard scales an MoE Transformer up to 600 billion parameters with a scheme where only the MoE layers are split across multiple TPU devices and other layers are fully duplicated. Switch Transformer scales model size to trillions of parameters with even higher sparsity by routing one input to a single expert.

Other Memory Saving Designs

There are many other computational strategies to make training increasingly large neural networks more tractable. For example:

  • To compute the gradient, you need to have saved the original activations, which can consume a lot of device RAM. Checkpointing (also known as activation recomputation) stores any subset of activations, and recomputes the intermediate ones just-in-time during the backward pass. This saves a lot of memory at the computational cost of at most one additional full forward pass. One can also continually trade off between compute and memory cost by selective activation recomputation, which is checkpointing subsets of the activations that are relatively more expensive to store but cheaper to compute.

  • Mixed Precision Training is to train models using lower-precision numbers (most commonly FP16). Modern accelerators can reach much higher FLOP counts with lower-precision numbers, and you also save on device RAM. With proper care, the resulting model can lose almost no accuracy.

  • Offloading is to temporarily offload unused data to the CPU or amongst different devices and later read it back when needed. Naive implementations will slow down training a lot, but sophisticated implementations will pre-fetch data so that the device never needs to wait on it. One implementation of this idea is ZeRO which splits the parameters, gradients, and optimizer states across all available hardware and materializes them as needed.

  • Memory Efficient Optimizers have been proposed to reduce the memory footprint of the running state maintained by the optimizer, such as Adafactor.

  • Compression also can be used for storing intermediate results in the network. For example, Gist compresses activations that are saved for the backward pass; DALL·E compresses the gradients before synchronizing them.

At OpenAI, we are training and improving large models from the underlying infrastructure all the way to deployment them for real-world problems. If you’d like to put the ideas from this post into practice—especially relevant for our Scaling and Applied Research teams—we’re hiring!

Thanks to Nikolas Tezak, Sam Altman, Daniel Gackle, Ilya Sutskever, and Steve Dowling for feedback on drafts. Thanks to Justin Jay Wang, Bianca Martin, and Steve Dowling for communications and design.


Best Practices for Deploying Language Models

Best Practices for Deploying Language Models

Best Practices for Deploying Language Models

Cohere, OpenAI, and AI21 Labs have developed a preliminary set of best practices applicable to any organization developing or deploying large language models. Computers that can read and write are here, and they have the potential to fundamentally impact daily life. The future of human–machine interaction is full of possibility and promise, but any powerful technology needs careful deployment.

The joint statement below represents a step towards building a community to address the global challenges presented by AI progress, and we encourage other organizations who would like to participate to get in touch.

Joint Recommendation for Language Model Deployment

We’re recommending several key principles to help providers of large language models (LLMs) mitigate the risks of this technology in order to achieve its full promise to augment human capabilities.

While these principles were developed specifically based on our experience with providing LLMs through an API, we hope they will be useful regardless of release strategy (such as open-sourcing or use within a company). We expect these recommendations to change significantly over time because the commercial uses of LLMs and accompanying safety considerations are new and evolving. We are actively learning about and addressing LLM limitations and avenues for misuse, and will update these principles and practices in collaboration with the broader community over time.

We’re sharing these principles in hopes that other LLM providers may learn from and adopt them, and to advance public discussion on LLM development and deployment.

Prohibit misuse

Publish usage guidelines and terms of use of LLMs in a way that prohibits material harm to individuals, communities, and society such as through spam, fraud, or astroturfing. Usage guidelines should also specify domains where LLM use requires extra scrutiny and prohibit high-risk use-cases that aren’t appropriate, such as classifying people based on protected characteristics.

Build systems and infrastructure to enforce usage guidelines. This may include rate limits, content filtering, application approval prior to production access, monitoring for anomalous activity, and other mitigations.

Mitigate unintentional harm

Proactively mitigate harmful model behavior. Best practices include comprehensive model evaluation to properly assess limitations, minimizing potential sources of bias in training corpora, and techniques to minimize unsafe behavior such as through learning from human feedback.

Document known weaknesses and vulnerabilities, such as bias or ability to produce insecure code, as in some cases no degree of preventative action can completely eliminate the potential for unintended harm. Documentation should also include model and use-case-specific safety best practices.

Thoughtfully collaborate with stakeholders

Build teams with diverse backgrounds and solicit broad input. Diverse perspectives are needed to characterize and address how language models will operate in the diversity of the real world, where if unchecked they may reinforce biases or fail to work for some groups.

Publicly disclose lessons learned regarding LLM safety and misuse in order to enable widespread adoption and help with cross-industry iteration on best practices.

Treat all labor in the language model supply chain with respect. For example, providers should have high standards for the working conditions of those reviewing model outputs in-house and hold vendors to well-specified standards (e.g., ensuring labelers are able to opt out of a given task).

As LLM providers, publishing these principles represents a first step in collaboratively guiding safer large language model development and deployment. We are excited to continue working with each other and with other parties to identify other opportunities to reduce unintentional harms from and prevent malicious use of language models.

Download as PDF

Support from other organizations

“While LLMs hold a lot of promise, they have significant inherent safety issues which need to be worked on. These best practices serve as an important step in minimizing the harms of these models and maximizing their potential benefits.”


“As large language models (LLMs) have become increasingly powerful and expressive, risk mitigation becomes increasingly important. We welcome these and other efforts to proactively seek to mitigate harms and highlight to users areas requiring extra diligence. The principles outlined here are an important contribution to the global conversation.”

—John Bansemer, Director of the CyberAI Project and Senior Fellow, Center for Security and Emerging Technology (CSET)

“Google affirms the importance of comprehensive strategies in analyzing model and training data to mitigate the risks of harm, bias, and misrepresentation. It is a thoughtful step taken by these AI providers to promote the principles and documentation towards AI safety.”

—Google Cloud Platform (GCP)

“The safety of foundation models, such as large language models, is a growing social concern. We commend Cohere, OpenAI, and AI21 Labs for taking a first step to outline high-level principles for responsible development and deployment from the perspective of model developers. There is still much work to be done, and we believe it is essential to engage more voices from academia, industry, and civil society to develop more detailed principles and community norms. As we state in our recent blog post, it is not just the end result but the legitimacy of the process that matters.”

—Percy Liang, Director of the Stanford Center for Research on Foundation Models (CRFM)

Get involved

If you’re developing language models or are working to mitigate their risks, we’d love to talk with you. Please reach out at bestpractices@openai.com.


Powering Next Generation Applications with OpenAI Codex

Powering Next Generation Applications with OpenAI Codex

Powering Next Generation Applications with OpenAI Codex

OpenAI Codex, a natural language-to-code system based on GPT-3, helps turn simple English instructions into over a dozen popular coding languages. Codex was released last August through our API and is the principal building block of GitHub Copilot.

Our motivation behind Codex is to supplement developers’ work and increase productivity. Codex helps computers to better understand people’s intent, which enables everyone to do more with computers. This is an integral part of our mission to build general-purpose AI that benefits all of humanity.

For enterprise customers, Microsoft’s Azure OpenAI Service provides developers with access to Codex and our other models, like GPT-3 and embeddings, along with enterprise-grade capabilities that are built into Microsoft Azure. At its Build conference today, Microsoft announced that Azure OpenAI Service—previously available by invitation only—is now available in a limited access preview. We’re already seeing new applications of Azure OpenAI Service across many industry verticals, from healthcare to financial services.

Applications and Industries

Since its release via our API, we’ve been working closely with developers to build on top of Codex. These applications utilize the system’s capabilities in a variety of categories including creativity, learning, productivity and problem solving.

Applications using Codex:

Powering Next Generation Applications with OpenAI Codex

GitHub Copilot is an AI pair programmer that provides suggestions for whole lines or entire functions right inside the code editor.

Through tight integration with Codex, GitHub Copilot can convert comments to code, autofill repetitive code, suggest tests and show alternatives.

Available for Visual Studio and Visual Studio Code, among other environments, GitHub Copilot works with a broad set of frameworks and languages, and for some programming languages suggests approximately 35% of the code generated by tens of thousands of developers who use it today.

Microsoft announced at its Build developer conference that GitHub Copilot will move to general availability this summer.

Powering Next Generation Applications with OpenAI Codex

Pygma aims to turn Figma designs into high-quality code.

Pygma utilizes Codex to turn Figma designs into different frontend frameworks and match the coding style and preferences of the developer. Codex enables Pygma to help developers do tasks instantly that previously could have taken hours.

“Codex has allowed me to integrate innovative features into my app with very little coding. As someone without a strong machine learning background, certain features like flexible code-tweaking would be incredibly difficult to build in-house. With Codex, it works almost out of the box.”

—Emile Paffard-Wray, Founder, Pygma

Powering Next Generation Applications with OpenAI Codex

Replit is a programming platform for any programming language that lets users collaborate live on projects, learn about code and share work with a community of learners and builders.

Replit leverages Codex to describe what a selection of code is doing in simple language so everyone can get quality explanation and learning tools. Users can highlight selections of code and click “Explain Code” to use Codex to understand its functionality.

“Codex helps learning on Replit better understand code they encounter. We’ve only scratched the surface of what semantic code understanding can offer those who want to get from idea to working code quickly.”

—Amjad Masad, Founder, Replit

Powering Next Generation Applications with OpenAI Codex

Warp is a Rust-based terminal, reimagined from the ground up to help both individuals and teams be more productive in the command-line.

Terminal commands are typically difficult to remember, find and construct. Users often have to leave the terminal and search the web for answers and even then the results might not give them the right command to execute. Warp uses Codex to allow users to run a natural language command to search directly from within the terminal and get a result they can immediately use.

“Codex allows Warp to make the terminal more accessible and powerful. Developers search for entire commands using natural language rather than trying to remember them or assemble them piecemeal. Codex-powered command search has become one of our game changing features.

—Zach Lloyd, Founder, Warp

Powering Next Generation Applications with OpenAI Codex

Machinet helps professional Java developers write quality code by using Codex to generate intelligent unit test templates.

Machinet was able to accelerate their development several-fold by switching from building their own machine learning systems to using Codex. The flexibility of Codex allows for the ability to easily add new features and capabilities saving their users time and helping them be more productive.

“Codex is an amazing tool in our arsenal. Not only does it allow us to generate more meaningful code, but it has also helped us find a new design of product architecture and got us out of a local maximum.”

—Vladislav Yanchenko, Founder, Machinet


DALL·E 2 Research Preview Update

DALL·E 2 Research Preview Update

DALL·E 2 Research Preview Update

Last month, we started previewing DALL·E 2 to a limited number of trusted users to learn about the technology’s capabilities and limitations.

Since then, we’ve been working with our users to actively incorporate the lessons we learn. As of today:

  • Our users have collectively created over 3 million images with DALL·E.
  • We’ve enhanced our safety system, improving the text filters and tuning the automated detection & response system for content policy violations.
  • Less than 0.05% of downloaded or publicly shared images were flagged as potentially violating our content policy. About 30% of those flagged images were confirmed by human reviewers to be policy violations, leading to an account deactivation.
  • As we work to understand and address the biases that DALL·E has inherited from its training data, we’ve asked early users not to share photorealistic generations that include faces and to flag problematic generations. We believe this has been effective in limiting potential harm, and we plan to continue the practice in the current phase.

Learning from real-world use is an important part of our commitment to develop and deploy AI responsibly, so we’re starting to widen access to users who joined our waitlist, slowly but steadily.

We intend to onboard up to 1,000 people every week as we iterate on our safety system and require all users to abide by our content policy. We hope to increase the rate at which we onboard new users as we learn more and gain confidence in our safety system. We’re inspired by what our users have created with DALL·E so far, and excited to see what new users will create.


OpenAI Leadership Team Update

OpenAI Leadership Team Update

OpenAI Leadership Team Update

Greg Brockman is becoming President, a new role which reflects his unique combination of personal coding contributions on our critical path together with company strategy. He is currently focused on training our flagship AI systems.

Brad Lightcap has been pivotal in OpenAI’s growth, scaling our structure, team, and capital base through his oversight of our Finance, Legal, People, and Operations organizations. He will become Chief Operating Officer and expand his focus, working with our Applied AI teams to sharpen our business and commercial strategies. He will also continue to manage the OpenAI Startup Fund.

Mira Murati has done a tremendous job leading our research, product, and partnership functions over the past 18 months. Most recently, she was instrumental in bringing these functions together for the successful release of our DALL·E research. Mira is taking on the role of Chief Technology Officer, reflecting her leadership across these critical areas within OpenAI.

Chris Clark is becoming Head of Nonprofit and Strategic Initiatives. He will lead the operations of OpenAI’s nonprofit parent and key strategic projects including our relationships with mission-aligned partners.

These executives are supported by world-class teams who are the lifeblood of OpenAI, constantly advancing the state of the art in artificial intelligence research and deployment. It’s a pleasure to work alongside such incredible talent and leadership across our company. We are all very excited for the future. (And we’re hiring!)


Measuring Goodhart’s Law

Goodhart’s law famously says: “When a measure becomes a target, it ceases to be a good measure.” Although originally from economics, it’s something we have to grapple with at OpenAI when figuring out how to optimize objectives that are difficult or costly to measure. It’s often necessary to introduce some proxy objective that’s easier or cheaper to measure, but when we do this, we need to be careful not to optimize it too much.

For example, as part of our work to align models like GPT-3 with human intent and values, we would like to optimize things like “How helpful is this response?”, or “How factually accurate is this claim?”. These are complex objectives that require humans to carefully check things over. For this reason, we train a model to predict these human preferences, known as a reward model, and use the reward model’s predictions as a proxy objective. But it’s important to keep track of how well the true objective is being optimized.

In this post we’ll look at some of the mathematics behind how we do this. We’ll focus on a setting that is particularly clean to analyze, in which we have access to the true objective. In practice, even human preferences can fail to measure what we really care about, but we’re setting that issue aside in this post.

Best-of-$n$ sampling

There are many ways in which one could optimize the proxy objective, but perhaps the simplest is best-of-$n$ sampling, also known as rejection sampling or reranking. We simply sample $n$ times and take the one that scores the highest according to the proxy objective.

Although this method is very simple, it can actually be competitive with more advanced techniques such as reinforcement learning, albeit at the cost of more inference-time compute. For example, in WebGPT, our best-of-$64$ model outperformed our reinforcement learning model, perhaps in part because the best-of-$64$ model got to browse many more websites. Even applying best-of-$4$ provided a significant boost to human preferences.

In addition, best-of-$n$ sampling has reliable performance and is straightforward to analyze mathematically, making it well-suited to empirical studies of Goodhart’s law and related phenomena.

The mathematics of best-of-$n$ sampling

Let’s study best-of-$n$ sampling more formally. Suppose we have some sample space $S$ (such as the set of possible question-answer pairs), some probability distribution $P$ over $S$, a true objective (or “reward”) $R_{text{true}}:Stomathbb R$, and a proxy objective $R_{text{proxy}}:Stomathbb R$. Let’s say that we somehow optimize $R_{text{proxy}}$ and thereby obtain some new distribution $P^prime$. Then:

  • The expectation $mathbb E_{x^primesim P^prime}left[R_{text{true}}left(x^primeright)right]$ measures how well we have optimized the true objective.
  • The KL divergence $D_{text{KL}}left(P^primeparallel Pright)$ measures how much optimization we have done. For example, if $P^prime$ is obtained by taking the first sample from $P$ that lies in some subset $S^primesubseteq S$, then this KL divergence is just the negative log probability that a sample from $P$ lies in $S^prime$.

It turns out that in the case of best-of-$n$ sampling, both of these quantities can be estimated efficiently using samples from $P$.

Let’s look at the expectation first. The naive approach is to use a Monte Carlo estimator: run best-of-$n$ sampling many times, measure the true objective on those samples, and average the results. However, there is a better estimator. If we have $Ngeq n$ samples from $P$ overall, then we can simultaneously consider every possible subset of these samples of size $n$, weight each sample by the number of subsets for which it is the best according to the proxy objective, and then take the weighted average true objective score. This weight is just the binomial coefficient $binom{k-1}{n-1}$, where $k$ is the rank of the sample under the proxy objective, from $1$ (worst) up to $N$ (best).[1] As well as using samples more efficiently, this also allows us to reuse samples for different values of $n$.

As for the KL divergence, surprisingly, this turns out to have an exact formula that works for any continuous probability distribution $P$ (i.e., as long as $P$ has no point masses). One might naively guess that the answer is $log n$, since best-of-$n$ is doing something like taking the top $frac 1n$ of the distribution, and this is roughly correct: the exact answer is $log n-frac{n-1}n$.[2]

Together, these estimators allow us to easily analyze how the true objective varies with the amount of optimization applied to the proxy objective.

Here’s a real-life example from WebGPT:

Best-of-$n$ performance for WebGPT 175B

Best-of-$n$ performance for WebGPT, with shaded regions representing $pm 1$ standard error, and the KL axis following a square root scale. Here, the original distribution ($P$) is given by the 175B model trained using behavior cloning, the proxy objective used to compute best-of-$n$ ($R_{text{proxy}}$) is given by the training reward model, and we consider three putatively “true” objectives ($R_{text{true}}$): the training reward model itself, a validation reward model trained on held-out data, and actual human preferences. There isn’t much over-optimization of the proxy objective, but we would expect there to be at higher KLs.

Going beyond best-of-$n$ sampling

The main limitation of best-of-$n$ sampling is that the KL divergence grows logarithmically with $n$, so it is only suitable for applying a small amount of optimization.

To apply more optimization, we typically use reinforcement learning. In the settings we’ve studied so far, such as summarization, we’ve typically been able to reach a KL of around 10 nats using reinforcement learning before the true objective starts to decrease due to Goodhart’s law. We’d have to take $n$ to be around 60,000 to reach this KL using best-of-$n$, and we hope to be able to reach much larger KLs than this with improvements to our reward modeling and reinforcement learning practices.

However, not all nats are equal. Empirically, for small KL budgets, best-of-$n$ better optimizes both the proxy and the true objectives than reinforcement learning. Intuitively, best-of-$n$ is the “brute force” approach, making it more information-theoretically efficient than reinforcement learning, but less computationally efficient at large KLs.[3]

We’re actively studying the scaling properties of proxy objectives as part of our work to align our models with human intent and values. If you’d like to help us with this research, we’re hiring!


Thanks to Suchir Balaji, Paul Christiano, Leo Gao, William Guss, Vineet Kosaraju, John Schulman, Nisan Stiennon, Jeff Wu, and Daniel Ziegler for discussions related to the ideas in this post. Thanks to Greg Brockman, Leo Gao, Jan Leike, Holly Mandel, John Schulman, and Jeff Wu for feedback on drafts. Thanks to Bianca Martin, Steve Dowling, Natalie Summers and Justin Jay Wang for communications and design.


  1. The sum of these weights is $binom{N}{n}$, giving a proof of the Hockey-stick identity. For a formal derivation of the estimator described here, see Appendix I of the WebGPT paper. ↩︎

  2. Hint: express the PDF of the best-of-$n$ distribution as a function of both the PDF and the CDF of the original distribution. ↩︎

  3. Best-of-$n$ is not necessarily optimal in the information-theoretic sense, however. For example, if $P$ has a heavy right tail, then for any $x>0$ and any $varepsilon>0$, there is a distribution $Q$ such that $mathbb E_{ysim Q}left[yright]>x$ and $D_{text{KL}}left(Qparallel Pright)<varepsilon$ (exercise). ↩︎