CLIP: Connecting Text and Images

We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP (Contrastive Language–Image Pre-training) can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the “zero-shot” capabilities of GPT-2 and GPT-3.


Although deep learning has revolutionized computer vision, current approaches have several major problems: typical vision datasets are labor intensive and costly to create while teaching only a narrow set of visual concepts; standard vision models are good at one task and one task only, and require significant effort to adapt to a new task; and models that perform well on benchmarks have disappointingly poor performance on stress tests, casting doubt on the entire deep learning approach to computer vision.

We present a neural network that aims to address these problems: it is trained on a wide variety of images with a wide variety of natural language supervision that’s abundantly available on the internet. By design, the network can be instructed in natural language to perform a great variety of classification benchmarks, without directly optimizing for the benchmark’s performance, similar to the “zero-shot” capabilities of GPT-2 and GPT-3. This is a key change: because it does not directly optimize for the benchmark, its benchmark performance becomes much more representative of real-world performance. Our system closes this “robustness gap” by up to 75% while matching the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the original 1.28M labeled examples.

Although both models have the same accuracy on the ImageNet test set, CLIP’s performance is much more representative of how it will fare on datasets that measure accuracy in different, non-ImageNet settings. For instance, ObjectNet checks a model’s ability to recognize objects in many different poses and with many different backgrounds inside homes while ImageNet Rendition and ImageNet Sketch check a model’s ability to recognize more abstract depictions of objects.

Background and related work

CLIP builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. The idea of zero-data learning dates back over a decade, but until recently it was mostly studied in computer vision as a way of generalizing to unseen object categories. A critical insight was to leverage natural language as a flexible prediction space to enable generalization and transfer. In 2013, Richard Socher and co-authors at Stanford developed a proof of concept by training a model on CIFAR-10 to make predictions in a word-vector embedding space and showed that this model could predict two unseen classes. The same year, DeViSE scaled this approach and demonstrated that it was possible to fine-tune an ImageNet model so that it could generalize to correctly predicting objects outside the original set of 1,000 training categories.

Most inspirational for CLIP is the work of Ang Li and his co-authors at FAIR who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset. They achieved this by fine-tuning an ImageNet CNN to predict a much wider set of visual concepts (visual n-grams) from the text of titles, descriptions, and tags of 30 million Flickr photos and were able to reach 11.5% accuracy on ImageNet zero-shot.

Finally, CLIP is part of a group of papers revisiting learning visual representations from natural language supervision in the past year. This line of work uses more modern architectures like the Transformer and includes VirTex, which explored autoregressive language modeling, ICMLM, which investigated masked language modeling, and ConVIRT, which studied the same contrastive objective we use for CLIP but in the field of medical imaging.

Approach

We show that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a great variety of image classification datasets. Our method uses an abundantly available source of supervision: the text paired with images found across the internet. This data is used to create the following proxy training task for CLIP: given an image, predict which out of a set of 32,768 randomly sampled text snippets was actually paired with it in our dataset.

In order to solve this task, our intuition is that CLIP models will need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. For instance, if the task of a dataset is classifying photos of dogs vs cats we check for each image whether a CLIP model predicts the text description “a photo of a dog” or “a photo of a cat” is more likely to be paired with it.


CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in our dataset. We then use this behavior to turn CLIP into a zero-shot classifier. We convert all of a dataset’s classes into captions such as “a photo of a dog” and predict the class of the caption CLIP estimates best pairs with a given image.
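
To make this concrete, here is a rough sketch of the zero-shot procedure in Python. The encoder functions, prompt template, and class names are placeholders for illustration, not CLIP’s actual API:

import numpy as np

def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    # Build one caption per class, e.g. "a photo of a dog", "a photo of a cat".
    captions = [f"a photo of a {name}" for name in class_names]

    # Embed the image and every caption, then L2-normalize both sides.
    image_emb = image_encoder(image)
    text_embs = np.stack([text_encoder(c) for c in captions])
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # Cosine similarity between the image and each caption; the best match wins.
    scores = text_embs @ image_emb
    return class_names[int(np.argmax(scores))]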

CLIP was designed to mitigate a number of major problems in the standard deep learning approach to computer vision:

Costly datasets: Deep learning needs a lot of data, and vision models have traditionally been trained on manually labeled datasets that are expensive to construct and only provide supervision for a limited number of predetermined visual concepts. The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text–image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior work, notably self-supervised learning, contrastive methods, self-training approaches, and generative modeling.

Narrow: An ImageNet model is good at predicting the 1000 ImageNet categories, but that’s all it can do “out of the box.” If we wish to perform any other task, an ML practitioner needs to build a new dataset, add an output head, and fine-tune the model. In contrast, CLIP can be adapted to perform a wide variety of visual classification tasks without needing additional training examples. To apply CLIP to a new task, all we need to do is “tell” CLIP’s text-encoder the names of the task’s visual concepts, and it will output a linear classifier of CLIP’s visual representations. The accuracy of this classifier is often competitive with fully supervised models.

We show random, non-cherry picked, predictions of zero-shot CLIP classifiers on examples from various datasets below.


Poor real-world performance: Deep learning systems are often reported to achieve human or even superhuman performance[1] on vision benchmarks, yet when deployed in the wild, their performance can be far below the expectation set by the benchmark. In other words, there is a gap between “benchmark performance” and “real performance.” We conjecture that this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. This results in its benchmark performance being much more representative of its performance in the wild. To verify the “cheating hypothesis”, we also measure how CLIP’s performance changes when it is able to “study” for ImageNet. When a linear classifier is fitted on top of CLIP’s features, it improves CLIP’s accuracy on the ImageNet test set by almost 10%. However, this classifier does no better on average across an evaluation suite of 7 other datasets measuring “robust” performance.
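
For reference, a linear probe of this kind is typically just a logistic regression fit on frozen image features. The sketch below uses scikit-learn and random stand-in features rather than CLIP’s actual tooling or data:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_probe(train_features, train_labels, test_features, test_labels):
    # The features are frozen image embeddings; only this linear classifier is trained.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_features, train_labels)
    return probe.score(test_features, test_labels)

# Toy usage with random placeholder features (real features would come from CLIP).
rng = np.random.default_rng(0)
accuracy = fit_linear_probe(rng.normal(size=(100, 512)), rng.integers(0, 2, size=100),
                            rng.normal(size=(20, 512)), rng.integers(0, 2, size=20))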

Key takeaways

1. CLIP is highly efficient

CLIP learns from unfiltered, highly varied, and highly noisy data, and is intended to be used in a zero-shot manner. We know from GPT-2 and GPT-3 that models trained on such data can achieve compelling zero-shot performance; however, such models require significant training compute. To reduce the needed compute, we focused on algorithmic ways to improve the training efficiency of our approach.

We report two algorithmic choices that led to significant compute savings. The first choice is the adoption of a contrastive objective for connecting text with images. We originally explored an image-to-text approach, similar to VirTex, but encountered difficulties scaling it to achieve state-of-the-art performance. In small- to medium-scale experiments, we found that the contrastive objective used by CLIP is 4x to 10x more efficient at zero-shot ImageNet classification. The second choice was the adoption of the Vision Transformer, which gave us a further 3x gain in compute efficiency over a standard ResNet. In the end, our best-performing CLIP model trains on 256 GPUs for 2 weeks, which is similar to existing large-scale image models.
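
The core of such a contrastive objective can be sketched as a symmetric cross-entropy over the similarity matrix of N matching image–text pairs. This NumPy sketch conveys the general idea only and is not the exact training code:

import numpy as np

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs, text_embs: (N, d) embeddings of N matching image-text pairs.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # Pairwise cosine similarities, scaled by a temperature.
    logits = image_embs @ text_embs.T / temperature

    def cross_entropy(l):
        # The correct pairing for row i is column i (the diagonal).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric loss: classify texts given images and images given texts.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))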


We originally explored training image-to-caption language models but found this approach struggled at zero-shot transfer. In this 16-GPU-day experiment, a language model achieves only 16% accuracy on ImageNet after training on 400 million images. CLIP is much more efficient and achieves the same accuracy roughly 10x faster.

2. CLIP is flexible and general

Because they learn a wide range of visual concepts directly from natural language, CLIP models are significantly more flexible and general than existing ImageNet models. We find they are able to perform many different tasks zero-shot. To validate this we have measured CLIP’s zero-shot performance on over 30 different datasets, including tasks such as fine-grained object classification, geo-localization, action recognition in videos, and OCR.[2] In particular, learning OCR is an example of an exciting behavior that does not occur in standard ImageNet models. Above, we visualize a random, non-cherry-picked prediction from each zero-shot classifier.

This finding is also reflected on a standard representation learning evaluation using linear probes. The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student EfficientNet-L2, on 20 out of 26 different transfer datasets we tested.


Across a suite of 27 datasets measuring tasks such as fine-grained object classification, OCR, activity recognition in videos, and geo-localization, we find that CLIP models learn more widely useful image representations. CLIP models are also more compute efficient than the models from 10 prior approaches that we compare with.

Limitations

While CLIP usually performs well on recognizing common objects, it struggles on more abstract or systematic tasks such as counting the number of objects in an image and on more complex tasks such as predicting how close the nearest car is in a photo. On these two datasets, zero-shot CLIP is only slightly better than random guessing. Zero-shot CLIP also struggles compared to task specific models on very fine-grained classification, such as telling the difference between car models, variants of aircraft, or flower species.

CLIP also still has poor generalization to images not covered in its pre-training dataset. For instance, although CLIP learns a capable OCR system, when evaluated on handwritten digits from the MNIST dataset, zero-shot CLIP only achieves 88% accuracy, well below the 99.75% of humans on the dataset. Finally, we’ve observed that CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial and error “prompt engineering” to perform well.

Broader impacts

CLIP allows people to design their own classifiers and removes the need for task-specific training data. The manner in which these classes are designed can heavily influence both model performance and model biases. For example, we find that when given a set of labels including FairFace race labels[3] and a handful of egregious terms such as “criminal” and “animal,” the model tends to classify images of people aged 0–20 into the egregious categories at a rate of ~32.3%. However, when we add the class “child” to the list of possible classes, this behavior drops to ~8.7%.

Additionally, given that CLIP does not need task-specific training data it can unlock certain niche tasks with greater ease. Some of these tasks may raise privacy or surveillance related risks and we explore this concern by studying the performance of CLIP on celebrity identification. CLIP has a top-1 accuracy of 59.2% for “in the wild” celebrity image classification when choosing from 100 candidates and a top-1 accuracy of 43.3% when choosing from 1000 possible choices. Although it’s noteworthy to achieve these results with task agnostic pre-training, this performance is not competitive when compared to widely available production level models. We further explore challenges that CLIP poses in our paper and we hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models. We are excited to engage with the research community on such questions.

Conclusion

With CLIP, we’ve tested whether task agnostic pre-training on internet scale natural language, which has powered a recent breakthrough in NLP, can also be leveraged to improve the performance of deep learning for other fields. We are excited by the results we’ve seen so far applying this approach to computer vision. Like the GPT family, CLIP learns a wide variety of tasks during pre-training which we demonstrate via zero-shot transfer. We are also encouraged by our findings on ImageNet that suggest zero-shot evaluation is a more representative measure of a model’s capability.


DALL·E: Creating Images from Text

DALL·E[1] is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog

Text prompt: a store front that has the word ‘openai’ written on it […]

Text prompt: an armchair in the shape of an avocado […]

Text and image prompt: the exact same cat on the top as a sketch on the bottom

GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

Overview

Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another.[2] This training procedure allows DALL·E to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.

We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer term ethical challenges implied by this technology.

Capabilities

We find that DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP, but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside.[3]

Controlling attributes

We test DALL·E’s ability to modify several of an object’s attributes, as well as the number of times that it appears.

Text prompt: a pentagonal green clock. a green clock in the shape of a pentagon.

We find that DALL·E can render familiar objects in polygonal shapes that are sometimes unlikely to occur in the real world. For some objects, such as “picture frame” and “plate,” DALL·E can reliably draw the object in any of the polygonal shapes except heptagon. For other objects, such as “manhole cover” and “stop sign,” DALL·E’s success rate for more unusual shapes, such as “pentagon,” is considerably lower.

For several of the visuals in this post, we find that repeating the caption, sometimes with alternative phrasings, improves the consistency of the results.

Text prompt: a cube made of porcupine. a cube with the texture of a porcupine.

We find that DALL·E can map the textures of various plants, animals, and other objects onto three dimensional solids. As in the preceding visual, we find that repeating the caption with alternative phrasing improves the consistency of the results.

Text prompt: a collection of glasses is sitting on a table

We find that DALL·E is able to draw multiple copies of an object when prompted to do so, but is unable to reliably count past three. When prompted to draw nouns for which there are multiple meanings, such as “glasses,” “chips,” and “cups,” it sometimes draws both interpretations, depending on the plural form that is used.

Drawing multiple objects

Simultaneously controlling multiple objects, their attributes, and their spatial relationships presents a new challenge. For example, consider the phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and green pants”. To correctly interpret this sentence, DALL·E must not only correctly compose each piece of apparel with the animal, but also form the associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, green) without mixing them up.[4] We test DALL·E’s ability to do this for relative positioning, stacking objects, and controlling multiple attributes.

Text prompt: a small red block sitting on a large green block

We find that DALL·E correctly responds to some types of relative positions, but not others. The choices “sitting on” and “standing in front of” sometimes appear to work, but “sitting below,” “standing behind,” “standing left of,” and “standing right of” do not. DALL·E also has a lower success rate when asked to draw a large object sitting on top of a smaller one, compared to the other way around.

Text prompt: a stack of 3 cubes. a red cube is on the top, sitting on a green cube. the green cube is in the middle, sitting on a blue cube. the blue cube is on the bottom.

We find that DALL·E typically generates an image with one or two of the objects having the correct colors. However, only a few samples for each setting tend to have exactly three objects colored precisely as specified.

Text prompt: an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants

We find that DALL·E typically generates an image with two or three articles of clothing having the correct colors. However, only a few of the samples for each setting tend to have all four articles of clothing with the specified colors.

While DALL·E does offer some level of controllability over the attributes and positions of a small number of objects, the success rate can depend on how the caption is phrased. As more objects are introduced, DALL·E is prone to confusing the associations between the objects and their colors, and the success rate decreases sharply. We also note that DALL·E is brittle with respect to rephrasing of the caption in these scenarios: alternative, semantically equivalent captions often yield no correct interpretations.

Visualizing perspective and three-dimensionality

We find that DALL·E also allows for control over the viewpoint of a scene and the 3D style in which a scene is rendered.

Text prompt: an extreme close-up view of a capybara sitting in a field

We find that DALL·E can draw each of the animals in a variety of different views. Some of these views, such as “aerial view” and “rear view,” require knowledge of the animal’s appearance from unusual angles. Others, such as “extreme close-up view,” require knowledge of the fine-grained details of the animal’s skin or fur.

Text prompt: a capybara made of voxels sitting in a field

We find that DALL·E is often able to modify the surface of each of the animals according to the chosen 3D style, such as “claymation” and “made of voxels,” and render the scene with plausible shading depending on the location of the sun. The “x-ray” style does not always work reliably, but it shows that DALL·E can sometimes orient the bones within the animal in plausible (though not anatomically correct) configurations.

To push this further, we test DALL·E’s ability to repeatedly draw the head of a well-known figure at each angle from a sequence of equally spaced angles, and find that we can recover a smooth animation of the rotating head.

Text and image prompt: a photograph of a bust of homer

We prompt DALL·E with both a caption describing a well-known figure and the top region of an image showing a hat drawn at a particular angle. Then, we ask DALL·E to complete the remaining part of the image given this contextual information. We do this repeatedly, each time rotating the hat a few more degrees, and find that we are able to recover smooth animations of several well-known figures, with each frame respecting the precise specification of angle and ambient lighting.

DALL·E appears to be able to apply some types of optical distortions to scenes, as we see with the options “fisheye lens view” and “a spherical panorama.” This motivated us to explore its ability to generate reflections.

Text and image prompt: a plain white cube looking at its own reflection in a mirror. a plain white cube gazing at itself in a mirror.

Similar to what was done before, we prompt DALL·E to complete the bottom-right corners of a sequence of frames, each of which contains a mirror and reflective floor. While the reflection in the mirror usually resembles the object outside of it, it often does not render the reflection in a physically correct way. By contrast, the reflection of an object drawn on a reflective floor is typically more plausible.

Visualizing internal and external structure

The samples from the “extreme close-up view” and “x-ray” style led us to further explore DALL·E’s ability to render internal structure with cross-sectional views, and external structure with macro photographs.

Text prompt: a cross-section view of a walnut

We find that DALL·E is able to draw the interiors of several different kinds of objects.

Text prompt: a macro photograph of brain coral

We find that DALL·E is able to draw the fine-grained external details of several different kinds of objects. These details are only apparent when the object is viewed up close.

Inferring contextual details

The task of translating text to images is underspecified: a single caption generally corresponds to an infinitude of plausible images, so the image is not uniquely determined. For instance, consider the caption “a painting of a capybara sitting on a field at sunrise.” Depending on the orientation of the capybara, it may be necessary to draw a shadow, though this detail is never mentioned explicitly. We explore DALL·E’s ability to resolve underspecification in three cases: changing style, setting, and time; drawing the same object in a variety of different situations; and generating an image of an object with specific text written on it.

Text prompt: a painting of a capybara sitting in a field at sunrise

We find that DALL·E is able to render the same scene in a variety of different styles, and can adapt the lighting, shadows, and environment based on the time of day or season.

Text prompt: a stained glass window with an image of a blue strawberry

We find that DALL·E is able to flexibly adapt the representation of the object based on the medium on which it is being drawn. For “a mural,” “a soda can,” and “a teacup,” DALL·E must change how it draws the object based on the angle and curvature of the drawing surface. For “a stained glass window” and “a neon sign,” it must alter the appearance of the object from how it usually appears.

Text prompt: a store front that has the word ‘openai’ written on it. a store front that has the word ‘openai’ written on it. a store front that has the word ‘openai’ written on it. ‘openai’ store front.

We find that DALL·E is sometimes able to render text and adapt the writing style to the context in which it appears. For example, “a bag of chips” and “a license plate” each requires different types of fonts, and “a neon sign” and “written in the sky” require the appearance of the letters to be changed.

Generally, the longer the string that DALL·E is prompted to write, the lower the success rate. We find that the success rate improves when parts of the caption are repeated. Additionally, the success rate sometimes improves as the sampling temperature for the image is decreased, although the samples become simpler and less realistic.

With varying degrees of reliability, DALL·E provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.

Unlike a 3D rendering engine, whose inputs must be specified unambiguously and in complete detail, DALL·E is often able to “fill in the blanks” when the caption implies that the image must contain a certain detail that is not explicitly stated.

Applications of preceding capabilities

Next, we explore the use of the preceding capabilities for fashion and interior design.

Text and image prompt: a male mannequin dressed in an orange and black flannel shirt

We explore DALL·E’s ability to render male mannequins in a variety of different outfits. When prompted with two colors, e.g., “an orange and white bomber jacket” and “an orange and black turtleneck sweater,” DALL·E often exhibits a range of possibilities for how both colors can be used for the same article of clothing.

DALL·E also seems to occasionally confuse less common colors with other neighboring shades. For example, when prompted to draw clothes in “navy,” DALL·E sometimes uses lighter shades of blue, or shades very close to black. Similarly, DALL·E sometimes confuses “olive” with shades of brown or brighter shades of green.

Text and image prompt: a female mannequin dressed in a black leather jacket and gold pleated skirt

We explore DALL·E’s ability to render female mannequins in a variety of different outfits. We find that DALL·E is able to portray unique textures such as the sheen of a “black leather jacket” and “gold” skirts and leggings. As before, we see that DALL·E occasionally confuses less common colors, such as “navy” and “olive,” with other neighboring shades.

Text and image prompt: a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace.

We explore DALL·E’s ability to generate images of rooms with several details specified. We find that it can generate paintings of a wide range of different subjects, including real-world locations such as “the colosseum” and fictional characters like “yoda.” For each subject, DALL·E exhibits a variety of interpretations. While the painting is almost always present in the scene, DALL·E sometimes fails to draw the fireplace or the correct number of armchairs.

Text and image prompt: a loft bedroom with a white bed next to a nightstand. there is a fish tank beside the bed.

We explore DALL·E’s ability to generate bedrooms with several details specified. Despite the fact that we do not tell DALL·E what should go on top of the nightstand or shelf beside the bed, we find that it sometimes decides to place the other specified object on top. As before, we see that it often fails to draw one or more of the specified objects.

Combining unrelated concepts

The compositional nature of language allows us to put together concepts to describe both real and imaginary things. We find that DALL·E also has the ability to combine disparate ideas to synthesize objects, some of which are unlikely to exist in the real world. We explore this ability in two instances: transferring qualities from various concepts to animals, and designing products by taking inspiration from unrelated concepts.

Text prompt: a snail made of harp. a snail with the texture of a harp.

We find that DALL·E can generate animals synthesized from a variety of concepts, including musical instruments, foods, and household items. While not always successful, we find that DALL·E sometimes takes the forms of the two objects into consideration when determining how to combine them. For example, when prompted to draw “a snail made of harp,” it sometimes relates the pillar of the harp to the spiral of the snail’s shell.

In a previous section, we saw that as more objects are introduced into the scene, DALL·E is liable to confuse the associations between the objects and their specified attributes. Here, we see a different sort of failure mode: sometimes, rather than binding some attribute of the specified concept (say, “a faucet”) to the animal (say, “a snail”), DALL·E just draws the two as separate items.

Text prompt: an armchair in the shape of an avocado. an armchair imitating an avocado.

In the preceding visual, we explored DALL·E’s ability to generate fantastical objects by combining two unrelated ideas. Here, we explore its ability to take inspiration from an unrelated idea while respecting the form of the thing being designed, ideally producing an object that appears to be practically functional. We found that prompting DALL·E with the phrases “in the shape of,” “in the form of,” and “in the style of” gives it the ability to do this.

When generating some of these objects, such as “an armchair in the shape of an avocado”, DALL·E appears to relate the shape of a half avocado to the back of the chair, and the pit of the avocado to the cushion. We find that DALL·E is susceptible to the same kinds of mistakes mentioned in the previous visual.

Animal illustrations

In the previous section, we explored DALL·E’s ability to combine unrelated concepts when generating images of real-world objects. Here, we explore this ability in the context of art, for three kinds of illustrations: anthropomorphized versions of animals and objects, animal chimeras, and emojis.

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog

We find that DALL·E is sometimes able to transfer some human activities and articles of clothing to animals and inanimate objects, such as food items. We include “pikachu” and “wielding a blue lightsaber” to explore DALL·E’s ability to incorporate popular media.

We find it interesting how DALL·E adapts human body parts onto animals. For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL·E often draws the kerchief, hands, and feet in plausible locations.

Text prompt: a professional high quality illustration of a giraffe turtle chimera. a giraffe imitating a turtle. a giraffe made of turtle.

We find that DALL·E is sometimes able to combine distinct animals in plausible ways. We include “pikachu” to explore DALL·E’s ability to incorporate knowledge of popular media, and “robot” to explore its ability to generate animal cyborgs. Generally, the features of the second animal mentioned in the caption tend to be dominant.

We also find that inserting the phrase “professional high quality” before “illustration” and “emoji” sometimes improves the quality and consistency of the results.

Text prompt: a professional high quality emoji of a lovestruck cup of boba

We find that DALL·E is sometimes able to transfer some emojis to animals and inanimate objects, such as food items. As in the preceding visual, we find that inserting the phrase “professional high quality” before “emoji” sometimes improves the quality and consistency of the results.

Zero-shot visual reasoning

GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way.

Text and image prompt: the exact same cat on the top as a sketch on the bottom

We find that DALL·E is able to apply several kinds of image transformations to photos of animals, with varying degrees of reliability. The most straightforward ones, such as “photo colored pink” and “photo reflected upside-down,” also tend to be the most reliable, although the photo is often not copied or reflected exactly. The transformation “animal in extreme close-up view” requires DALL·E to recognize the breed of the animal in the photo, and render it up close with the appropriate details. This works less reliably, and for several of the photos, DALL·E only generates plausible completions in one or two instances.

Other transformations, such as “animal with sunglasses” and “animal wearing a bow tie,” require placing the accessory on the correct part of the animal’s body. Those that only change the color of the animal, such as “animal colored pink,” are less reliable, but show that DALL·E is sometimes capable of segmenting the animal from the background. Finally, the transformations “a sketch of the animal” and “a cell phone case with the animal” explore the use of this capability for illustrations and product design.

Text and image prompt: the exact same teapot on the top with ‘gpt’ written on it on the bottom

We find that DALL·E is able to apply several different kinds of image transformations to photos of teapots, with varying degrees of reliability. Aside from being able to modify the color of the teapot (e.g., “colored blue”) or its pattern (e.g., “with stripes”), DALL·E can also render text (e.g., “with ‘gpt’ written on it”) and map the letters onto the curved surface of the teapot in a plausible way. With much less reliability, it can also draw the teapot in a smaller size (for the “tiny” option) and in a broken state (for the “broken” option).

We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E’s aptitude for analogical reasoning problems by testing it on Raven’s progressive matrices, a visual IQ test that saw widespread use in the 20th century.

Text and example image prompt: a sequence of geometric shapes.

Rather than treating the IQ test as a multiple-choice problem as originally intended, we ask DALL·E to complete the bottom-right corner of each image using argmax sampling, and consider its completion to be correct if it is a close visual match to the original.

DALL·E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning, such as those in sets B and C. It is sometimes able to solve matrices that involve recognizing permutations and applying boolean operations, such as those in set D. The instances in set E tend to be the most difficult, and DALL·E gets almost none of them correct.

For each of the sets, we measure DALL·E’s performance on both the original images, and the images with the colors inverted. The inversion of colors should pose no additional difficulty for a human, yet does generally impair DALL·E’s performance, suggesting its capabilities may be brittle in unexpected ways.

Geographic knowledge

We find that DALL·E has learned about geographic facts, landmarks, and neighborhoods. Its knowledge of these concepts is surprisingly precise in some ways and flawed in others.

Text prompt: a photo of the food of china

We test DALL·E’s understanding of simple geographical facts, such as country flags, cuisines, and local wildlife. While DALL·E successfully answers many of these queries, such as those involving national flags, it often reflects superficial stereotypes for choices like “food” and “wildlife,” as opposed to representing the full diversity encountered in the real world.

Text prompt: a photo of alamo square, san francisco, from a street at night

We find that DALL·E is sometimes capable of rendering semblances of certain locations in San Francisco. For locations familiar to the authors, such as San Francisco, these generations evoke a sense of déjà vu—eerie simulacra of streets, sidewalks, and cafes that remind us of very specific locations that do not exist.

Text and image prompts: a photo of san francisco’s golden gate bridge

We can also prompt DALL·E to draw famous landmarks. In fact, we can even dictate when the photo was taken by specifying the first few rows of the sky. When the sky is dark, for example, DALL·E recognizes it is night, and turns on the lights in the buildings.

Temporal knowledge

In addition to exploring DALL·E’s knowledge of concepts that vary over space, we also explore its knowledge of concepts that vary over time.

Text and image prompt: a photo of a phone from the 20s

We find that DALL·E has learned about basic stereotypical trends in design and technology over the decades. Technological artifacts appear to go through periods of explosion of change, dramatically shifting for a decade or two, then changing more incrementally, becoming refined and streamlined.

Summary of approach and prior work

DALL·E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens—256 for the text and 1024 for the image—and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. We plan to provide more details about the architecture and training procedure in an upcoming paper.
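
As a rough illustration of this input format, the sketch below assembles a 1,280-token stream from caption tokens and discrete image tokens and builds a causal mask; the vocabulary sizes, padding scheme, and offsets are assumptions for illustration, not the actual implementation:

import numpy as np

TEXT_LEN, IMAGE_LEN = 256, 1024   # 256 text tokens + 1,024 image tokens = 1,280
TEXT_VOCAB = 16384                # assumed text vocabulary size (illustrative only)
PAD_ID = 0

def build_token_stream(text_tokens, image_tokens):
    # Pad or truncate the BPE-encoded caption to a fixed length, then append the
    # discrete image tokens, offset so the two vocabularies do not collide.
    text = np.full(TEXT_LEN, PAD_ID, dtype=np.int64)
    n = min(len(text_tokens), TEXT_LEN)
    text[:n] = np.asarray(text_tokens[:n], dtype=np.int64)
    image = np.asarray(image_tokens[:IMAGE_LEN], dtype=np.int64) + TEXT_VOCAB
    return np.concatenate([text, image])  # shape (1280,), modeled left to right

def causal_mask(n=TEXT_LEN + IMAGE_LEN):
    # Each position attends only to itself and earlier positions; DALL·E further
    # sparsifies the image-token attention with row/column/convolutional patterns.
    return np.tril(np.ones((n, n), dtype=bool))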

Text-to-image synthesis has been an active area of research since the pioneering work of Reed et al., whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN and StackGAN++ use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN incorporates attention between the text and image features, and proposes a contrastive text–image feature matching loss as an auxiliary objective. This is interesting to compare to our reranking with CLIP, which is done offline. Other work incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et al. and Cho et al. explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.

Similar to the rejection sampling used in VQVAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search, and can have a dramatic impact on sample quality.
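
A minimal sketch of this reranking step: generate many candidates, score each against the caption with CLIP, and keep the highest-scoring few. The scoring and sampling functions below are placeholders, not the actual pipeline:

def rerank_with_clip(caption, candidate_images, clip_score, top_k=32):
    # clip_score(caption, image) is assumed to return an image-text similarity,
    # e.g. the cosine similarity between CLIP's text and image embeddings.
    ranked = sorted(candidate_images,
                    key=lambda image: clip_score(caption, image),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical usage:
#   samples = sample_dalle(caption, n=512)                  # placeholder sampler
#   best = rerank_with_clip(caption, samples, clip_score)   # keep top 32 of 512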

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog

Reranking the samples from DALL·E using CLIP can dramatically improve consistency and quality of the samples.


Just desserts: Baking with AI-made recipes

It’s winter, it’s the holidays and it’s quarantine-times: It’s the perfect recipe for doing a ton of baking. In fact, U.S. search interest in “baking” spiked in both November and December 2020.

But being in the AI field, we decided to dive a little deeper into the trend and try to understand the science behind what makes cookies crunchy, cake spongy and bread fluffy — and we decided to do it with the help of machine learning. Plus, we used our ML model to come up with two completely new baking recipes: a cakie (cake-cookie hybrid) and a breakie (bread-cookie hybrid). (Don’t worry, recipes included below.)

We started off by collecting hundreds of cookie, cake and bread recipes. Then we converted all of their ingredients to ounces and whittled them down to a few essential ingredients (yeast, flour, sugar, eggs, butter and a few other things). Next we did a bit of reorganizing, since according to Paul Hollywood, treats like banana, zucchini and pumpkin bread are really more cake than they are bread.

Then we used a Google Cloud tool called AutoML Tables to build a machine learning model that analyzed a recipe’s ingredient amounts and predicted whether it was a recipe for cookies, cake or bread. If you’ve never tried AutoML Tables, it’s a code-free way to build models from the type of data you’d find in a spreadsheet, like numbers and categories – no data science background required.

Our model was able to accurately tag breads, cookies and cakes, but could also identify recipes it deemed “hybrids” — something that’s, say, 50% cake and 50% bread, or something that’s 50% cake and 50% cookie. We named two such combinations the “breakie” (a bread-cookie — “brookie” was already taken) and the “cakie” (a cake-cookie) respectively. 
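
AutoML Tables is code-free, but the same idea can be sketched with any off-the-shelf classifier: train on ingredient amounts, then look at the predicted class probabilities to spot roughly 50/50 hybrids. The ingredient columns, toy data, and threshold below are illustrative, not our actual setup:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Rows are recipes, columns are ingredient amounts in ounces (toy random data here).
rng = np.random.default_rng(0)
X = rng.uniform(0, 16, size=(300, 6))  # flour, sugar, butter, egg, yeast, milk
y = rng.choice(["cookie", "cake", "bread"], size=300)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# A "hybrid" is a recipe whose top two class probabilities are both near 50%.
probs = model.predict_proba(X)
top_two = np.sort(probs, axis=1)[:, -2:]
hybrid_rows = np.where((top_two > 0.4).all(axis=1))[0]

# Rough analogue of AutoML Tables' per-model feature importance.
importances = model.feature_importances_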

Being science-minded bakers, we had to experimentally verify if these hybrid treats could really be made. You know, for science.

Behold the cakie: It has the crispiness of a cookie and the, well, “cakiness” of a cake.

Image showing a cake-like cookie with a slice cut out of it.

We also made breakies, which were more like fluffy cookies, almost the consistency of a muffin.

Image showing a woman with dark brown hair looking into the camera while holding up a tray of puffy-looking cookies, which are actually bread-like cookies.

Sara’s first batch of breakies.

Beyond just generating recipes, we also used our model to understand what made the consistency of cookies, cakes and breads so different. For that, we used a metric called “feature importance,” which is automatically calculated by AutoML Tables.

In our case, the amount of butter, sugar, yeast and egg in a recipe all seemed to be important indicators of “cookieness” (or cakiness or breadiness). AutoML Tables lets you look at feature importance both for your model as a whole and for individual predictions. Below are the most important features for our model as a whole, meaning these ingredients were the biggest signals for our model across many different cake, cookie and bread recipes:

A chart showing the feature importance of items like butter, sugar, yeast, egg, and so on in each of the recipes.

If you find yourself with extra time and an experimental spirit, try out our recipes and let us know what you think. And you can find all the details of what we learned from our ML model in the technical blog post.

A recipe card for a cakie.
A recipe card for a breakie.

Most importantly, if you come up with an even better cakie or breakie recipe, please let us know.


The Successor Representation, γ-Models, and Infinite-Horizon Prediction



Standard single-step models have a horizon of one. This post describes a method for training predictive dynamics models in continuous state spaces with an infinite, probabilistic horizon.

Reinforcement learning algorithms are frequently categorized by whether they predict future states at any point in their decision-making process. Those that do are called model-based, and those that do not are dubbed model-free. This classification is so common that we mostly take it for granted these days; I am guilty of using it myself. However, this distinction is not as clear-cut as it may initially seem.

In this post, I will talk about an alternative view that emphasizes the mechanism of prediction instead of the content of prediction. This shift in focus brings into relief a space between model-based and model-free methods that contains exciting directions for reinforcement learning. The first half of this post describes some of the classic tools in this space, including generalized value functions and the successor representation. The latter half is based on our recent paper about infinite-horizon predictive models, for which code is available here.
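
For context, the successor representation mentioned above is commonly written as the expected discounted occupancy of future states under a policy; a standard definition (one convention among several) is:

$$ M^{\pi}(s, s') = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathbf{1}(s_t = s') \;\middle|\; s_0 = s \right] $$

The γ-models discussed in the post generalize this idea to continuous state spaces by learning a generative model of this discounted occupancy rather than a single-step dynamics model.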

Implementing a custom labeling GUI with built-in processing logic with Amazon SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning. It offers easy access to Amazon Mechanical Turk and private human labelers, and provides them with built-in workflows and interfaces for common labeling tasks.

A labeling team may wish to use the powerful customization features in Ground Truth to modify:

  • The look and feel of the workers’ graphical user interface (GUI)
  • The backend AWS Lambda functions that perform the preprocessing and postprocessing logic.

Depending on the nature of your labeling job and your use case, your customization requirements may vary.

In this post, via a custom workflow, I show you how to implement a text classification labeling job consisting of a custom GUI, built-in preprocessing and postprocessing logic, and encrypted output.  (For our example, workers are tasked to determine whether a sentence references a person, animal, or plant.)  I also provide you with an overview of the prerequisites, the code, and estimated costs of implementing the solution.

Understanding task types and processing logic

In this section, I’ll discuss the use cases surrounding built-in vs custom task types and processing logic.

Built-in task types that implement built-in GUIs and built-in processing logic

Ground Truth provides several built-in task types that cover many image, text, video, video frame, and 3D point cloud labeling use cases.

If you want to implement one of these built-in task types, along with a default labeling GUI, creating a labeling job requires no customization steps.

Custom task types that implement custom GUIs and custom processing logic

If the built-in task types don’t satisfy your labeling job requirements, the options for customizing the GUI as well as the preprocessing and postprocessing logic are nearly endless by way of the custom labeling workflow feature.

With this feature, instead of choosing a built-in task type, you define the preprocessing and postprocessing logic via your own Lambda functions. You also have full control over the labeling GUI using HTML elements and the Liquid-based template system. This enables you to do some really cool customization, including Angular framework integration. For more information, see Building a custom Angular application for labeling jobs with Amazon SageMaker Ground Truth.

For more details on custom workflows, see Creating Custom Labeling Workflows and Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth.

Built-in task types that implement custom GUIs and built-in processing logic

So far, I’ve discussed the built-in (100% out-of-the-box) option and the custom workflow (100% custom GUI and logic) option for running a job.

What if you want to implement a custom GUI, but use the built-in preprocessing and postprocessing logic that the built-in task types provide? That way, you can adjust the GUI just the way you want, while still relying on the latest AWS-based preprocessing and postprocessing logic (not to mention not having to maintain another codebase).

You can, and I’ll show you how, step-by-step.

Prerequisites

To complete this solution, you need to set up the following prerequisites:

Setting up an AWS account

In this post, you work directly with IAM, SageMaker, AWS KMS, and Amazon S3, so if you haven’t already, create an AWS account. Following along with this post incurs AWS usage charges, so be sure to shut down and delete resources when you’re finished.

Setting up the AWS CLI

Because we use some parameters not available (as of this writing) on the AWS Management Console, you need access to the AWS CLI. For more information, see Installing, updating, and uninstalling the AWS CLI.

All Ground Truth, Amazon S3, and Lambda configurations for this post must be set up within the same Region. This post assumes you’re operating all services out of the us-west-2 region. If you’re operating within another Region, be sure to modify your setup accordingly for a same-Region setup.

Setting up IAM permissions

If you created labeling jobs in the past with Ground Truth, you may already have the permissions needed to implement this solution. Those permissions include the following policies:

  • SageMakerFullAccess – To have access to the SageMaker GUI and S3 buckets to perform the steps outlined in this post, you need the SageMakerFullAccess policy applied to the user, group, or role assumed for this post.
  • AmazonSageMakerGroundTruthExecution – The Ground Truth labeling jobs you create in this post need to run with an execution role that has the AmazonSageMakerGroundTruthExecution policy attached.

If you have the permissions required to create these roles yourself, the SageMaker GUI walks you through a wizard to set them up. If you don’t have access to create these roles, ask your administrator to create them for you to use during job creation and management.

Setting up an S3 bucket

You need an S3 bucket in the us-west-2 Region to host the SageMaker manifest and categories files for the labeling job. By default, the SageMakerFullAccess and AmazonSageMakerGroundTruthExecution policies only grant access to S3 buckets containing sagemaker or groundtruth in their name (for example, buckets named my-awesome-bucket-sagemaker or marketing-groundtruth-datasets).

Be sure to name your buckets accordingly, or modify the policy accordingly to provide the appropriate access.

For more information on creating a bucket, see Step 1: Create an Amazon S3 Bucket. There is no need for public access to this bucket, so don’t grant it.

As mentioned earlier, all the Ground Truth, Amazon S3, and Lambda configurations for this solution must be in the same Region. For this post, we use us-west-2.

Setting up the Ground Truth work team

When you create a labeling job, you need to assign it to a predefined work team that works on it. If you haven’t created a work team already (or want to create a specific one just for this post), see Create and Manage Workforces.

Setting up AWS KMS

With security as job zero, make sure to encrypt the output manifest file that the labeling job creates. To do this, at job creation time, you need to reference a KMS key ID that is used to encrypt the output of the custom Ground Truth job in your S3 bucket.

By default, each account has an AWS managed key (aws/s3) created automatically. For this post, you can use the key ID of the AWS managed key, or you can create and use your own customer managed key.

For more information about creating and using keys with AWS KMS, see Getting started.

Estimated costs

Running this solution incurs costs for the following:

  • Ground Truth labeling – Labeling costs for each job are $0.56 when using your own private workforce (other workforce types, including Mechanical Turk, may have additional costs). For more information, see Amazon SageMaker Ground Truth pricing.
  • Amazon S3 storage, retrieval, and data transfer – These costs are less than $0.05 (this assumes you delete all files when you’re finished, and operate the solution for a day or less). For more information, see Amazon S3 pricing.
  • Key usage – The cost of an AWS managed KMS key is less than $0.02 for a day’s worth of usage. Storage and usage costs for a customer managed key may be higher. For more information, see AWS Key Management Service pricing.

Setting up the manifest, category, and GUI files

Now that you have met the prerequisites, you can create the manifest, categories, and GUI files.

Creating the files

We first create the dataset.manifest file, which we use as the input dataset for the labeling job.

Each object in dataset.manifest contains a line of text describing a person, animal, or plant. One or more of these lines of text are presented as tasks to your workers, who are responsible for correctly identifying which of the three classifications each line of text best fits.

For this post, dataset.manifest only has seven lines (workers can label up to seven objects), but this input dataset file could have up to 100,000 entries.

Create a file locally named dataset.manifest that contains the following text:

{"source":"His nose could detect over 1 trillion odors!"}
{"source":"Why do fish live in salt water? Because pepper makes them sneeze!"}
{"source":"What did the buffalo say to his son when he went away on a trip? Bison!"}
{"source":"Why do plants go to therapy? To get to the roots of their problems!"}
{"source":"What do you call a nervous tree? A sweaty palm!"}
{"source":"Some kids in my family really like birthday cakes and stars!"}
{"source":"A small portion of the human population carries a fabella bone."}

Next, we create the categories.json file. This file is used by Ground Truth to define the categories used to label the data objects.

Create a file locally named categories.json that contains the following code:

{
    "document-version": "2018-11-28",
    "labels": [{
            "label": "person"
        },
        {
            "label": "animal"
        },
        {
            "label": "plant"
        }
    ]
}

Finally, we create the worker_gui.html file. This file, when rendered, provides the GUI for the workers’ labeling tasks. The options are endless, but for this post, we create a custom GUI that adds the following custom features:

  • An additional Submit button that is styled larger than the default.
  • Shortcut keys for submitting and resetting the form.
  • JavaScript logic to programmatically modify a CSS style (break-all) on the task text output.

Make this custom GUI by creating a file locally named worker_gui.html containing the following code:

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-form>
  <crowd-classifier
    name="crowd-classifier"
    categories="{{ task.input.labels | to_json | escape }}"
    header="Please classify"
  >

    <classification-target>
      <strong>{{ task.input.taskObject }}</strong>
    </classification-target>
    <full-instructions header="Full Instructions">
      <div>
        <p>Based on the general subject or topic of each sentence presented, please classify it as only one of the following: person, animal, or plant. </p>
      </div>
    </full-instructions>

    <short-instructions>
      Complete tasks
    </short-instructions>
  </crowd-classifier>
</crowd-form>

<script>

  document.addEventListener('all-crowd-elements-ready', () => {
    // Creating new button to inject in label pane
    const button = document.createElement('button');
    button.textContent = 'Submit';
    button.classList.add('awsui-button', 'awsui-button-variant-primary', 'awsui-hover-child-icons');

    // Editing styling to make it larger
    button.style.height = '60px';
    button.style.width = '100px';
    button.style.margin = '15px';

    // Adding onclick for submission
    const crowdForm = document.querySelector('crowd-form');
    button.onclick = () => crowdForm.submit();

    // Injecting
    const crowdClassifier = document.querySelector('crowd-classifier').shadowRoot;
    const labelPane = crowdClassifier.querySelector('.category-picker-wrapper');
    labelPane.appendChild(button);

    // Adding Enter (submit) and 'r' (reset) hotkeys
    document.addEventListener('keydown', e => {
      if (e.key === 'Enter') {
        crowdForm.submit();
      }
      if (e.key === 'r') {
        crowdForm.reset();
      }

    })

    // Implement break-all style in the layout to handle long text tasks
    const annotationTarget = crowdClassifier.querySelector('.annotation-area.target');
    annotationTarget.style.wordBreak = 'break-all';
  });
</script>

Previewing the GUI in your web browser

While working on the worker_gui.html file, you may find it useful to preview what you’re building.

At any time, you can open the worker_gui.html file from your local file system in your browser for a limited preview of the GUI you’re creating. Some dynamic data, such as that provided by the Lambda preprocessing functions, may not be visible until you run the job from the job status preview page or worker portal.

To preview with real data, you can create a custom job with Lambda functions. For instructions, see Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth. You can preview live from the Ground Truth console’s Create labeling job flow.

For more information about the Liquid-based template system, see Step 2: Creating your custom labeling task template.

Uploading the files to Amazon S3

You can now upload all three files to the root directory of your S3 bucket. When uploading these files to Amazon S3, accept all defaults. For more information, see How Do I Upload Files and Folders to an S3 Bucket?
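
If you prefer the CLI to the console, uploading the three files to the root of your bucket might look like the following (replace the bucket name with your own):

aws s3 cp dataset.manifest s3://YOUR_BUCKET_NAME/
aws s3 cp categories.json s3://YOUR_BUCKET_NAME/
aws s3 cp worker_gui.html s3://YOUR_BUCKET_NAME/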

Creating the custom labeling job

After you upload the files to Amazon S3, you can create your labeling job. For some use cases, the SageMaker console provides the needed interface for creating both built-in and custom workflows. In our use case, we use the AWS CLI because it provides additional options not yet available (as of this writing) on the SageMaker console.

The following scripting instructions assume you’re on macOS or Linux. If you’re on Windows, you may need to modify the extension and contents of the script for it to work, depending on your environment.

Create a file called createCustom.sh (provide your bucket name, execution role ARN, KMS key ID, and work team ARN):

aws sagemaker create-labeling-job \
--labeling-job-name $1 \
--label-attribute-name "aws-blog-demo" \
--label-category-config-s3-uri "s3://YOUR_BUCKET_NAME/categories.json" \
--role-arn "YOUR_SAGEMAKER_GROUNDTRUTH_EXECUTION_ROLE_ARN" \
--input-config '{
  "DataSource": {
    "S3DataSource": {
      "ManifestS3Uri": "s3://YOUR_BUCKET_NAME/dataset.manifest"
    }
  }
}' \
--output-config '{
        "KmsKeyId": "YOUR_KMS_KEY_ID",
        "S3OutputPath": "s3://YOUR_BUCKET_NAME/output"
}' \
--human-task-config '{
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-west-2:081040173940:function:ACS-TextMultiClass"
        },
        "TaskAvailabilityLifetimeInSeconds": 21600,
        "TaskTimeLimitInSeconds": 3600,
        "NumberOfHumanWorkersPerDataObject": 1,
        "PreHumanTaskLambdaArn":  "arn:aws:lambda:us-west-2:081040173940:function:PRE-TextMultiClass",
        "WorkteamArn": "YOUR_WORKTEAM_ARN",
        "TaskDescription": "Select all labels that apply",
        "MaxConcurrentTaskCount": 1000,
        "TaskTitle": "Text classification task",
        "UiConfig": {
            "UiTemplateS3Uri": "s3://YOUR_BUCKET_NAME/worker_gui.html"
        }
    }'

Make sure to use your work team ARN, not your workforce ARN. For your KMS key, use the key ID of the AWS managed or customer managed key you want to encrypt the output with. For instructions on retrieving your key, see Finding the key ID and ARN. For more information about types of KMS keys, see Customer master keys (CMKs).

Make the file executable via the command chmod 700 createCustom.sh.

Almost done! But before we run the script, let’s step through what it’s doing in more detail. The script runs the aws sagemaker create-labeling-job CLI command with the following parameters:

  • --labeling-job-name – We set this value to $1, which translates to the argument we pass on the command line when we run it.
  • --label-attribute-name – The attribute name to use for the label in the output manifest file.
  • --label-category-config-s3-uri – The path to the categories.json file we previously uploaded to Amazon S3.
  • --role-arn – The ARN of the IAM role SageMaker runs the job under. If you aren’t sure what this value is, your administrator should be able to provide it to you.
  • --input-config – Points to the location of the input dataset manifest file.
  • --output-config – Points to a KMS key ID and the job’s output path.
  • --human-task-config – Provides the following parameters:
    • PreHumanTaskLambdaArn – The built-in AWS-provided Lambda function that performs the same preprocessing logic as that found in the built-in text classification job type. It handles reading the dataset manifest file in Amazon S3, parsing it, and providing the GUI with the appropriate task data.
    • AnnotationConsolidationLambdaArn – The built-in AWS-provided Lambda function that performs the same postprocessing logic as that found in the built-in text classification job type. It handles postprocessing of the data after each labeler submits an answer. As a reminder, all Ground Truth, Amazon S3, and Lambda configurations for this post must be set up within the same Region (for this post, us-west-2). For non us-west-2 Lambda ARN options, see create-labeling-job.
    • TaskAvailabilityLifetimeInSeconds – The length of time that a task remains available for labeling by human workers.
    • TaskTimeLimitInSeconds – The amount of time that a worker has to complete a task.
    • NumberOfHumanWorkersPerDataObject – The number of human workers that label an object.
    • WorkteamArn – The ARN of the work team assigned to complete the tasks. Make sure to use your work team ARN and not your workforce ARN in the script.
    • TaskDescription – A description of the task for your human workers.
    • MaxConcurrentTaskCount – Defines the maximum number of data objects that can be labeled by human workers at the same time.
    • TaskTitle – A title for the task for your human workers.
    • UiTemplateS3Uri – The S3 bucket location of the GUI template that we uploaded earlier. This is the HTML template used to render the worker GUI for labeling job tasks.

For more information about the options available when creating a labeling job from the AWS CLI, see create-labeling-job.

Running the job

Now that you’ve created the script with all the proper parameters, it’s time to run it! To run the script, enter ./createCustom.sh JOBNAME from the command line, providing a unique name for the job.

In my example, I named the job gec-custom-template-300, and my command line looked like the following:

gcohen $: ./createCustom.sh gec-custom-template-300

{
"LabelingJobArn": "arn:aws:sagemaker:us-west-2:xxyyzz:labeling-job/gec-custom-template-300"
}
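
You can also check on the job from the CLI. For example, using the job name above:

# Returns InProgress, Completed, Failed, or Stopped
aws sagemaker describe-labeling-job --labeling-job-name gec-custom-template-300 \
    --query "LabelingJobStatus"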

Checking the job status and previewing the GUI

Now that we’ve submitted the job, we can easily check its status on the console.

  1. On the SageMaker console, under Ground Truth, choose Labeling jobs.

You should see the job we just submitted.

  2. Choose the job to get more details.

  3. Choose View labeling tool to preview what our labeling workers see when they take the job.

In addition, by using AWS KMS encryption, you can specify authorized users who can decrypt the output manifest file. Who exactly is authorized to decrypt this file varies depending on whether the key is customer managed or AWS managed. For specifics on access permissions for a given key, review the key’s key policy.
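
One way to review a key’s key policy from the CLI (a minimal sketch; substitute your own key ID) is:

aws kms get-key-policy --key-id YOUR_KMS_KEY_ID --policy-name default --output text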

Conclusion

In this post, I demonstrated how to implement a custom labeling GUI with built-in preprocessing and postprocessing logic by way of a custom workflow. I also demonstrated how to encrypt the output with AWS KMS. The prerequisites, code, and estimated costs of running it all were also provided.

The code was provided to get you running quickly, but don’t stop there! Try experimenting by adding additional functionality to your workers’ labeling GUIs, either with your own custom libraries or third-party logic. If you get stuck, don’t hesitate to reach out directly, or post an issue on our GitHub repo issues page.


About the Author

Geremy Cohen is a Solutions Architect with AWS where he helps customers build cutting-edge, cloud-based solutions. In his spare time, he enjoys short walks on the beach, exploring the bay area with his family, fixing things around the house, breaking things around the house, and BBQing.

Read More

Building a secure search application with access controls using Amazon Kendra

For many enterprises, critical business information is often stored as unstructured data scattered across multiple content repositories. Not only is it challenging for organizations to make this information available to employees when they need it, but it’s also difficult to do so securely so relevant information is available to the right employees or employee groups.

Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra delivers secure search for enterprise applications and can make sure the results of a user’s search query only include documents the user is authorized to read. In this post, we illustrate how to build an Amazon Kendra-powered search application supporting access controls that reflect the security model of an example organization.

Amazon Kendra supports search filtering based on user access tokens that are provided by your search application, as well as document access control lists (ACLs) collected by the Amazon Kendra connectors. When user access tokens are applied, search results return links to the original document repositories and include a short description. Access control to the full document is still enforced by the original repository.

In this post, we demonstrate token-based user access control in Amazon Kendra with Open ID. We use Amazon Cognito user pools to authenticate users and provide Open ID tokens. You can use a similar approach with other Open ID providers.

Application overview

This application is designed for guests and registered users to make search queries to a document repository, and results are returned only from those documents that are authorized for access by the user. Users are grouped based on their roles, and access control is at a group level. The following table outlines which documents each user is authorized to access for our use case. The documents being used in this example are a subset of AWS public documents.

User | Role | Group | Document Type Authorized for Access
Guest | – | – | Blogs
Patricia | IT Architect | Customer | Blogs, user guides
James | Sales Rep | Sales | Blogs, user guides, case studies
John | Marketing Exec | Marketing | Blogs, user guides, case studies, analyst reports
Mary | Solutions Architect | Solutions Architect | Blogs, user guides, case studies, analyst reports, whitepapers

Architecture

The following diagram illustrates our solution architecture.

The documents being queried are stored in an Amazon Simple Storage Service (Amazon S3) bucket. Each document type has a separate folder: blogs, case-studies, analyst-reports, user-guides, and white-papers. This folder structure is contained in a folder named Data. Metadata files including the ACLs are included in a folder named Meta.

We use the Amazon Kendra S3 connector to configure this S3 bucket as the data source. When the data source is synced with the Amazon Kendra index, it crawls and indexes all documents as well as collects the ACLs and document attributes from the metadata files. For this example, we use a custom attribute DocumentType to denote the type of the document.

We use an Amazon Cognito user pool to authenticate registered users, and use an identity pool to authorize the application to use Amazon Kendra and Amazon S3. The user pool is configured as an Open ID provider in the Amazon Kendra index by configuring the signing URL of the user pool.

When a registered user authenticates and logs in to the application to perform a query, the application sends the user’s access token provided by the user pool to the Amazon Kendra index as a parameter in the query API call. For guest users, there is no authentication and therefore no access token is sent as a parameter to the query API. The results of a query API call without the access token parameter only return the documents without access control restrictions.

When an Amazon Kendra index receives a query API call with a user access token, it validates and decodes the access token using the user pool signing URL and gets parameters such as cognito:username and cognito:groups associated with the user. The Amazon Kendra index filters the search results based on the stored ACLs and the information received in the user access token. These filtered results are returned in response to the query API call made by the application.
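
Outside of the application, you can observe the same behavior from the CLI. The following is a minimal sketch; the index ID and token variables are placeholders, and passing the token via the --user-context shorthand assumes the token-based access control configuration described in this post:

# Query without a token: only unrestricted documents (blogs) are returned
aws kendra query --index-id $INDEX_ID --query-text "what is serverless?"

# Query with a user's token: results are filtered against the stored ACLs
aws kendra query --index-id $INDEX_ID --query-text "what is serverless?" \
    --user-context Token=$USER_TOKEN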

The application, which the users can download with its source, is written in ReactJS using components from the AWS Amplify framework. We use the AWS Amplify console to implement the continuous integration and continuous deployment pipelines. We use an AWS CloudFormation template to deploy the AWS infrastructure, which includes the Amazon Kendra index (AuthKendraIndex) and its S3 data source, the Amazon Cognito user pool and identity pool, the IAM roles these services need, and the AWS Amplify app (AWSKendraAuthApp) backed by an AWS CodeCommit repository.

In this post, we provide a step-by-step walkthrough to configure the backend infrastructure, build and deploy the application code, and use the application.

Prerequisites

To complete the steps in this post, make sure you have an AWS account and either the AWS CLI installed and configured or access to AWS CloudShell.

Preparing your S3 bucket as a data source

To prepare an S3 bucket as a data source, create an S3 bucket. In the terminal with the AWS CLI or AWS CloudShell, run the following commands to upload the documents and the metadata to the data source bucket:

aws s3 cp s3://aws-ml-blog/artifacts/building-a-secure-search-application-with-access-controls-kendra/docs.zip .
unzip docs.zip
aws s3 cp Data/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Data/ --recursive
aws s3 cp Meta/ s3://<REPLACE-WITH-NAME-OF-S3-BUCKET>/Meta/ --recursive

Deploying the infrastructure as a CloudFormation stack

In a separate browser tab, open the AWS Management Console and make sure you’re logged in to your AWS account. Choose the button below to launch the CloudFormation stack that deploys the infrastructure.

You should see a page similar to the image below:

For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.

Stack creation can take 30–45 minutes to complete. While you wait, you can look at the different tabs, such as Events, Resources, and Template. You can monitor the stack creation status on the Stack info tab.

When stack creation is complete, keep the Outputs tab open. We need values from the Outputs and Resources tabs in subsequent steps.
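
You can also read the same output values from the CLI once the stack is created; the stack name below is a placeholder for whatever you named your stack:

aws cloudformation describe-stacks --stack-name YOUR_STACK_NAME --query "Stacks[0].Outputs"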

Reviewing Amazon Kendra configuration and starting the data source sync

In the following steps, we configure Amazon Kendra to enable secure token access and start the data source sync to begin crawling and indexing documents.

  1. On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.

Under User access control, token-based user access control is enabled, the signing key object is set to the Open ID provider URL of the Amazon Cognito user pool, and the user name and group are set to cognito:username and cognito:groups, respectively.

  2. In the navigation pane, choose Data sources.
  3. On the Settings tab, you can see the data source bucket being configured.
  4. Select the radio button for the data source and choose Sync now.

The data source sync can take 10–15 minutes to complete, but you don’t have to wait to move to the next step.
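
If you want to check on the sync from the CLI instead of the console, a minimal sketch is the following; the index and data source IDs are placeholders you can copy from the console or the stack resources:

# Shows the status of the most recent sync job for the data source
aws kendra list-data-source-sync-jobs --index-id YOUR_INDEX_ID --id YOUR_DATA_SOURCE_ID \
    --query "History[0].Status"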

Creating users and groups in the Amazon Cognito user pool

In the terminal with the AWS CLI or AWS CloudShell, run the following commands to create users and groups in the Amazon Cognito user pool to use for our application. You need to copy the contents of the Physical ID column in the UserPool row from the Resources tab of the CloudFormation stack. This is the user pool ID to use in the following steps. We set AmazonKendra@2020 as the temporary password for all the users. This password is required when logging in for the first time, and Amazon Cognito enforces a password reset.

USER_POOL_ID=<PASTE-USER-POOL-ID-HERE>
aws cognito-idp create-group --group-name customer --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-Sales --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-Marketing --user-pool-id ${USER_POOL_ID}
aws cognito-idp create-group --group-name AWS-SA --user-pool-id ${USER_POOL_ID}
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username patricia --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username james  --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username john  --temporary-password AmazonKendra@2020
aws cognito-idp admin-create-user --user-pool-id ${USER_POOL_ID} --username mary  --temporary-password AmazonKendra@2020
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username patricia --group-name customer
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username james --group-name AWS-Sales
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username john --group-name AWS-Marketing
aws cognito-idp admin-add-user-to-group --user-pool-id ${USER_POOL_ID} --username mary --group-name AWS-SA

Building and deploying the app

Now we build and deploy the app using the following steps:

  1. On the AWS Amplify console, choose the app AWSKendraAuthApp.
  2. Choose Run build.

You can monitor the build progress on the console.

Let the build continue and complete the steps: Provision, Build, Deploy, and Verify. After this, the application is deployed and ready to use.

You can browse through the source code by opening up the CodeCommit repository. The important file to look at is src/App.tsx.

  3. Choose the link on the left to start the application in a new browser tab.

Trial run

We can now take a trial run of our app.

  1. On the login page, sign in with the username patricia and the temporary password AmazonKendra@2020.

Amazon Cognito requires you to reset your password the first time you log in. After you log in, you can see the search field.

  2. In the search field, enter a query, such as what is serverless?
  3. Expand Filter search results to see different document types.

You can select different document types to filter the search results.

  4. Sign out and repeat this process for other users that are created in the Cognito user pool, namely, james, john, and mary.

You can also choose Continue as Guest to use the app without authenticating. However, this option only shows results from blogs.

You can return to the login screen by choosing Welcome Guest! Click here to sign up or sign in.

Using the application

You can use the application we developed by making a few search queries logged in as different users. To experience how access control works, issue the same query from different user accounts and observe the difference in the search results. The following users get results from different sources:

  • Guests and anonymous users – Only blogs
  • Patricia (Customer) – Blogs and user guides
  • James (Sales) – Blogs, user guides, and case studies
  • John (Marketing) – Blogs, user guides, case studies, and analyst reports
  • Mary (Solutions Architect) – Blogs, user guides, case studies, analyst reports, and whitepapers

We can make additional queries and observe the results. Some suggested queries include “What is machine learning?”, “What is serverless?”, and “Databases”.

Cleaning up

To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.

When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking on the respective management consoles for Amazon Kendra, AWS Amplify, and the Amazon Cognito user pool and identity pool.

You must delete your data source bucket separately, because it was not created as part of the CloudFormation stack.
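
For example, you can empty and delete the data source bucket from the CLI. This is destructive, so double-check the bucket name first:

# --force empties the bucket before removing it
aws s3 rb s3://<REPLACE-WITH-NAME-OF-S3-BUCKET> --force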

Conclusion

In this post, we demonstrated how you can create a secure search application using Amazon Kendra. Organizations that use an Open ID-compliant identity management system with a new or pre-existing Amazon Kendra index can now enable secure token access to make sure their intelligent search applications are aligned with their organizational security model. For more information about access control in Amazon Kendra, see Controlling access to documents in an index.


About the Author

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.

Read More