Image captioning is the machine learning task of automatically generating a fluent natural language description for a given image. This task is important for improving accessibility for visually impaired users and is a core task in multimodal research encompassing both vision and language modeling.
However, datasets for image captioning are primarily available in English. Beyond that, there are only a few datasets covering a limited number of languages that represent just a small fraction of the world’s population. Further, these datasets feature images that severely under-represent the richness and diversity of cultures from across the globe. These aspects have hindered research on image captioning for a wide variety of languages, and directly hamper the deployment of accessibility solutions for a large potential audience around the world.
Today we present and make publicly available the Crossmodal 3600 (XM3600) image captioning evaluation dataset as a robust benchmark for multilingual image captioning that enables researchers to reliably compare research contributions in this emerging field. XM3600 provides 261,375 human-generated reference captions in 36 languages for a geographically diverse set of 3600 images. We show that the captions are of high quality and the style is consistent across languages.
|The Crossmodal 3600 dataset includes reference captions in 36 languages for each of a geographically diverse set of 3600 images. All images used with permission under the CC-BY 2.0 license.|
Overview of the Crossmodal 3600 Dataset
Creating large training and evaluation datasets in multiple languages is a resource-intensive endeavor. Recent work has shown that it is feasible to build multilingual image captioning models trained on machine-translated data with English captions as the starting point. However, some of the most reliable automatic metrics for image captioning are much less effective when applied to evaluation sets with translated image captions, resulting in poorer agreement with human evaluations compared to the English case. As such, trustworthy model evaluation at present can only be based on extensive human evaluation. Unfortunately, such evaluations usually cannot be replicated across different research efforts, and therefore do not offer a fast and reliable mechanism to automatically evaluate multiple model parameters and configurations (e.g., model hill climbing) or to compare multiple lines of research.
XM3600 provides 261,375 human-generated reference captions in 36 languages for a geographically diverse set of 3600 images from the Open Images dataset. We measure the quality of generated captions by comparing them to the manually provided captions using the CIDEr metric, which ranges from 0 (unrelated to the reference captions) to 10 (perfectly matching the reference captions). When comparing pairs of models, we observed strong correlations between the differences in the CIDEr scores of the model outputs, and side-by-side human evaluations comparing the model outputs. , making XM3600 is a reliable tool for high-quality automatic comparisons between image captioning models on a wide variety of languages beyond English.
We chose 30 languages beyond English, roughly based on their percentage of web content. In addition, we chose an additional five languages that include under-resourced languages that have many native speakers or major native languages from continents that would not be covered otherwise. Finally, we also included English as a baseline, thus resulting in a total of 36 languages, as listed in the table below.
|List of languages used in XM3600. *Low-resource languages with many native speakers, or major native languages from continents that would not be covered otherwise.|
The images were selected from among those in the Open Images dataset that have location metadata. Since there are many regions where more than one language is spoken, and some areas are not well covered by these images, we designed an algorithm to maximize the correspondence between selected images and the regions where the targeted languages are spoken. The algorithm starts with the selection of images with geo-data corresponding to the languages for which we have the smallest pool (e.g., Persian) and processes them in increasing order of their candidate image pool size. If there aren’t enough images in an area where a language is spoken, then we gradually expand the geographic selection radius to: (i) a country where the language is spoken; (ii) a continent where the language is spoken; and, as last resort, (iii) from anywhere in the world. This strategy succeeded in providing our target number of 100 images from an appropriate region for most of the 36 languages, except for Persian (where 14 continent-level images are used) and Hindi (where all 100 images are at the global level, because the in-region images were assigned to Bengali and Telugu).
Photo by Chris Sampson
Photo by Henrik Palm
Photo by rojypala
Photo by McKay Savage
Photo by Simon Schoeters
Photo by Stefan Krasowski
|Sample images showcasing the geographical diversity of the annotated images. Images used under CC BY 2.0 license.|
In total, all 3600 images (100 images per language) are annotated in all 36 languages, each with an average of two annotations per language, yielding a total of 261,375 captions.
Annotators work in batches of 15 images. The first screen shows all 15 images with their captions in English as generated by a captioning model trained to output a consistent style of the form “<main salient objects> doing <activities> in the <environment>”, often with object attributes, such as a “smiling” person, “red” car, etc. The annotators are asked to rate the caption quality given guidelines for a 4-point scale from “excellent” to “bad”, plus an option for “not_enough_information”. This step forces the annotators to carefully assess caption quality and it primes them to internalize the style of the captions. The following screens show the images again but individually and without the English captions, and the annotators are asked to produce descriptive captions in the target language for each image.
The image batch size of 15 was chosen so that the annotators would internalize the style without remembering the exact captions. Thus, we expect the raters to generate captions based on the image content only and lacking translation artifacts. For example in the example shown below, the Spanish caption mentions “number 42” and the Thai caption mentions “convertibles”, none of which are mentioned in the English captions. The annotators were also provided with a protocol to use when creating the captions, thus achieving style consistency across languages.
Photo by Brian Solis
|English||• A vintage sports car in a showroom with many other vintage sports cars|
|• The branded classic cars in a row at display|
|Spanish||• Automóvil clásico deportivo en exhibición de automóviles de galería — (Classic sports car in gallery car show)|
|• Coche pequeño de carreras color plateado con el número 42 en una exhibición de coches — (Small silver racing car with the number 42 at a car show)|
|Thai||• รถเปิดประทุนหลายสีจอดเรียงกันในที่จัดแสดง — (Multicolored convertibles line up in the exhibit)|
|• รถแข่งวินเทจจอดเรียงกันหลายคันในงานจัดแสดง — (Several vintage racing cars line up at the show.)|
|Sample captions in three different languages (out of 36 — see full list of captions in Appendix A of the Crossmodal-3600 paper), showcasing the creation of annotations that are consistent in style across languages, while being free of direct-translation artifacts (e.g., the Spanish “number 42” or the Thai “convertibles” would not be possible when directly translating from the English versions). Image used under CC BY 2.0 license.|
Caption Quality and Statistics
We ran two to five pilot studies per language to troubleshoot the caption generation process and to ensure high quality captions. We then manually evaluated a random subset of captions. First we randomly selected a sample of 600 images. Then, to measure the quality of captions in a particular language, for each image, we selected for evaluation one of the manually generated captions. We found that:
- For 25 out of 36 languages, the percentage of captions rated as “Good” or “Excellent” is above 90%, and the rest are all above 70%.
- For 26 out of 36 languages, the percentage of captions rated as “Bad” is below 2%, and the rest are all below 5%.
For languages that use spaces to separate words, the number of words per caption can be as low as 5 or 6 for some agglutinative languages like Cusco Quechua and Czech, and as high as 18 for an analytic language like Vietnamese. The number of characters per caption also varies drastically — from mid-20s for Korean to mid-90s for Indonesian — depending on the alphabet and the script of the language.
Empirical Evaluation and Results
We empirically measured the ability of the XM3600 annotations to rank image captioning model variations by training four variations of a multilingual image captioning model and comparing the CIDEr differences of the models’ outputs over the XM3600 dataset for 30+ languages, to side-by-side human evaluations. We observed strong correlations between the CIDEr differences and the human evaluations. These results support the use of the XM3600 references as a means to achieve high-quality automatic comparisons between image captioning models on a wide variety of languages beyond English.
Recently PaLI used XM3600 to evaluate model performance beyond English for image captioning, image-to-text retrieval and text-to-image retrieval. The key takeaways they found when evaluating on XM3600 were that multilingual captioning greatly benefits from scaling the PaLI models, especially for low-resource languages.
We would like to acknowledge the coauthors of this work: Xi Chen and Radu Soricut.