Large-language models for automatic cloud incident management

This research was accepted by the IEEE/ACM International Conference on Software Engineering (ICSE), which is a forum for researchers, practitioners, and educators to gather, present, and discuss the most recent innovations, trends, experiences, and issues in the field of software engineering.

The Microsoft 365 Systems Innovation research group has a paper accepted at the 45th International Conference on Software Engineering (ICSE), widely recognized as one of the most prestigious research conferences on software engineering. This paper, Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models, focuses on using state-of-the-art large language models (LLMs) to help generate recommendations for cloud incident root-cause analysis and mitigation plans. With a rigorous study of real production incidents, analysis of several LLMs in different settings using semantic and lexical metrics, and human evaluation, the research shows the efficacy and future potential of using AI to resolve cloud incidents.

Challenges of building reliable cloud services

Building highly reliable hyperscale cloud services such as Microsoft 365 (M365), which supports the productivity of hundreds of thousands of organizations, is very challenging. Among the challenges are quickly detecting incidents, performing root-cause analysis, and mitigating them.

Our recent research starts with understanding the fundamentals of production incidents: we analyze the life cycle of incidents, then determine the common root causes, mitigations, and engineering efforts for resolution. In a previous paper, How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service, which won a Best Paper award at SoCC’22, we provide a comprehensive, multi-dimensional empirical study of production incidents from Microsoft Teams. From this study, we envision that automation should support incident diagnosis and help identify the root cause and mitigation steps to quickly resolve an incident and minimize customer impact. We should also leverage past lessons to build resilience for future incidents. We posit that adopting AIOps and using state-of-the-art AI/ML technologies can help achieve both goals, as we show in the ICSE paper.

Adapting large-language models for automated incident management

Recent breakthroughs in AI have enabled LLMs to develop a rich understanding of natural language. They can understand and reason over large volumes of data and complete a diverse set of tasks, such as code completion, translation, and Q&A. Given the complexities of incident management, we sought to evaluate the effectiveness of LLMs in analyzing the root cause of production incidents and generating mitigation steps.

Figure 1: Leveraging GPT-3.x for root cause analysis and mitigation. The title and summary of an incident are given as input to GPT-3.x models, which generate root-cause and mitigation recommendations.

In our recently published ICSE paper, we demonstrated the usefulness of LLMs for production incident diagnosis for the first time. When an incident ticket is created, the author gives it a title and describes relevant details, such as error messages and anomalous behavior, that might help with resolution. We used the title and the summary of a given incident as the input for LLMs and generated root cause and mitigation steps, as shown in Figure 1.
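
Concretely, the model input is just the concatenated incident title and summary. Below is a minimal sketch of how such a prompt might be assembled and sent to a fine-tuned completion model; the model name, prompt format, and sampling settings are illustrative assumptions, not the exact setup from the paper.

```python
# Minimal sketch: generate root-cause candidates from an incident ticket.
# The model name and prompt format are assumptions for illustration only.
import openai

def build_prompt(title: str, summary: str) -> str:
    # The paper uses the incident title and summary as the model input.
    return f"Title: {title}\nSummary: {summary}\nRoot cause:"

def recommend_root_causes(title: str, summary: str) -> list[str]:
    response = openai.Completion.create(
        model="curie-finetuned-incidents",  # hypothetical fine-tuned model
        prompt=build_prompt(title, summary),
        max_tokens=128,
        temperature=0.7,
        n=5,  # sample several candidates, mirroring the Top-5 evaluation
    )
    return [choice.text.strip() for choice in response.choices]
```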

We conducted a rigorous study of more than 40,000 incidents generated from more than 1,000 services and compared several LLMs in zero-shot, fine-tuned, and multi-task settings. We found that fine-tuning the GPT-3 and GPT-3.5 models significantly improves their effectiveness on incident data.

Effectiveness of GPT-3.x models at finding root causes

| Model | BLEU-4 Top1 | BLEU-4 Top5 | ROUGE-L Top1 | ROUGE-L Top5 | METEOR Top1 | METEOR Top5 | BERTScore Top1 | BERTScore Top5 | BLEURT Top1 | BLEURT Top5 | NUBIA Top1 | NUBIA Top5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RoBERTa | 4.21 | NA | 12.83 | NA | 9.89 | NA | 85.38 | NA | 35.66 | NA | 33.94 | NA |
| CodeBERT | 3.38 | NA | 10.17 | NA | 6.58 | NA | 84.88 | NA | 33.19 | NA | 39.05 | NA |
| Curie | 3.40 | 6.29 | 9.04 | 15.44 | 7.21 | 13.65 | 84.90 | 86.36 | 32.62 | 40.08 | 33.52 | 49.76 |
| Codex | 3.44 | 6.25 | 8.98 | 15.51 | 7.33 | 13.82 | 84.85 | 86.33 | 32.50 | 40.11 | 33.64 | 49.77 |
| Davinci | 3.34 | 5.94 | 8.53 | 15.10 | 6.67 | 12.95 | 83.13 | 84.41 | 31.06 | 38.61 | 35.28 | 50.79 |
| Davinci-002 | 4.24 | 7.15 | 11.43 | 17.20 | 10.42 | 16.80 | 85.42 | 86.78 | 36.77 | 42.87 | 32.30 | 51.34 |
| %gain for Davinci-002 (over best GPT-3 model) | 23.26 | 13.67 | 26.44 | 10.90 | 42.16 | 21.56 | 0.61 | 0.49 | 12.72 | 6.88 | -8.45 | 1.08 |

Table 1: Lexical and semantic performance of different LLMs

In our offline evaluation, we compared the performance of GPT-3.5 against three GPT-3 models by computing several semantic and lexical metrics (which measure text similarity) between the generated recommendations and the ground-truth root cause or mitigation steps recorded in the incident management (IcM) portal. The average gains for GPT-3.5 across the different tasks were as follows (a minimal scoring example follows the list):

  1. For root cause and mitigation recommendation tasks, Davinci-002 (GPT-3.5) provided at least 15.38% and 11.9% gains over all the GPT-3 models, respectively, as shown in Table 1.
  2. When we generated mitigation plans by adding root cause as input to the model, GPT-3.5 model provided at least an 11.16% gain over the GPT-3 models.
  3. LLMs performed better on machine reported incidents (MRIs) as opposed to customer reported incidents (CRIs), due to the repetitive nature of the MRIs.
  4. Fine-tuning LLMs with incident data improved performance significantly. A fine-tuned GPT-3.5 model improved the average lexical similarity score by 45.5% for root cause generation and 131.3% for mitigation generation over the zero-shot setting (i.e., inference directly on a pretrained GPT-3 or GPT-3.5 model).
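
As a concrete illustration of the lexical scoring above, the sketch below compares one generated recommendation against a ground-truth string using BLEU-4 and ROUGE-L, via the nltk and rouge-score packages. The example strings are invented for illustration.

```python
# Sketch: scoring a generated recommendation against the ground truth.
# Uses nltk (BLEU-4) and rouge-score (ROUGE-L); example strings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "certificate expired on the frontend load balancer"
candidate = "expired TLS certificate on the load balancer frontend"

bleu4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rougeL:.3f}")
```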

Looking through the incident owners’ eyes

In addition to analysis with semantic and lexical metrics, we also interviewed the incident owners to evaluate the effectiveness of the generated recommendations. Overall, GPT-3.5 outperforms GPT-3 in a majority of the metrics. More than 70% of on-call engineers gave a rating of 3 out of 5 or better for the usefulness of recommendations in a real-time production setting.

Looking forward

With future versions of LLMs, we expect performance on automatic incident resolution to improve further, and the need for fine-tuning may decrease. Yet we are at an early stage, with many open research questions in this field. For instance, how can we incorporate additional context about the incident, such as discussion entries, logs, service metrics, and even dependency graphs of the impacted services, to improve diagnosis? Another challenge is staleness: models need to be retrained frequently with the latest incident data. To address these challenges, we are leveraging the latest LLMs combined with retrieval-augmented approaches to improve incident diagnosis via a conversational interface, as shown in Figure 2.

Figure 2: Workflow of retrieval-augmented root cause analysis. A retriever searches a corpus of historical incidents, troubleshooting guides, and an engineering hub, and the retrieved context is added to the LLM prompt.
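
A minimal sketch of the retrieval-augmented flow in Figure 2: embed the new incident, retrieve the most similar historical entries, and prepend them to the prompt. The embedding function, in-memory corpus, and helper names are illustrative assumptions, not our production system.

```python
# Sketch of the retrieval-augmented flow in Figure 2 (names are illustrative).
import numpy as np

def retrieve_context(incident_text, corpus_texts, corpus_vecs, embed, k=3):
    """Return the k most similar historical entries (incidents, guides, etc.)."""
    q = embed(incident_text)  # hypothetical text-embedding function
    sims = corpus_vecs @ q / (np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(q))
    return [corpus_texts[i] for i in np.argsort(-sims)[:k]]

def build_rag_prompt(title, summary, context_docs):
    context = "\n---\n".join(context_docs)
    return (f"Relevant past incidents and guides:\n{context}\n\n"
            f"Title: {title}\nSummary: {summary}\nRoot cause:")
```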

Moreover, ChatGPT can be actively integrated into the “discussion” of the incident diagnosis. By collecting evidence from available documents and logs, the model can generate coherent, contextual, natural-sounding responses to inquiries and offer corresponding suggestions, thereby facilitating the discussion and accelerating incident resolution. We believe this could deliver a step-function improvement in the overall incident management process, with contextual and meaningful root-cause analysis and mitigation, reducing the significant human effort required and bolstering reliability and customer satisfaction.

Acknowledgement

This post includes contributions from Toufique Ahmed during his internship at Microsoft.

Mammoth Mission: How Colossal Biosciences Aims to ‘De-Extinct’ the Woolly Mammoth

Ten thousand years after the last woolly mammoths vanished with the last Ice Age, a team of computational biologists is on a mission to bring them back within five years.

Led by synthetic biology pioneer George Church, Colossal Biosciences is also seeking to return the dodo bird and Tasmanian tiger, as well as help save current-day endangered species.

“The woolly mammoth is a very iconic species to bring back,” said Eriona Hysolli, head of biological sciences at Colossal Biosciences, which is based in Austin, Texas. “In addition, we see that pipeline as a proxy for conservation, given that elephants are endangered and much of this work directly benefits them.”

There’s plenty of work to be done on endangered species, as well.

Critically endangered, the African forest elephant has declined by nearly 90% in the past three decades, according to Colossal. Poaching took more than 100,000 African elephants between 2010 and 2012 alone, according to the company.

“We might lose these elephant species in our lifetime if their numbers continue to dwindle,” said Hysolli.

Humans caused the extinction of many species, but computational biologists are now trying to bring them back with CRISPR and other gene-editing technologies, leaps in AI, and bioinformatics tools and technology, such as the NVIDIA Parabricks software suite for genomic analysis.

To bring back a woolly mammoth, scientists at Colossal start by sequencing mammoth and elephant genomes and identifying what makes them similar and different. They then edit Asian elephant cells to carry the mammoth changes responsible for cold-adaptation traits, transferring the nuclei of the edited cells into enucleated elephant eggs before implanting them into a healthy Asian elephant surrogate.

Tech Advances Drive Genomics Leaps 

It took enormous effort over two decades, not to mention $3 billion in funding, to first sequence the human genome. But that’s now been reduced to mere hours and under $200 per whole genome, thanks to the transformative impact of AI and accelerated computing.

It’s a story well known to Colossal co-founder Church. The Harvard Medical School professor and co-founder of roughly 50 biotech startups has been at the forefront of genetics research for decades.

“There’s been about a 20 millionfold reduction in price, and a similar improvement in quality in a little over a decade, or a decade and a half,” Church said in a recent interview on the TWiT podcast.

Research to Complete Reference Genome Puzzle

Colossal’s work to build a reference genome of the woolly mammoth is similar to trying to complete a puzzle.

DNA sequences from bone samples are assembled in silico. But degradation of the DNA over time means that not all the pieces are there. The gaps can be filled with guidance from the genome of an Asian elephant, the mammoth’s closest living relative.

Once a rough representative genome sequence is configured, secondary analysis takes place, which is where GPU acceleration with Parabricks comes in.

The suite of bioinformatic tools in Parabricks can provide more than 100x acceleration of industry-standard tools used for alignment and variant calling. In the alignment step, the short fragments, or reads, from the sequenced sample are aligned in the correct order, using the reference genome, which in this case is the genome of the Asian elephant. Then, in the variant-calling step, Parabricks tools identify the variants, or differences, between the sequenced whole genome mammoth samples and the Asian elephant reference.
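
As a rough illustration of these two steps, the sketch below drives the Parabricks fq2bam (alignment) and deepvariant (variant-calling) tools from Python. The file paths are placeholders, and the exact tools and flags used in this project are not public; this is only a generic Parabricks workflow.

```python
# Sketch: GPU-accelerated alignment and variant calling with Parabricks.
# Paths are placeholders; the project's actual configuration is not public.
import subprocess

REF = "asian_elephant_reference.fa"  # reference genome (placeholder path)

def align_and_call(sample: str) -> None:
    # Alignment: map sequenced mammoth reads onto the elephant reference.
    subprocess.run([
        "pbrun", "fq2bam",
        "--ref", REF,
        "--in-fq", f"{sample}_R1.fq.gz", f"{sample}_R2.fq.gz",
        "--out-bam", f"{sample}.bam",
    ], check=True)
    # Variant calling: identify differences between sample and reference.
    subprocess.run([
        "pbrun", "deepvariant",
        "--ref", REF,
        "--in-bam", f"{sample}.bam",
        "--out-variants", f"{sample}.vcf",
    ], check=True)
```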

In September, Colossal Biosciences spun out Form Bio, which offers a breakthrough computational life sciences platform, to aid its efforts and commercialize scientific innovations. Form Bio is a member of NVIDIA Inception, a program that provides companies with technology support and AI platforms guidance.

Parabricks includes some of the same tools as the open-source ones that Form Bio was using, making it easy to replace them with NVIDIA GPU-accelerated versions of those tools, said Brandi Cantarel, vice president of bioinformatics at Form Bio.

Compared with the open-source software on CPUs, Parabricks running on GPUs enables Colossal to complete their end-to-end sequence analysis 12x faster and at one-quarter the cost, accelerating the research.

“We’re getting very comparable or exactly the same outputs, and it was faster and cheaper,” said Cantarel.

Analysis Targeting Cold Tolerance for Woolly Mammoth 

A lot is at stake in the sequencing and analysis.

The Form Bio platform hosts tools that can assess whether researchers make the right CRISPR edits and assist in analysis for whether cells are edited.

“Can we identify what are the targets that we need to actually go after and edit and engineer? The answer is absolutely yes, and we’ve gotten very good at selecting impactful genetic differences,” said Hysolli.

Another factor to consider is human contamination to samples. So for each sample researchers examine, they must do analysis against human cell references to discard those contaminants.

Scientists have gathered multiple specimens of woolly mammoths over the years, and the best are tooth or bone samples found in permafrost. “We benefit from the fact that woolly mammoths were well-preserved because they lived in an Arctic environment,” said Hysolli.

An Asian elephant is 99.6% the same as a mammoth genetically, according to Ben Lamm, Colossal CEO and co-founder.

“We’re just targeting about 65 genes that represent the cold tolerance, the core phenotypes that we’re looking for,” he recently said on stage at South by Southwest in Austin.

Benefits to Biodiversity, Conservation and Humanity

Colossal aims to create reference genomes for species, like the mammoth, that represent broad population samples. They’re looking at mammoths from different regions of the globe and periods in time. And it’s necessary to parse the biodiversity and do more sequencing, according to researchers at the company.

“As we lose biodiversity, it’s important to bring back or restore species and their ecosystems, which in turn positively impacts ecology and supports conservation,” said Hysolli.

Population genetics is important. Researchers need to understand how different and similar these animals are to each other so that in the future they can create thriving populations, she said.

That ensures better chances of survival. “We need to make sure — that’s what makes a thriving population when you rewild,” said Hysolli, referring to when the team introduces the species back into an Arctic habitat.

It’s also been discovered that elephants are more resistant to cancer — so researchers are looking at the genetic factors and how that might translate for humans.

“This work does not only benefit Colossal’s de-extinction efforts and conservation, but these technologies we build can be applied to bettering human health and treating diseases,” said Hysolli.

Learn more about NVIDIA Parabricks for accelerated genomic sequencing analysis.

Chip Manufacturing ‘Ideal Application’ for AI, NVIDIA CEO Says

Chip manufacturing is an “ideal application” for NVIDIA accelerated and AI computing, NVIDIA founder and CEO Jensen Huang said Tuesday.

Detailing how the latest advancements in computing are accelerating “the world’s most important industry,” Huang spoke at the ITF World 2023 semiconductor conference in Antwerp, Belgium.

Huang delivered his remarks via video to a gathering of leaders from across the semiconductor, technology and communications industries.

“I am thrilled to see NVIDIA accelerated computing and AI in service of the world’s chipmaking industry,” Huang said as he detailed how advancements in accelerated computing, AI and semiconductor manufacturing intersect.

AI, Accelerated Computing Step Up

The exponential performance increase of the CPU has been the governing dynamic of the technology industry for nearly four decades, Huang said.

But over the past few years CPU design has matured, he said. The rate at which semiconductors become more powerful and efficient is slowing, even as demand for computing capability soars.

“As a result, global demand for cloud computing is causing data center power consumption to skyrocket,” Huang said.

Huang said that striving for net zero while supporting the “invaluable benefits” of more computing power requires a new approach.

The challenge is a natural fit for NVIDIA, which pioneered accelerated computing, coupling the parallel processing capabilities of GPUs with CPUs.

This acceleration, in turn, sparked the AI revolution. A decade ago, deep learning researchers such as Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton discovered that GPUs could be cost-effective supercomputers.

Since then, NVIDIA reinvented its computing stack for deep learning, opening up “multi trillion-dollar opportunities in robotics, autonomous vehicles and manufacturing,” Huang said.

By offloading and accelerating compute-intensive algorithms, NVIDIA routinely speeds up applications by 10-100x while reducing power and cost by an order of magnitude, Huang explained.

Together, AI and accelerated computing are transforming the technology industry. “We are experiencing two simultaneous platform transitions — accelerated computing and generative AI,” Huang said.

AI, Accelerated Computing Come to Chip Manufacturing

Huang explained that advanced chip manufacturing requires over 1,000 steps, producing features the size of a biomolecule. Each step must be nearly perfect to yield functional output.

“Sophisticated computational sciences are performed at every stage to compute the features to be patterned and to do defect detection for in-line process control,” Huang said. “Chip manufacturing is an ideal application for NVIDIA accelerated and AI computing.”

Huang outlined several examples of how NVIDIA GPUs are becoming increasingly integral to chip manufacturing.

Companies like D2S, IMS Nanofabrication, and NuFlare build mask writers — machines that create photomasks, stencils that transfer patterns onto wafers — using electron beams. NVIDIA GPUs accelerate the computationally demanding tasks of pattern rendering and mask process correction for these mask writers.

Semiconductor manufacturer TSMC and equipment providers KLA and Lasertech use extreme ultraviolet light, known as EUV, and deep ultraviolet light, or DUV, for mask inspection. NVIDIA GPUs play a crucial role here, too, in processing classical physics modeling and deep learning to generate synthetic reference images and detect defects.

KLA, Applied Materials, and Hitachi High-Tech use NVIDIA GPUs in their e-beam and optical wafer inspection and review systems.

And in March, NVIDIA announced that it is working with TSMC, ASML and Synopsys to accelerate computational lithography.

Computational lithography simulates Maxwell’s equations of light behavior passing through optics and interacting with photoresists, Huang explained.

Computational lithography is the largest computational workload in chip design and manufacturing, consuming tens of billions of CPU hours annually. Massive data centers run 24/7 to create reticles for new chips.

Introduced in March, NVIDIA cuLitho is a software library with optimized tools and algorithms for GPU-accelerated computational lithography.

“We have already accelerated the processing by 50 times,” Huang said. “Tens of thousands of CPU servers can be replaced by a few hundred NVIDIA DGX systems, reducing power and cost by an order of magnitude.”

The savings will reduce carbon emissions or enable new algorithms to push beyond 2 nanometers, Huang said.

What’s Next?

What’s the next wave of AI? Huang described a new kind of AI — “embodied AI,” or intelligent systems that can understand, reason about and interact with the physical world.

He said examples include robotics, autonomous vehicles and even chatbots that are smarter because they understand the physical world.

Huang offered his audience a look at NVIDIA VIMA, a multimodal embodied AI. VIMA, Huang said, can perform tasks from visual text prompts, such as “rearranging objects to match this scene.”

It can learn concepts and act accordingly, such as “This is a widget,” “That’s a thing” and then “Put this widget in that thing.” It can also learn from demonstrations and stay within specified boundaries, Huang said.

VIMA runs on NVIDIA AI, and its digital twin runs in NVIDIA Omniverse, a 3D development and simulation platform. Huang said that physics-informed AI could learn to emulate physics and make predictions that obey physical laws.

Researchers are building systems that mesh information from real and virtual worlds on a vast scale.

NVIDIA is building a digital twin of our planet, called Earth-2, which will first predict the weather, then long-range weather, and eventually climate. NVIDIA’s Earth-2 team has created FourCastNet, a physics-AI model that emulates global weather patterns 50-100,000x faster.

FourCastNet runs on NVIDIA AI, and the Earth-2 digital twin is built in NVIDIA Omniverse.

Such systems promise to address the greatest challenges of our time, such as the need for cheap, clean energy.

For example, researchers at the U.K.’s Atomic Energy Authority and the University of Manchester are creating a digital twin of their fusion reactor, using physics-AI to emulate plasma physics and robotics to control the reactions and sustain the burning plasma.

Huang said scientists could explore hypotheses by testing them in the digital twin before activating the physical reactor, improving energy yield and predictive maintenance and reducing downtime. “The reactor plasma physics-AI runs on NVIDIA AI, and its digital twin runs in NVIDIA Omniverse,” Huang said.

Such systems hold promise for further advancements in the semiconductor industry. “I look forward to physics-AI, robotics and Omniverse-based digital twins helping to advance the future of chip manufacturing,” Huang said.

PyTorch Conference 2023: Join us in San Francisco October 16-17

We’re thrilled to announce the upcoming PyTorch Conference 2023! On October 16-17, the conference will showcase PyTorch 2.0, the next-generation release of the popular machine learning framework. Hosted by the PyTorch Foundation, part of the Linux Foundation, the conference continues the tradition of bringing together leading researchers, developers, and academic communities to advance the education and development of end-to-end machine learning.

The conference agenda features an engaging lineup of events, including an opening reception, community and partner discussions, informative panels, poster sessions, enlightening use cases and community stories, and discussions of the latest trends in machine learning and deep learning development and deployment.

Call for Proposals

We are now accepting speaker proposals for the conference until July 21. The program committee will carefully review all submissions, and selected speakers will be notified by August 8. We strongly encourage both experienced and first-time speakers to submit their proposals. This conference provides an excellent opportunity to connect with the PyTorch community, share your ideas, and showcase your work.

When preparing your proposal, please consider the following guidelines:

  • What are you hoping to get from your presentation?
  • What do you expect the audience to gain from your presentation?
  • How will your presentation help better the open source ecosystem?

To help you shape your proposal, here are some suggested topics for the conference:

  • Deployments on AWS, Azure
  • Use cases and real-world applications
  • Foundational models
  • AI practices
  • Production considerations
  • PyTorch 2.X features and updates
  • Training techniques and best practices
  • Inference methodologies
  • Hardware advancements and optimizations
  • Edge computing applications
  • Scalability solutions
  • Latest research breakthroughs
  • Optimization strategies
  • Extending PyTorch through customizations and plugins

We kindly request that you refrain from submitting sales or marketing pitches and avoid discussing unlicensed or closed-source technologies. Such talks tend to detract from the integrity of our events and are not well-received by conference attendees.

Register Today

Registration is now open! Get your ticket today and secure your spot: https://events.linuxfoundation.org/pytorch-conference/register/

Thank you for your interest, and we look forward to a successful PyTorch Conference 2023!

Larger language models do in-context learning differently

There have recently been tremendous advances in language models, partly because they can perform tasks with strong performance via in-context learning (ICL), a process whereby models are prompted with a few examples of input-label pairs before performing the task on an unseen evaluation example. In general, models’ success at in-context learning is enabled by:

  • Their use of semantic prior knowledge from pre-training to predict labels while following the format of in-context examples (e.g., seeing examples of movie reviews with “positive sentiment” and “negative sentiment” as labels and performing sentiment analysis using prior knowledge).
  • Learning the input-label mappings in context from the presented examples (e.g., finding a pattern that positive reviews should be mapped to one label, and negative reviews should be mapped to a different label).

In “Larger language models do in-context learning differently”, we aim to learn about how these two factors (semantic priors and input-label mappings) interact with each other in ICL settings, especially with respect to the scale of the language model that’s used. We investigate two settings to study these two factors — ICL with flipped labels (flipped-label ICL) and ICL with semantically-unrelated labels (SUL-ICL). In flipped-label ICL, labels of in-context examples are flipped so that semantic priors and input-label mappings disagree with each other. In SUL-ICL, labels of in-context examples are replaced with words that are semantically unrelated to the task presented in-context. We found that overriding prior knowledge is an emergent ability of model scale, as is the ability to learn in-context with semantically-unrelated labels. We also found that instruction tuning strengthens the use of prior knowledge more than it increases the capacity to learn input-label mappings.

An overview of flipped-label ICL and semantically-unrelated label ICL (SUL-ICL), compared with regular ICL, for a sentiment analysis task. Flipped-label ICL uses flipped labels, forcing the model to override semantic priors in order to follow the in-context examples. SUL-ICL uses labels that are not semantically related to the task, which means that models must learn input-label mappings in order to perform the task because they can no longer rely on the semantics of natural language labels.

Experiment design

For a diverse dataset mixture, we experiment on seven natural language processing (NLP) tasks that have been widely used: sentiment analysis, subjective/objective classification, question classification, duplicated-question recognition, entailment recognition, financial sentiment analysis, and hate speech detection. We test five language model families, PaLM, Flan-PaLM, GPT-3, InstructGPT, and Codex.

Flipped labels

In this experiment, labels of in-context examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override their priors. In this setting, models that are able to override prior knowledge and learn input-label mappings in-context should experience a decrease in performance (since ground-truth evaluation labels are not flipped).
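
To make the setup concrete, here is a toy sketch of how a flipped-label prompt could be assembled for sentiment analysis. The helper and example reviews are invented for illustration; this is not the paper’s code.

```python
# Sketch: building an in-context prompt with some example labels flipped.
import random

FLIP = {"positive": "negative", "negative": "positive"}

def make_prompt(examples, query, flip_fraction=1.0, seed=0):
    rng = random.Random(seed)
    lines = []
    for text, label in examples:
        if rng.random() < flip_fraction:  # flip this in-context example's label
            label = FLIP[label]
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")  # unflipped evaluation example
    return "\n\n".join(lines)

examples = [("A delightful, moving film.", "positive"),
            ("Two hours I will never get back.", "negative")]
print(make_prompt(examples, "An instant classic.", flip_fraction=1.0))
```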

The ability to override semantic priors when presented with flipped in-context example labels emerges with model scale. Smaller models cannot flip predictions to follow flipped labels (performance only decreases slightly), while larger models can do so (performance decreases to well below 50%).

We found that when no labels are flipped, larger models have better performance than smaller models (as expected). But when we flip more and more labels, the performance of small models stays relatively flat, while large models experience large performance drops to well below random guessing (e.g., 90% → 22.5% for code-davinci-002).

These results indicate that large models can override prior knowledge from pre-training when contradicting input-label mappings are presented in-context. Small models can’t do this, making this ability an emergent phenomenon of model scale.

Semantically-unrelated labels

In this experiment, we replace labels with semantically-irrelevant ones (e.g., for sentiment analysis, we use “foo/bar” instead of “negative/positive”), which means that the model can only perform ICL by learning from input-label mappings. If a model mostly relies on prior knowledge for ICL, then its performance should decrease after this change since it will no longer be able to use semantic meanings of labels to make predictions. A model that can learn input–label mappings in-context, on the other hand, would be able to learn these semantically-unrelated mappings and should not experience a major drop in performance.
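
The SUL-ICL condition only changes the label strings, so the `make_prompt` helper from the earlier sketch can be reused with a semantically unrelated label mapping (again purely illustrative):

```python
# Sketch: remapping labels to semantically unrelated tokens for SUL-ICL.
SUL = {"positive": "foo", "negative": "bar"}
sul_examples = [(text, SUL[label]) for text, label in examples]
# No flipping here; the model must infer what "foo" and "bar" mean in-context.
print(make_prompt(sul_examples, "An instant classic.", flip_fraction=0.0))
```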

Small models rely more on semantic priors than large models do, as indicated by the greater decrease in performance for small models than for large models when using semantically-unrelated labels (i.e., targets) instead of natural language labels. For each plot, models are shown in order of increasing model size (e.g., for GPT-3 models, a is smaller than b, which is smaller than c).

Indeed, we see that using semantically-unrelated labels results in a greater performance drop for small models. This suggests that smaller models primarily rely on their semantic priors for ICL rather than learning from the presented input-label mappings. Large models, on the other hand, have the ability to learn input-label mappings in-context when the semantic nature of labels is removed.

We also find that including more in-context examples (i.e., exemplars) results in a greater performance improvement for large models than it does for small models, indicating that large models are better at learning from in-context examples than small models are.

In the SUL-ICL setup, larger models benefit more from additional examples than smaller models do.

Instruction tuning

Instruction tuning is a popular technique for improving model performance, which involves tuning models on various NLP tasks that are phrased as instructions (e.g., “Question: What is the sentiment of the following sentence, ‘This movie is great.’ Answer: Positive”). Since the process uses natural language labels, however, an open question is whether it improves the ability to learn input-label mappings or whether it strengthens the ability to recognize and apply semantic prior knowledge. Both of these would lead to an improvement in performance on standard ICL tasks, so it’s unclear which of these occur.

We study this question by running the same two setups as before, only this time we focus on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).

First, we find that Flan-PaLM is better than PaLM when we use semantically-unrelated labels. This effect is very prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which isn’t particularly surprising.

Instruction-tuned language models are better at learning input–label mappings than pre-training–only language models are.

More interestingly, we saw that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that the instruction-tuned models were unable to override their prior knowledge (Flan-PaLM models don’t reach below random guessing with 100% flipped labels, but PaLM models without instruction tuning can reach 31% accuracy in the same setting). These results indicate that instruction tuning must increase the extent to which models rely on semantic priors when they’re available.

Instruction-tuned models are worse than pre-training–only models at learning to override semantic priors when presented with flipped labels in-context.

Combined with the previous result, we conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the usage of semantic prior knowledge more.

Conclusion

We examined the extent to which language models learn in-context by utilizing prior knowledge learned during pre-training versus input-label mappings presented in-context.

We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale. We then found that successfully doing ICL using semantically-unrelated labels is another emergent ability of model scale. Finally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input-label mappings but also strengthens the use of semantic prior knowledge even more.

Future work

These results underscore how the ICL behavior of language models can change depending on their scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of reasoning in which input-label mappings can potentially be learned for arbitrary symbols. Future research could help provide insights on why these phenomena occur with respect to model scale.

Demand forecasting at Getir built with Amazon Forecast

This is a guest post co-authored by Nafi Ahmet Turgut, Mutlu Polatcan, Pınar Baki, Mehmet İkbal Özmen, Hasan Burak Yel, and Hamza Akyıldız from Getir.

Getir is the pioneer of ultrafast grocery delivery. The tech company has revolutionized last-mile delivery with its “groceries in minutes” delivery proposition. Getir was founded in 2015 and operates in Turkey, the UK, the Netherlands, Germany, France, Spain, Italy, Portugal, and the United States. Today, Getir is a conglomerate incorporating nine verticals under the same brand.

Predicting future demand is one of the most important insights for Getir and one of the biggest challenges we face. Getir relies heavily on accurate demand forecasts at a SKU level when making business decisions in a wide range of areas, including marketing, production, inventory, and finance. Accurate forecasts are necessary for supporting inventory holding and replenishment decisions. Having a clear and reliable picture of predicted demand for the next day or week allows us to adjust our strategy and increase our ability to meet sales and revenue goals.

Getir used Amazon Forecast, a fully managed service that uses machine learning (ML) algorithms to deliver highly accurate time series forecasts, to increase revenue by four percent and reduce waste cost by 50 percent. In this post, we describe how we used Forecast to achieve these benefits. We outline how we built an automated demand forecasting pipeline using Forecast, orchestrated by AWS Step Functions, to predict daily demand for SKUs. This solution led to highly accurate forecasting for over 10,000 SKUs across all countries where we operate, and contributed significantly to our ability to develop highly scalable internal supply chain processes.

Forecast automates much of the time-series forecasting process, enabling you to focus on preparing your datasets and interpreting your predictions.

Step Functions is a fully managed service that makes it easier to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function helps you scale more easily and change applications more quickly. Step Functions automatically triggers and tracks each step and retries when there are errors, so your application executes in order and as expected.

Solution overview

Six people from Getir’s data science team and infrastructure team worked together on this project. The project was completed in 3 months and deployed to production after 2 months of testing.

The following diagram shows the solution’s architecture.

The model pipeline is executed separately for each country. The architecture includes four Airflow cron jobs running on a defined schedule. The pipeline starts with a feature-creation job, which computes the features and loads them into Amazon Redshift. Next, a feature-processing job prepares the daily features stored in Amazon Redshift and unloads the time series data to Amazon Simple Storage Service (Amazon S3). A second Airflow job triggers the Forecast pipeline via Amazon EventBridge. The pipeline consists of AWS Lambda functions, which create predictors and forecasts based on parameters stored in Amazon S3. Forecast reads data from Amazon S3, trains the model with hyperparameter optimization (HPO) to optimize model performance, and produces future predictions for product sales. Then the Step Functions “WaitInProgress” pipeline is triggered for each country, enabling parallel execution of a pipeline per country.
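
As an illustration of the predictor-creation step described above, the Lambda that calls Forecast might look roughly like the boto3 sketch below. The ARNs, names, and parameter layout are assumptions for illustration, not our production code.

```python
# Sketch of a Lambda that creates an Amazon Forecast predictor via boto3.
# Dataset-group ARN, names, and horizon are placeholders for illustration.
import boto3

forecast = boto3.client("forecast")

def handler(event, context):
    params = event["parameters"]  # parameters loaded from S3 upstream (assumed shape)
    response = forecast.create_predictor(
        PredictorName=params["predictor_name"],
        AlgorithmArn="arn:aws:forecast:::algorithm/CNN-QR",
        ForecastHorizon=params["horizon_days"],
        PerformHPO=True,  # hyperparameter optimization, as described above
        InputDataConfig={"DatasetGroupArn": params["dataset_group_arn"]},
        FeaturizationConfig={"ForecastFrequency": "D"},  # daily demand
    )
    return {"PredictorArn": response["PredictorArn"]}
```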

Algorithm Selection

Amazon Forecast has six built-in algorithms (ARIMA, ETS, NPTS, Prophet, DeepAR+, CNN-QR), which fall into two groups: statistical and deep/neural network. Among these, deep/neural network algorithms are more suitable for e-commerce forecasting problems because they accept item metadata features, forward-looking features for campaign and marketing activities, and, most importantly, related time series features. Deep/neural network algorithms also perform very well on sparse datasets and in cold-start (new item introduction) scenarios.

Overall, in our experimentation, we observed that deep/neural network models performed significantly better than the statistical models. We therefore focused our deep-dive testing on DeepAR+ and CNN-QR.

One of the most important benefits of Amazon Forecast is scalability and accurate results for many product and country combinations. In our testing both DeepAR+ and CNN-QR algorithms brought success in capturing trends and seasonality, allowing us to obtain efficient results in products whose demand changes very frequently.

Deep AutoRegressive Plus (DeepAR+) is a supervised univariate forecasting algorithm based on recurrent neural networks (RNNs) created by Amazon Research. Its main advantages are that it is easily scalable, able to incorporate relevant covariates into the data (such as related data and metadata), and able to forecast cold-start items. Instead of fitting separate models for each time series, it creates a global model from related time series to handle widely varying scales through rescaling and velocity-based sampling. The RNN architecture incorporates a negative binomial likelihood to produce probabilistic forecasts and is advocated to outperform traditional single-item forecasting methods (like Prophet) by the authors of DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.

We ultimately selected the Amazon CNN-QR (Convolutional Neural Network – Quantile Regression) algorithm for our forecasting due to its high performance in the backtest process. CNN-QR is a proprietary ML algorithm developed by Amazon for forecasting scalar (one-dimensional) time series using causal Convolutional Neural Networks (CNNs).

As previously mentioned, CNN-QR can employ related time series and metadata about the items being forecasted. Metadata must include an entry for all unique items in the target time series, which in our case are the products whose demand we are forecasting. To improve accuracy, we used category and subcategory metadata, which helped the model understand the relationship between certain products, including complementary and substitutes. For example, for beverages, we provide an additional flag for snacks since the two categories are complementary to each other.

One significant advantage of CNN-QR is its ability to forecast without future related time series, which is important when you can’t provide related features for the forecast window. This capability, along with its forecast accuracy, meant that CNN-QR produced the best results with our data and use case.

Forecast Output

Forecasts created through the system are written to separate S3 buckets on a per-country basis as they are received. Daily jobs then write the forecasts to Amazon Redshift by SKU and country. We then carry out daily product stock planning based on our forecasts.

On an ongoing basis, we calculate mean absolute percentage error (MAPE) ratios with product-based data, and optimize model and feature ingestion processes.
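
MAPE is straightforward to compute from matched actual and forecast values; here is a minimal example (the demand numbers are invented):

```python
# Minimal MAPE computation over matched actual/forecast demand arrays.
import numpy as np

def mape(actual, predicted) -> float:
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    mask = actual != 0  # avoid division by zero on zero-sale days
    return float(np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100)

print(mape([120, 80, 0, 60], [100, 90, 5, 63]))  # ~11.4 (percent)
```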

Conclusion

In this post, we walked through an automated demand forecasting pipeline we built using Amazon Forecast and AWS Step Functions.

With Amazon Forecast we improved our country-specific MAPE by 10 percent. This has driven a four percent revenue increase and decreased our waste costs by 50 percent. In addition, we achieved an 80 percent improvement in training times for our daily forecasts, improving scalability. We are able to forecast over 10,000 SKUs daily in all the countries we serve.

For more information about how to get started building your own pipelines with Forecast, see Amazon Forecast resources. You can also visit AWS Step Functions to get more information about how to build automated processes and orchestrate and create ML pipelines. Happy forecasting, and start improving your business today!


About the Authors

Nafi Ahmet Turgut finished his Master’s Degree in Electrical & Electronics Engineering and worked as a graduate research scientist. His focus was building machine learning algorithms to simulate nervous network anomalies. He joined Getir in 2019 and currently works as a Senior Data Science & Analytics Manager. His team is responsible for designing, implementing, and maintaining end-to-end machine learning algorithms and data-driven solutions for Getir.

Mutlu Polatcan is a Staff Data Engineer at Getir, specializing in designing and building cloud-native data platforms. He loves combining open-source projects with cloud services.

Pınar Baki received her Master’s Degree from the Computer Engineering Department at Boğaziçi University. She worked as a data scientist at Arcelik, focusing on spare-part recommendation models and age, gender, emotion analysis from speech data. She then joined Getir in 2022 as a Senior Data Scientist working on forecasting and search engine projects.

Mehmet İkbal Özmen received his Master’s Degree in Economics and worked as a Graduate Research Assistant. His research area mainly covered economic time series models, Markov simulations, and recession forecasting. He joined Getir in 2019 and currently works as a Data Science & Analytics Manager. His team is responsible for optimization and forecast algorithms to solve the complex problems experienced by the operation and supply chain businesses.

Hasan Burak Yel received his Bachelor’s Degree in Electrical & Electronics Engineering at Boğaziçi University. He worked at Turkcell, mainly focused on time series forecasting, data visualization, and network automation. He joined Getir in 2021 and currently works as a Lead Data Scientist with the responsibility of Search & Recommendation Engine and Customer Behavior Models.

Hamza Akyıldız received his Bachelor’s Degree in Mathematics and Computer Engineering at Boğaziçi University. He focuses on machine learning algorithms and their mathematical foundations. He joined Getir in 2021 and has been working as a Data Scientist on personalization and supply chain-related projects.

Esra Kayabalı is a Senior Solutions Architect at AWS, specializing in the analytics domain including data warehousing, data lakes, big data analytics, batch and real-time data streaming and data integration. She has 12 years of software development and architecture experience. She is passionate about learning and teaching cloud technologies.

Introducing Amazon Textract Bulk Document Uploader for enhanced evaluation and analysis

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. To make it simpler to evaluate the capabilities of Amazon Textract, we have launched a new Bulk Document Uploader feature on the Amazon Textract console that enables you to quickly process your own set of documents without writing any code.

In this post, we walk through when and how to use the Amazon Textract Bulk Document Uploader to evaluate how Amazon Textract performs on your documents.

Overview of solution

The Bulk Document Uploader should be used for quick evaluation of Amazon Textract for predetermined use cases. By uploading multiple documents simultaneously through an intuitive UI, you can easily gauge how well Amazon Textract performs on your documents.

You can upload and process up to 150 documents at once. Unlike the existing Amazon Textract console demos, which impose artificial limits on the number of documents, document size, and maximum allowed number of pages, the Bulk Document Uploader supports processing up to 150 documents per request and has the same document size and page limits as the Amazon Textract APIs. This makes it more efficient for you to evaluate a larger set of documents.

The Bulk Document Uploader outputs a standard Amazon Textract JSON response and CSV file. The results are provided in JSON format for easy programmatic analysis. Additionally, a human-readable CSV file with confidence scores is provided for simple comparison and evaluation of the extracted information.

When using this feature, keep in mind the following:

  • The Bulk Document Uploader processes documents via asynchronous operations. You can track the status of the processing on the Amazon Textract console. Only the DetectDocumentText (OCR), AnalyzeDocument (Tables, Queries, Forms, and Signatures), and AnalyzeExpense APIs are currently supported (a programmatic example of one such call appears after this list).
  • The Bulk Document Uploader provides JSON results of the API operations and formatted CSV reports. You may need to rely on external tools for visualization of the data, such as displaying bounding box highlights on the document using the JSON results.
  • Using this feature to process documents incurs the same charges as regular Amazon Textract usage (depending on which feature is used), and is subject to the TPS (transactions per second) limits for APIs that are set for the account and Region. For more information on pricing, refer to Amazon Textract pricing. To learn more about Amazon Textract limits, refer to Quotas in Amazon Textract.
  • Accepted file formats for bulk uploader are JPEG, PNG, TIF, and PDF. JPEG 2000-encoded images within PDFs are also supported. JPEG and PNG files have a 10 MB size limit, whereas PDF and TIF files have a 500 MB size limit. Multi-page PDF and TIF files have a 3,000 page limit.
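
For comparison, the programmatic counterpart of one of the supported operations is a single boto3 call. The sketch below runs AnalyzeDocument with the Tables and Queries features on one local file; the file name and query text are illustrative.

```python
# Sketch: the programmatic counterpart of the bulk uploader for one document.
# The file name and query text are placeholders for illustration.
import boto3

textract = boto3.client("textract")

with open("invoice.png", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["TABLES", "QUERIES"],
        QueriesConfig={"Queries": [{"Text": "What is the invoice total?"}]},
    )

# Print answers to the query along with their confidence scores.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block["Text"], block["Confidence"])
```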

Use the Bulk Document Uploader

The Bulk Document Uploader is intended to help you quickly evaluate how Amazon Textract performs on a set of your own documents, without needing to write any code. You can use the Bulk Document Uploader to process as many as 150 documents instead of uploading and processing documents individually. You can bulk upload documents directly from your computer or import documents from an existing Amazon Simple Storage Service (Amazon S3) bucket.

The Bulk Document Uploader provides results that you can download later for offline review. Each downloadable ZIP file contains the Amazon Textract API response in JSON file format and a human-readable CSV file of the output containing the extracted data and confidence scores. The output results are available for download for 7 days after processing. After 14 days, documents are cleared from the Submitted documents section. To use the Bulk Document Uploader, complete the following steps:

  1. On the Amazon Textract console, under Demos in the navigation pane, choose Bulk Document Uploader.
  2. Choose Upload documents.
  3. Specify the source of your documents.

You have two options to upload documents:

  • Import documents from S3 bucket – If you’re using an S3 bucket for your documents, provide the bucket URL and (optionally) the prefix where your documents reside, in s3://your-bucket/prefix/ format. Alternatively, choose Browse S3 to browse and select the desired location of your documents. If the Amazon S3 location you specified contains more than 150 documents, then only the first 150 documents will be sent to Amazon Textract for processing.
  • Upload documents from your computer – If you’re uploading documents from your computer, you can upload up to 50 documents at a time by choosing Upload Documents. To upload additional documents (up to the maximum of 150), choose Add documents after your initial documents are uploaded.

In this case, your documents are first uploaded to an S3 bucket in your account that is created on your behalf, so it’s important to ensure that you have permissions to access and upload documents to Amazon S3. This is a one-time action, and the same bucket is used for all subsequent uploads from your computer. If you want to upload and process the same set of documents again, you can point the Import documents from S3 bucket option at this bucket, which becomes visible once it has been created.

  4. Next, specify the Amazon Textract feature you want to use to process your documents.

You may select only one feature at a time to process your documents. If you need to evaluate additional features, you must create a separate request by selecting the desired feature and uploading the documents again. If the AnalyzeDocument – Queries feature is selected, you need to provide the queries you want to test against your documents. You can specify up to 30 queries at a time. If the uploaded documents contain multi-page (PDF or TIF) files, queries are only applied to the first page of each document. Refer to Best Practices for Queries to learn about how to construct queries.

  5. Choose Start processing to submit the documents to Amazon Textract for processing.

You can track the document status and download the output results of processed documents in the Submitted documents section. This section updates periodically, and you can manually refresh it to see if the processing is complete. Each document is processed individually, so you can either select the document with Ready to download status or wait for all documents to complete processing to download the results. The output of the processed documents will remain available for up to 7 days for download, after which they will expire. Expired documents will be cleared from the Submitted documents section after 7 additional days (14 days from the processed date). We suggest downloading and preserving the outputs within the 7-day period.

Conclusion

In this post, we announced the new Amazon Textract Bulk Document Uploader feature, which allows you to quickly process a large number of documents for evaluation purposes. You can use this feature to evaluate Amazon Textract for a predetermined use case with your documents. To learn more about how you can use Amazon Textract in your intelligent document processing workload, visit Amazon Textract features and Getting started with Amazon Textract.


About the Authors

Shashwat Sapre is a Senior Technical Product Manager with the Amazon Textract team. He is focused on building machine learning-based services for AWS customers. In his spare time, he likes reading about new technologies, traveling and exploring different cuisines.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Highlights from CHI 2023

The ways in which people are able to interact with technologies can have a profound effect on a technology’s utility and adoptability. Building computing tools and services around people’s natural styles of work, communication, and play can give technology the value it needs to have meaningful impact. For decades, human-computer interaction (HCI) has examined the relationship between people and computers to help maximize the capabilities of each across a range of experiences and situations.

The ACM CHI Conference on Human Factors in Computing Systems (CHI) is a renowned meeting ground for top talent in the HCI field and a showcase for some of its most compelling work. Hosted April 23 through April 28, this year’s conference drew more than 4,500 participants from 79 countries. Contributions from Microsoft researchers and their collaborators demonstrated the breadth of work inspired by the myriad and diverse ways people use computing today and will in the future.

Check out a few highlights from this year’s conference below, including researchers’ efforts to better understand the role of wellbeing in work, to augment memory through our sense of smell, and to bridge the gap between programmers and code-generating models, which received honorable mention at the conference.

“What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code-Generating Large Language Models
CHI 2023 Honorable Mention

Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Ben Zorn, Jack Williams, Neil Toronto, Andy Gordon

Programming languages are an extremely powerful form of user interface. They also happen to be extremely difficult to learn, especially for non-expert end-user programmers who lack training in computing. What if end-user programmers could instead use a natural language they already know? This prospect can be realized through large language models (LLMs): deep neural networks using the transformer architecture, trained on large corpora, and fine-tuned to generate code from natural language. Despite impressive benchmark performance, LLMs are beset with issues in practical use. Lab and field studies have shown that the mapping between natural language and code is poorly understood, that generated code can contain subtle bugs, and that generated code can be difficult to verify.

In their paper, researchers consider the specific problem of abstraction matching: when the user has well-formed intent, how do they select an utterance from the near infinite space of naturalistic utterances that they believe the system will reliably map to a satisfactory solution? This involves “matching” the utterance to the right level of “abstraction” by specifying the utterance at a level of granularity and detail that matches the set of actions the system can take and selecting suitable words and grammar.

Workplace Rhythm Variability and Emotional Distress in Information Workers

Subigya Kumar Nepal, Javier Hernandez, Judith Amores, Mehrab Bin Morshed, Robert Lewis, Hemma Prafullchandra, Mary Czerwinski

Regularity in daily activities has been linked to positive wellbeing outcomes, but previous studies have mainly focused on clinical populations and traditional daily activities such as sleep and exercise. This research extends prior work by examining the regularity of both self-reported and digital activities of 49 information workers in a four-week naturalistic study. Findings suggest that greater variability in self-reported mood, job demands, lunch time, and sleep quality may be associated with increased stress, anxiety, and depression. However, when it comes to digital activity–based measures, greater variability in rhythm is associated with reduced emotional distress. This study expands our understanding of workers and the potential insights that can be gained from analyzing technology interactions and wellbeing.


Olfactory Wearables for Targeted Memory Reactivation

Judith Amores, Nirmita Mehra, Bjoern Rasch, Pattie Maes

This paper investigates how a smartphone-controlled olfactory wearable might improve memory recall. Researchers conducted a within-subjects experiment with 32 participants using the device and not using the device (control). In the experimental condition, bursts of odor were released during visuo-spatial memory navigation tasks, which also had a language learning component, and re-released during sleep the following night in the subjects’ homes. The researchers found that compared with the control condition, there was an improvement in memory performance when using the scent wearable in memory tasks that involved walking in a physical space. Furthermore, participants recalled more objects and translations when re-exposed to the same scent during the recall test in addition to during sleep. These effects were statistically significant, and in the object recall task, they also persisted for more than a week. This experiment demonstrates a potential practical application of olfactory interfaces that can interact with a user during wake, as well as sleep, to support memory.

AdHocProx: Sensing Mobile, Ad-Hoc Collaborative Device Formations using Dual Ultra-Wideband Radios

Richard Li, Teddy Seyed, Nicolai Marquardt, Eyal Ofek, Steve Hodges, Mike Sinclair, Hugo Romat, Michel Pahud, Jatin Sharma, William A. S. Buxton, Ken Hinckley, Nathalie Henry Riche

In their paper, researchers present AdHocProx, a system that uses device-relative, inside-out sensing to augment co-located collaboration across multiple devices without recourse to externally anchored beacons or even reliance on Wi-Fi connectivity.

AdHocProx achieves this via sensors, including dual ultra-wideband (UWB) radios for sensing distance and angle to other devices in dynamic, ad-hoc arrangements, and capacitive grip sensing to determine where the user’s hands hold the device and to partially correct for the resulting UWB signal attenuation. All spatial sensing and communication take place via the side-channel capability of the UWB radios, suitable for small-group collaboration across up to four devices (eight UWB radios).

Together, these sensors detect proximity and natural, socially meaningful device movements to enable contextual interaction techniques. Researchers find that AdHocProx obtains 95 percent accuracy in recognizing various ad-hoc device arrangements in an offline evaluation, with participants particularly appreciative of interaction techniques that automatically leverage proximity awareness and relative orientation among multiple devices.
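As a rough illustration of the sensing involved (a sketch under assumptions, not the authors’ implementation), a single UWB (distance, angle) reading can be converted into a device-relative position, with a hypothetical factor standing in for the grip-based attenuation correction:

```python
# Hypothetical sketch: turn one UWB (distance, angle) reading into a
# device-relative 2D position. The grip_attenuation factor is an assumed
# stand-in for the capacitive-grip correction described in the paper.
import math

def relative_position(distance_m, angle_rad, grip_attenuation=1.0):
    """x, y of a peer device relative to this one, in meters."""
    corrected = distance_m / grip_attenuation  # undo assumed range inflation
    return (corrected * math.cos(angle_rad), corrected * math.sin(angle_rad))

# A peer measured 1.2 m away at 30 degrees, hand partially over the antenna.
print(relative_position(1.2, math.radians(30), grip_attenuation=1.05))
```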

Escapement: A Tool for Interactive Prototyping with Video via Sensor-Mediated Abstraction of Time

Molly Jane Nicholas, Nicolai Marquardt, Michel Pahud, Nathalie Henry Riche, Hugo Romat, Christopher Collins, David Ledo, Rohan Kadekodi, Badrish Chandramouli, Ken Hinckley

This paper presents Escapement, a video prototyping tool built around a powerful new concept for prototyping screen-based interfaces: flexibly mapping sensor values to dynamic playback control of videos. This recasts the time dimension of video mockups as sensor-mediated interaction.

This abstraction of time as interaction, which the researchers dub video-escapement prototyping, empowers designers to rapidly explore and viscerally experience direct touch or sensor-mediated interactions across one or more device displays. The system affords cross-device and bidirectional remote (telepresent) experiences via cloud-based state sharing across multiple devices. This makes Escapement especially potent for exploring multi-device, dual-screen, or remote-work interactions for screen-based applications. Researchers share the results of observations of long-term usage of video-escapement techniques with experienced interaction designers and articulate design choices for supporting a reflective, iterative, and open-ended creative design process.
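The core mapping is easy to sketch. Assuming a sensor with a known range and a fixed-length video (the helper below is hypothetical, not the tool’s API), playback position is driven by the sensor value rather than by clock time:

```python
# Hypothetical helper, not Escapement's API: map a raw sensor reading onto
# a frame index so the sensor, not the clock, scrubs the video mockup.
def frame_for_sensor(value, lo, hi, n_frames):
    t = (value - lo) / (hi - lo)   # normalize reading to 0..1
    t = min(max(t, 0.0), 1.0)      # clamp out-of-range readings
    return round(t * (n_frames - 1))

# e.g., a hinge-angle sensor (0-180 degrees) driving a 240-frame mockup:
print(frame_for_sensor(90.0, 0.0, 180.0, 240))  # mid-hinge -> frame 120
```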

Your Mileage May Vary: Case Study of a Robotic Telepresence Pilot Roll-out for a Hybrid Knowledge Work Organization

Andriana Boudouraki, Joel E. Fischer, Stuart Reeves, Sean Rintel

Organizations wishing to maintain employee satisfaction for hybrid collaboration need to explore flexible solutions that provide value for both remote and on-site employees. This case study reports on the roll-out of a telepresence robot pilot at Microsoft Research Cambridge to test whether robots would provide enjoyable planned and unplanned encounters between remote and on-site employees. Researchers describe the work that was undertaken to prepare for the roll-out, including the occupational health and safety assessment, systems for safety and security, and the information for employees on safe and effective use practices. The pilot ended after three months, and robot use was discontinued after the opportunities were weighed against low adoption and other challenges. The researchers discuss the pros and cons within this organizational setting and make suggestions for future work and roll-outs.

Focus Time for Wellbeing and Work Engagement of Information Workers 

Koustuv Saha, Shamsi Iqbal 

Having little time for focused work is a major challenge in information work. While research has explored computing-assisted, user-facing solutions for protecting time for focused work, there’s limited empirical evidence about the effectiveness of these features on wellbeing and work engagement. To address this gap, researchers studied the effects of automatically scheduling time for focused work on people’s work calendars using the “focus time” feature in Outlook calendars. They conducted an experimental study over six weeks with 15 treatment and 10 control participants, who responded to survey questions on wellbeing and work engagement throughout the study. The researchers found that the treatment participants showed higher wellbeing, including increased excitement, relaxation, and satisfaction, and decreased anger, frustration, tiredness, and stress. The paper examines the needs, benefits, and challenges of scheduling focus time and discusses the importance of, and design recommendations for, mechanisms and tools supporting focused work.

The post Highlights from CHI 2023 appeared first on Microsoft Research.


Consensus and subjectivity of skin tone annotation for ML fairness

Skin tone is an observable characteristic that is subjective, perceived differently by individuals (e.g., depending on their location or culture) and thus is complicated to annotate. That said, the ability to reliably and accurately annotate skin tone is highly important in computer vision. This became apparent in 2018, when the Gender Shades study highlighted that computer vision systems struggled to detect people with darker skin tones, and performed particularly poorly for women with darker skin tones. The study highlights the importance for computer vision researchers and practitioners to evaluate their technologies across the full range of skin tones and at intersections of identities. Beyond evaluating model performance on skin tone, skin tone annotations enable researchers to measure diversity and representation in image retrieval systems, dataset collection, and image generation. For all of these applications, a collection of meaningful and inclusive skin tone annotations is key.

Monk Skin Tone (MST) Scale. See more at skintone.google.

Last year, in a step toward more inclusive computer vision systems, Google’s Responsible AI and Human-Centered Technology team in Research partnered with Dr. Ellis Monk to openly release the Monk Skin Tone (MST) Scale, a skin tone scale that captures a broad spectrum of skin tones. In comparison to an industry-standard scale like the Fitzpatrick Skin-Type Scale, which was designed for dermatological use, the MST offers a more inclusive representation across the range of skin tones and was designed for a broad range of applications, including computer vision.

Today we’re announcing the Monk Skin Tone Examples (MST-E) dataset to help practitioners understand the MST scale and train their human annotators. This dataset has been made publicly available to enable practitioners everywhere to create more consistent, inclusive, and meaningful skin tone annotations. Along with this dataset, we’re providing a set of recommendations, noted below, around the MST scale and MST-E dataset so we can all create products that work well for all skin tones.

Since we launched the MST, we’ve been using it to improve Google’s computer vision systems to make equitable image tools for everyone and to improve representation of skin tone in Search. Computer vision researchers and practitioners outside of Google, like the curators of MetaAI’s Casual Conversations dataset, are recognizing the value of MST annotations to provide additional insight into diversity and representation in datasets. Incorporation into widely available datasets like these is essential to give everyone the ability to ensure they are building more inclusive computer vision technologies and can test the quality of their systems and products across a wide range of skin tones.

Our team has continued to conduct research to understand how we can continue to advance our understanding of skin tone in computer vision. One of our core areas of focus has been skin tone annotation, the process by which human annotators are asked to review images of people and select the best representation of their skin tone. MST annotations enable a better understanding of the inclusiveness and representativeness of datasets across a wide range of skin tones, thus enabling researchers and practitioners to evaluate quality and fairness of their datasets and models. To better understand the effectiveness of MST annotations, we’ve asked ourselves the following questions:

  • How do people think about skin tone across geographic locations?
  • What does global consensus of skin tone look like?
  • How do we effectively annotate skin tone for use in inclusive machine learning (ML)?

The MST-E dataset

The MST-E dataset contains 1,515 images and 31 videos of 19 subjects spanning the 10-point MST scale; the subjects and images were sourced through TONL, a stock photography company focusing on diversity. The 19 subjects include individuals of different ethnicities and gender identities to help human annotators decouple the concept of skin tone from race. The primary goal of this dataset is to enable practitioners to train their human annotators and test for consistent skin tone annotations across various environmental capture conditions.

The MST-E image set contains 1,515 images and 31 videos featuring 19 models taken under various lighting conditions and facial expressions. Images by TONL. Copyright TONL.CO 2022 ALL RIGHTS RESERVED. Used with permission.

All images of a subject were collected in a single day to reduce variation of skin tone due to seasonal or other temporal effects. Each subject was photographed in various poses, facial expressions, and lighting conditions. In addition, Dr. Monk annotated each subject with a skin tone label and then selected a “golden” image for each subject that best represents their skin tone. In our research we compare annotations made by human annotators to those made by Dr. Monk, an academic expert in social perception and inequality.

Terms of use

Each model selected as a subject provided consent for their images and videos to be released. TONL has given permission for these images to be released as part of MST-E and used for research or human-annotator-training purposes only. The images are not to be used to train ML models.

Challenges with forming consensus of MST annotations

Although skin tone is easy for a person to see, it can be challenging to systematically annotate across multiple people due to issues with technology and the complexity of human social perception.

On the technical side, factors like pixelation, the lighting conditions of an image, or a person’s monitor settings can affect how skin tone appears on a screen. You might notice this yourself the next time you change the display settings while watching a show. The hue, saturation, and brightness could all affect how skin tone is displayed on a monitor. Despite these challenges, we find that human annotators are able to learn to become invariant to the lighting conditions of an image when annotating skin tone.
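As a small illustration of that point (not an example from the post), scaling only the brightness component of a color noticeably changes how the same skin tone renders on screen:

```python
# Illustrative only: the RGB value below is arbitrary and not drawn from
# the MST scale. Dimming the HSV "value" channel simulates, e.g., a darker
# monitor setting changing how a fixed skin tone appears.
import colorsys

def dim(rgb, factor):
    """Scale the brightness (HSV value) of an RGB color given in 0..1."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return colorsys.hsv_to_rgb(h, s, min(v * factor, 1.0))

skin = (0.76, 0.57, 0.44)   # an arbitrary skin-like color
print(dim(skin, 0.7))       # same color on a "dimmer" display
```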

On the social perception side, aspects of a person’s life like their location, culture, and lived experience may affect how they annotate various skin tones. We found some evidence for this when we asked photographers in the United States and photographers in India to annotate the same image. The photographers in the United States viewed this person as somewhere between MST-5 & MST-7. However, the photographers in India viewed this person as somewhere between MST-3 & MST-5.

The distribution of Monk Skin Tone Scale annotations for this image from a sample of 5 photographers in the U.S. and 5 photographers in India.

Continuing this exploration, we asked trained annotators from five different geographical regions (India, Philippines, Brazil, Hungary, and Ghana) to annotate skin tone on the MST scale. Within each region, each image was annotated by 5 annotators drawn from a broader pool of annotators in that region. For example, a region might have 20 annotators, 5 of whom would be selected to review a particular image.

With these annotations we found two important details. First, annotators within a region had similar levels of agreement on a single image. Second, annotations between regions were, on average, significantly different from each other (p < 0.05). This suggests that people from the same geographic region may have a similar mental model of skin tone, but this mental model is not universal.

However, even with these regional differences, we also find that the consensus between all five regions falls close to the MST values supplied by Dr. Monk. This suggests that a geographically diverse group of annotators can get close to the MST value annotated by an MST expert. In addition, after training, we find no significant difference between annotations on well-lit images versus poorly lit images, suggesting that annotators can become invariant to different lighting conditions in an image, a non-trivial task for ML models.
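A schematic reconstruction of this style of analysis (not the authors’ code) is sketched below; the individual ratings and the expert label are made up, and the post does not name its statistical test, so the one-way ANOVA here is an assumption.

```python
# Schematic sketch: per-region ratings, a between-region test, and a global
# consensus compared against the expert's MST label. Data are hypothetical.
import numpy as np
from scipy.stats import f_oneway

# Hypothetical MST ratings (1-10) for one image, 5 annotators per region.
ratings = {
    "India":       [4, 4, 5, 4, 5],
    "Philippines": [5, 5, 4, 5, 5],
    "Brazil":      [6, 5, 6, 6, 5],
    "Hungary":     [6, 6, 6, 5, 6],
    "Ghana":       [5, 6, 5, 5, 6],
}
expert_label = 5  # the expert's annotation (hypothetical for this image)

# Do the regions differ on average? (assumed test: one-way ANOVA)
stat, p = f_oneway(*ratings.values())
print(f"between-region difference: p = {p:.3f}")

# Global consensus across all regions vs. the expert label.
consensus = np.median([r for region in ratings.values() for r in region])
print(f"consensus = {consensus}, expert = {expert_label}")
```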

The MST-E dataset allows researchers to study annotator behavior across curated subsets controlling for potential confounders. We observed similar regional variation when annotating much larger datasets with many more subjects.

Skin tone annotation recommendations

Our research includes four major findings. First, annotators within a similar geographical region have a consistent and shared mental model of skin tone. Second, these mental models differ across different geographical regions. Third, the MST annotation consensus from a geographically diverse set of annotators aligns with the annotations provided by an expert in social perception and inequality. And fourth, annotators can learn to become invariant to lighting conditions when annotating MST.

Based on these findings, we offer a few recommendations for skin tone annotation when using the MST.

  1. Having a geographically diverse set of annotators is important to gain accurate, or close to ground truth, estimates of skin tone.
  2. Train human annotators using the MST-E dataset, which spans the entire MST spectrum and contains images in a variety of lighting conditions. This will help annotators become invariant to lighting conditions and appreciate the nuance and differences between the MST points.
  3. Given the wide range of possible annotations, we suggest having at least two annotators in at least five different geographical regions (10 ratings per image); the sketch following this list illustrates one way to aggregate such ratings.
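To make the last recommendation concrete, here is a minimal aggregation sketch. The post does not prescribe an aggregation rule, so the median below, along with the function name and the example ratings, is an assumption for illustration only.

```python
# Minimal sketch of recommendation 3 (hypothetical helper, not an official
# API): gather at least two ratings from each of at least five regions and
# take the median as the image's consensus MST value.
from statistics import median

def consensus_mst(ratings_by_region):
    """Aggregate >=2 ratings from >=5 regions into one consensus MST value."""
    assert len(ratings_by_region) >= 5, "need at least five regions"
    assert all(len(r) >= 2 for r in ratings_by_region.values()), \
        "need at least two annotators per region"
    return median(r for region in ratings_by_region.values() for r in region)

# Hypothetical ratings for one image: 2 annotators in each of 5 regions.
print(consensus_mst({
    "India": [4, 5], "Philippines": [5, 5], "Brazil": [6, 5],
    "Hungary": [6, 6], "Ghana": [5, 6],
}))  # -> 5.0
```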

Skin tone annotation, like other subjective annotation tasks, is difficult but possible. These types of annotations allow for a more nuanced understanding of model performance, and ultimately help us all to create products that work well for every person across the broad and diverse spectrum of skin tones.

Acknowledgements

We wish to thank our colleagues across Google working on fairness and inclusion in computer vision for their contributions to this work, especially Marco Andreetto, Parker Barnes, Ken Burke, Benoit Corda, Tulsee Doshi, Courtney Heldreth, Rachel Hornung, David Madras, Ellis Monk, Shrikanth Narayanan, Utsav Prabhu, Susanna Ricco, Sagar Savla, Alex Siegman, Komal Singh, Biao Wang, and Auriel Wright. We also would like to thank Annie Jean-Baptiste, Florian Koenigsberger, Marc Repnyek, Maura O’Brien, and Dominique Mungin and the rest of the team who help supervise, fund, and coordinate our data collection.
