Using AI to expand global access to reliable flood forecasts

Posted by Yossi Matias, VP Engineering & Research, and Grey Nearing, Research Scientist, Google Research

Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-related disasters has more than doubled since the year 2000 partly due to climate change. Nearly 1.5 billion people, making up 19% of the world’s population, are exposed to substantial risks from severe flood events. Upgrading early warning systems to make accurate and timely information accessible to these populations can save thousands of lives per year.

Driven by the potential impact of reliable flood forecasting on people’s lives globally, we started our flood forecasting effort in 2017. Through this multi-year journey, we advanced research over the years hand-in-hand with building a real-time operational flood forecasting system that provides alerts on Google Search, Maps, Android notifications and through the Flood Hub. However, in order to scale globally, especially in places where accurate local data is not available, more research advances were required.

In “Global prediction of extreme floods in ungauged watersheds”, published in Nature, we demonstrate how machine learning (ML) technologies can significantly improve global-scale flood forecasting relative to the current state-of-the-art for countries where flood-related data is scarce. With these AI-based technologies we extended the reliability of currently-available global nowcasts, on average, from zero to five days, and improved forecasts across regions in Africa and Asia to be similar to what are currently available in Europe. The evaluation of the models was conducted in collaboration with the European Center for Medium Range Weather Forecasting (ECMWF).

These technologies also enable Flood Hub to provide real-time river forecasts up to seven days in advance, covering river reaches across over 80 countries. This information can be used by people, communities, governments and international organizations to take anticipatory action to help protect vulnerable populations.

Flood forecasting at Google

The ML models that power the FloodHub tool are the product of many years of research, conducted in collaboration with several partners, including academics, governments, international organizations, and NGOs.

In 2018, we launched a pilot early warning system in the Ganges-Brahmaputra river basin in India, with the hypothesis that ML could help address the challenging problem of reliable flood forecasting at scale. The pilot was further expanded the following year via the combination of an inundation model, real-time water level measurements, the creation of an elevation map and hydrologic modeling.

In collaboration with academics, and, in particular, with the JKU Institute for Machine Learning we explored ML-based hydrologic models, showing that LSTM-based models could produce more accurate simulations than traditional conceptual and physics-based hydrology models. This research led to flood forecasting improvements that enabled the expansion of our forecasting coverage to include all of India and Bangladesh. We also worked with researchers at Yale University to test technological interventions that increase the reach and impact of flood warnings.

Our hydrological models predict river floods by processing publicly available weather data like precipitation and physical watershed information. Such models must be calibrated to long data records from streamflow gauging stations in individual rivers. A low percentage of global river watersheds (basins) have streamflow gauges, which are expensive but necessary to supply relevant data, and it’s challenging for hydrological simulation and forecasting to provide predictions in basins that lack this infrastructure. Lower gross domestic product (GDP) is correlated with increased vulnerability to flood risks, and there is an inverse correlation between national GDP and the amount of publicly available data in a country. ML helps to address this problem by allowing a single model to be trained on all available river data and to be applied to ungauged basins where no data are available. In this way, models can be trained globally, and can make predictions for any river location.

There is an inverse (log-log) correlation between the amount of publicly available streamflow data in a country and national GDP. Streamflow data from the Global Runoff Data Center.

Our academic collaborations led to ML research that developed methods to estimate uncertainty in river forecasts and showed how ML river forecast models synthesize information from multiple data sources. They demonstrated that these models can simulate extreme events reliably, even when those events are not part of the training data. In an effort to contribute to open science, in 2023 we open-sourced a community-driven dataset for large-sample hydrology in Nature Scientific Data.

The river forecast model

Most hydrology models used by national and international agencies for flood forecasting and river modeling are state-space models, which depend only on daily inputs (e.g., precipitation, temperature, etc.) and the current state of the system (e.g., soil moisture, snowpack, etc.). LSTMs are a variant of state-space models and work by defining a neural network that represents a single time step, where input data (such as current weather conditions) are processed to produce updated state information and output values (streamflow) for that time step. LSTMs are applied sequentially to make time-series predictions, and in this sense, behave similarly to how scientists typically conceptualize hydrologic systems. Empirically, we have found that LSTMs perform well on the task of river forecasting.

A diagram of the LSTM, which is a neural network that operates sequentially in time. An accessible primer can be found here.

Our river forecast model uses two LSTMs applied sequentially: (1) a “hindcast” LSTM ingests historical weather data (dynamic hindcast features) up to the present time (or rather, the issue time of a forecast), and (2) a “forecast” LSTM ingests states from the hindcast LSTM along with forecasted weather data (dynamic forecast features) to make future predictions. One year of historical weather data are input into the hindcast LSTM, and seven days of forecasted weather data are input into the forecast LSTM. Static features include geographical and geophysical characteristics of watersheds that are input into both the hindcast and forecast LSTMs and allow the model to learn different hydrological behaviors and responses in various types of watersheds.

Output from the forecast LSTM is fed into a “head” layer that uses mixture density networks to produce a probabilistic forecast (i.e., predicted parameters of a probability distribution over streamflow). Specifically, the model predicts the parameters of a mixture of heavy-tailed probability density functions, called asymmetric Laplacian distributions, at each forecast time step. The result is a mixture density function, called a Countable Mixture of Asymmetric Laplacians (CMAL) distribution, which represents a probabilistic prediction of the volumetric flow rate in a particular river at a particular time.

LSTM-based river forecast model architecture. Two LSTMs are applied in sequence, one ingesting historical weather data and one ingesting forecasted weather data. The model outputs are the parameters of a probability distribution over streamflow at each forecasted timestep.

Input and training data

The model uses three types of publicly available data inputs, mostly from governmental sources:

Static watershed attributes representing geographical and geophysical variables: From the HydroATLAS project, including data like long-term climate indexes (precipitation, temperature, snow fractions), land cover, and anthropogenic attributes (e.g., a nighttime lights index as a proxy for human development).
Historical meteorological time-series data: Used to spin up the model for one year prior to the issue time of a forecast. The data comes from NASA IMERG, NOAA CPC Global Unified Gauge-Based Analysis of Daily Precipitation, and the ECMWF ERA5-land reanalysis. Variables include daily total precipitation, air temperature, solar and thermal radiation, snowfall, and surface pressure.
Forecasted meteorological time series over a seven-day forecast horizon: Used as input for the forecast LSTM. These data are the same meteorological variables listed above, and come from the ECMWF HRES atmospheric model.

Training data are daily streamflow values from the Global Runoff Data Center over the time period 1980 – 2023. A single streamflow forecast model is trained using data from 5,680 diverse watershed streamflow gauges (shown below) to improve accuracy.

Location of 5,680 streamflow gauges that supply training data for the river forecast model from the Global Runoff Data Center.

Improving on the current state-of-the-art

We compared our river forecast model with GloFAS version 4, the current state-of-the-art global flood forecasting system. These experiments showed that ML can provide accurate warnings earlier and over larger and more impactful events.

The figure below shows the distribution of F1 scores when predicting different severity events at river locations around the world, with plus or minus 1 day accuracy. F1 scores are an average of precision and recall and event severity is measured by return period. For example, a 2-year return period event is a volume of streamflow that is expected to be exceeded on average once every two years. Our model achieves reliability scores at up to 4-day or 5-day lead times that are similar to or better, on average, than the reliability of GloFAS nowcasts (0-day lead time).

Distributions of F1 scores over 2-year return period events in 2,092 watersheds globally during the time period 2014-2023 from GloFAS (blue) and our model (orange) at different lead times. On average, our model is statistically as accurate as GloFAS nowcasts (0–day lead time) up to 5 days in advance over 2-year (shown) and 1-year, 5-year, and 10-year events (not shown).

Additionally (not shown), our model achieves accuracies over larger and rarer extreme events, with precision and recall scores over 5-year return period events that are similar to or better than GloFAS accuracies over 1-year return period events. See the paper for more information.

Looking into the future

The flood forecasting initiative is part of our Adaptation and Resilience efforts and reflects Google’s commitment to address climate change while helping global communities become more resilient. We believe that AI and ML will continue to play a critical role in helping advance science and research towards climate action.

We actively collaborate with several international aid organizations (e.g., the Centre for Humanitarian Data and the Red Cross) to provide actionable flood forecasts. Additionally, in an ongoing collaboration with the World Meteorological Organization (WMO) to support early warning systems for climate hazards, we are conducting a study to help understand how AI can help address real-world challenges faced by national flood forecasting agencies.

While the work presented here demonstrates a significant step forward in flood forecasting, future work is needed to further expand flood forecasting coverage to more locations globally and other types of flood-related events and disasters, including flash floods and urban floods. We are looking forward to continuing collaborations with our partners in the academic and expert communities, local governments and the industry to reach these goals.

How we are using AI for reliable flood forecasting at a global scale

An overview of our paper in Nature on how we are using AI to help scale flood forecasting to vulnerable communities around the world.Read More

Research Focus: Week of March 18, 2024

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

Large language models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. In a recent paper: Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning, researchers from Microsoft propose CoT-Influx, a novel approach that pushes the boundary of few-shot chain-of-Thought (CoT) learning to improve LLM mathematical reasoning.

Given that adding more concise CoT examples in the prompt can improve LLM reasoning performance, CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples. The pruner first selects as many crucial CoT examples as possible and then prunes unimportant tokens to fit the context window. A math reasoning dataset with diverse difficulty levels and reasoning steps is used to train the pruner, along with a math-specialized reinforcement learning approach. As a result, by enabling more CoT examples with double the context window size in tokens, CoT-Influx significantly outperforms various prompting baselines across various LLMs (LLaMA2-7B, 13B, 70B) and 5 math datasets, achieving up to 4.55% absolute improvements. Remarkably, without any fine-tuning, LLaMA2-70B with CoT-Influx surpasses GPT-3.5 and a wide range of larger LLMs (PaLM, Minerva 540B, etc.) on the GSM8K. CoT-Influx serves as a plug-and-play module for LLMs and is compatible with most existing reasoning prompting techniques, such as self-consistency and self-verification.

Read the paper

NEW RESEARCH

From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions

Organizations and individuals continuously strive to enhance their efficiency, improve time management, and optimize their work processes. Rapid advancements in AI, natural language processing, and machine learning technologies create new opportunities to develop tools that boost productivity.

In a recent paper: From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions, researchers from Microsoft present a comprehensive, user-centric approach to understand preferences in AI-based productivity agents and develop personalized solutions. The research began with a survey of 363 participants, seeking to reveal users’ specific needs and preferences for productivity agents such as relevant productivity challenges of information workers, preferred communication style and approach towards solving problems, and privacy expectations. With the survey insights, the researchers then developed a GPT-4 powered personalized productivity agent that uses telemetry data gathered from information workers via Viva Insights to provide tailored assistance. The agent’s performance was compared with alternative productivity-assistive tools, such as the traditional dashboard and AI-enabled summaries, in a study involving 40 participants. The findings highlight the importance of user-centric design, adaptability, and the balance between personalization and privacy in AI-assisted productivity tools. The insights distilled from this study could support future research to further enhance productivity solutions, ultimately leading to optimized efficiency and user experiences for information workers.

Read the paper

NEW RESEARCH

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

The size of the context window of a large language model (LLM) determines the amount of text that can be entered for processing to generate responses. The window size is specifically measured by a number of tokens—larger windows are more desirable. However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens.

In a recent paper: LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, researchers from Microsoft introduce a new method that extends the context window of pre-trained LLMs to an impressive 2.048 million tokens, without requiring direct fine-tuning on texts with extremely long lengths, which are scarce, while maintaining performance at the level of the original short context window. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of this method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding and can reuse most pre-existing optimizations.

Read the paper

NEW RESEARCH

Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants

Conversational interactions with large language models (LLMs) enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate responses. Conversations between developers and LLMs are primarily structured as question-answer pairs, where the developer is responsible for asking the right questions and sustaining conversations across multiple turns.

In a recent paper: Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants, researchers from Microsoft draw inspiration from interaction patterns and conversation analysis to design Robin, an enhanced conversational AI-assistant for debugging. Robin works with the developer collaboratively, creating hypotheses about the bug’s root cause, testing them using IDE debugging features such as breakpoints and watches, and then proposing fixes. A user study with 12 industry professionals shows that equipping the LLM-driven debugging assistant to (1) leverage the insert expansion interaction pattern; (2) facilitate turn-taking; and (3) utilize debugging workflows, leads to lowered conversation barriers, effective fault localization, and 5x improvement in bug resolution rates.

Read the paper

NEW RESEARCH

Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions

Generative AI (GenAI) systems, which can produce new content based on input like code, images, speech, video, and more, offer opportunities to increase user productivity in many tasks, such as programming and writing. However, while they boost productivity in some studies, many others show that users are working ineffectively with GenAI systems and actually losing productivity. These ‘ironies of automation’ have been observed for over three decades in human factors research on automation in aviation, automated driving, and intelligence.

In a recent paper: Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions, researchers from Microsoft draw on this extensive research alongside recent GenAI user studies to outline four key reasons for why productivity loss can occur with GenAI systems: 1) a shift in users’ roles from production to evaluation; 2) unhelpful restructuring of workflows; 3) interruptions; and 4) a tendency for automation to make easy tasks easier and hard tasks harder. We then suggest how human factors research can also inform GenAI system design to mitigate productivity loss by using approaches such as continuous feedback, system personalization, ecological interface design, task stabilization, and clear task allocation. Grounding developments in GenAI system usability in decades of research aims to ensure that the design of human-AI interactions in this rapidly moving field learns from history instead of repeating it.

Read the paper

The post Research Focus: Week of March 18, 2024 appeared first on Microsoft Research.

AI Decoded From GTC: The Latest Developer Tools and Apps Accelerating AI on PC and Workstation

Editor’s note: This post is part of the AI Decoded series, which demystifies AI by making the technology more accessible, and which showcases new hardware, software, tools and accelerations for RTX PC users.

NVIDIA’s RTX AI platform includes tools and software development kits that help Windows developers create cutting-edge generative AI features to deliver the best performance on AI PCs and workstations.

At GTC — NVIDIA’s annual technology conference — a dream team of industry luminaries, developers and researchers have come together to learn from one another, fueling what’s next in AI and accelerated computing.

This special edition of AI Decoded from GTC spotlights the best AI tools currently available and looks at what’s ahead for the 100 million RTX PC and workstation users and developers.

Chat with RTX, the tech demo and developer reference project that quickly and easily allows users to connect a powerful LLM to their own data, showcased new capabilities and new models in the GTC exhibit hall.

The winners of the Gen AI on RTX PCs contest were announced Monday. OutlookLLM, Rocket League BotChat and CLARA were highlighted in one of the AI Decoded talks in the generative AI theater and each are accelerated by NVIDIA TensorRT-LLM. Two other AI Decoded talks included using generative AI in content creation and a deep dive on Chat with RTX.

Developer frameworks and interfaces with TensorRT-LLM integration continue to grow as Jan.ai, Langchain, LlamaIndex and Oobabooga will all soon be accelerated — helping to grow the already more than 500 AI applications for RTX PCs and workstations.

NVIDIA NIM microservices are coming to RTX PCs and workstations. They provide pre-built containers, with industry standard APIs, enabling developers to accelerate deployment on RTX PCs and workstations. NVIDIA AI Workbench, an easy-to-use developer toolkit to manage AI model customization and optimization workflows, is now generally available for RTX developers.

These ecosystem integrations and tools will accelerate development of new Windows apps and features. And today’s contest winners are an inspiring glimpse into what that content will look like.

Hear More, See More, Chat More

Chat with RTX, or ChatRTX for short, uses retrieval-augmented generation, NVIDIA TensorRT-LLM software and NVIDIA RTX acceleration to bring local generative AI capabilities to RTX-powered Windows systems. Users can quickly and easily connect local files as a dataset to an open large language model like Mistral or Llama 2, enabling queries for quick, contextually relevant answers.

Moving beyond text, ChatRTX will soon add support for voice, images and new models.

Users will be able to talk to ChatRTX with Whisper — an automatic speech recognition system that uses AI to process spoken language. When the feature becomes available, ChatRTX will be able to “understand” spoken language, and provide text responses.

A future update will also add support for photos. By integrating OpenAI’s CLIP — Contrastive Language-Image Pre-training — users will be able to search by words, terms or phrases to find photos in their private library.

In addition to Google’s Gemma, ChatGLM will get support in a future update.

Developers can start with the latest version of the developer reference project on GitHub.

Generative AI for the Win

The NVIDIA Generative AI on NVIDIA RTX developer contest prompted developers to build a Windows app or plug-in.

“I found that playing against bots that react to game events with in-game messages in near real time adds a new level of entertainment to the game, and I’m excited to share my approach to incorporating AI into gaming as a participant in this developer contest. The target audience for my project is anyone who plays Rocket League with RTX hardware.” — Brian Caffey, Rocket League BotChat developer

Submissions were judged on three criteria, including a short demo video posted to social media, relative impact and ease of use of the project, and how effectively NVIDIA’s technology stack was used in the project. Each of the three winners received a pass to GTC, including a spot in the NVIDIA Deep Learning Institute GenAI/LLM courses, and a GeForce RTX 4090 GPU to power future development work.

OutlookLLM gives Outlook users generative AI features — such as email composition — securely and privately in their email client on RTX PCs and workstations. It uses a local LLM served via TensorRT-LLM.

Rocket League BotChat, for the popular Rocket League game, is a plug-in that allows bots to send contextual in-game chat messages based on a log of game events, such as scoring a goal or making a save. Designed to be used only in offline games against bot players, the plug-in is configurable in many ways via its settings menu.

CLARA (short for Command Line Assistant with RTX Acceleration) is designed to enhance the command line interface of PowerShell by translating plain English instructions into actionable commands. The extension runs locally, quickly and keeps users in their PowerShell context. Once it’s enabled, users type their English instructions and press the tab button to invoke CLARA. Installation is straightforward, and there are options for both script-based and manual setup.

Congratulations to the #GenAIonRTX #DevContest winners:

CLARA – Talk with computers in human language – Matthew Yaeger
Rocket League BotChat – Generate in-game chat messages – Brian Caffey
Outlook LLM – Compose emails securely with AI – Francisco Gonzalez Blanch

pic.twitter.com/i52w5Pn1n9

— NVIDIA AI Developer (@NVIDIAAIDev) March 19, 2024

From the Generative AI Theater

GTC attendees can attend three AI Decoded talks on Wednesday, March 20 at the generative AI theater. These 15-minute sessions will guide the audience through ChatRTX and how developers can productize their own personalized chatbot; how each of the three contest winners’ showed some of the possibilities for generative AI apps on RTX systems; and a celebration of artists, the tools and methods they use powered by NVIDIA technology.

In the creator session, Lee Fraser, senior developer relations manager for generative AI media and entertainment at NVIDIA, will explore why generative AI has become so popular. He’ll show off new workflows and how creators can rapidly explore ideas. Artists to be featured include Steve Talkowski, Sophia Crespo, Lim Wenhui, Erik Paynter, Vanessa Rosa and Refik Anadol.

Anadol also has an installation at the show that combines data visualization and imagery based on that data.

Ecosystem of Acceleration

Top creative app developers, like Blackmagic Design and Topaz Labs have integrated RTX AI acceleration in their software. TensorRT doubles the speed of AI effects like rotoscoping, denoising, super-resolution and video stabilization in the DaVinci Resolve and Topaz apps.

“Blackmagic Design and NVIDIA’s ongoing collaborations to run AI models on RTX AI PCs will produce a new wave of groundbreaking features that give users the power to create captivating and immersive content, faster.” — Rohit Gupta, director of software development at Blackmagic Design

TensorRT-LLM is being integrated with popular developer frameworks and ecosystems such as LangChain, LlamaIndex, Oobabooga and Jan.AI. Developers and enthusiasts can easily access the performance benefits of TensorRT-LLM through top LLM frameworks to build and deploy generative AI apps to both local and cloud GPUs.

Enthusiasts can also try out their favorite LLMs — accelerated with TensorRT-LLM on RTX systems — through the Oobabooga and Jan.AI chat interfaces.

AI That’s NIMble, AI That’s Quick

Developers and tinkerers can tap into NIM microservices. These pre-built AI “containers,” with industry-standard APIs, provide an optimized solution that helps to reduce deployment times from weeks to minutes. They can be used with more than two dozen popular models from NVIDIA, Getty Images, Google, Meta, Microsoft, Shutterstock and more.

NVIDIA AI Workbench is now generally available, helping developers quickly create, test and customize pretrained generative AI models and LLMs on RTX GPUs. It offers streamlined access to popular repositories like Hugging Face, GitHub and NVIDIA NGC, along with a simplified user interface that enables developers to easily reproduce, collaborate on and migrate projects.

Projects can be easily scaled up when additional performance is needed — whether to the data center, a public cloud or NVIDIA DGX Cloud — and then brought back to local RTX systems on a PC or workstation for inference and light customization. AI Workbench is a free download and provides example projects to help developers get started quickly.

These tools, and many others announced and shown at GTC, are helping developers drive innovative AI solutions.

From the Blackwell platform’s arrival, to a digital twin for Earth’s climate, it’s been a GTC to remember. For RTX PC and workstation users and developers, it was also a glimpse into what’s next for generative AI.

See notice regarding software product information.

ScreenAI: A visual language model for UI and visually-situated language understanding

Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research

Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), that offer an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge.

To that end, we introduce “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”. ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability.

ScreenAI

ScreenAI’s architecture is based on PaLI, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a vision transformer (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems.

On top of the PaLI architecture, we employ a flexible patching strategy introduced in pix2struct. Instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios.

The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters.

ScreenAI model architecture.

Data generation

To create a pre-training dataset for ScreenAI, we first compile an extensive collection of screenshots from various devices, including desktops, mobile, and tablets. This is achieved by using publicly accessible web pages and following the programmatic exploration approach used for the RICO dataset for mobile apps. We then apply a layout annotator, based on the DETR model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an icon classifier capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an optical character recognition (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen.

A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., TEXT elements also contain the text content from OCR, IMAGE elements contain image captions, LIST_ITEMs contain all their child elements.

LLM-based data generation

We enhance the pre-training data’s diversity using PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data’s quality through human validation against a quality threshold.

You only speak JSON. Do not write text that isn’t JSON.
You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? 

The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows:
questions: [
{{question: the question,
    answer: the answer
}},
 ...
]

{THE SCREEN SCHEMA}

A sample prompt for QA data generation.

By combining the natural language capabilities of LLMs with a structured schema, we simulate a wide range of user interactions and scenarios to generate synthetic, realistic tasks. In particular, we generate three categories of tasks:

Question answering: The model is asked to answer questions regarding the content of the screenshots, e.g., “When does the restaurant open?”
Screen navigation: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., “Click the search button.”
Screen summarization: The model is asked to summarize the screen content in one or two sentences.

Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, like questions related to counting, involving reasoning, etc.

LLM-generated data. Examples for screen QA, navigation and summarization. For navigation, the action bounding box is displayed in red on the screenshot.

Experiments and results

As previously mentioned, ScreenAI is trained in two stages: pre-training and fine-tuning. Pre-training data labels are obtained using self-supervised learning and fine-tuning data labels comes from human raters.

We fine-tune ScreenAI using public QA, summarization, and navigation datasets and a variety of tasks related to UIs. For QA, we use well established benchmarks in the multimodal and document understanding field, such as ChartQA, DocVQA, Multi page DocVQA, InfographicVQA, OCR VQA, Web SRC and ScreenQA. For navigation, datasets used include Referring Expressions, MoTIF, Mug, and Android in the Wild. Finally, we use Screen2Words for screen summarization and Widget Captioning for describing specific UI elements. Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks:

Screen Annotation: Enables the evaluation model layout annotations and spatial understanding capabilities.
ScreenQA Short: A variation of ScreenQA, where its ground truth answers have been shortened to contain only the relevant information that better aligns with other QA tasks.
Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios.

The fine-tuned ScreenAI model achieves state-of-the-art results on various UI and infographic-based tasks (WebSRC and MoTIF) and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced to serve as a baseline for further research.

Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size.

Next, we examine ScreenAI’s scaling capabilities and observe that across all tasks, increasing the model size improves performances and the improvements have not saturated at the largest size.

Model performance increases with size, and the performance has not saturated even at the largest size of 5B params.

Conclusion

We introduce the ScreenAI model along with a unified representation that enables us to develop self-supervised learning tasks leveraging data from all these domains. We also illustrate the impact of data generation using LLMs and investigate improving model performance on specific aspects with modifying the training mixture. We apply all of these techniques to build multi-task trained models that perform competitively with state-of-the-art approaches on a number of public benchmarks. However, we also note that our approach still lags behind large models and further research is needed to bridge this gap.

Acknowledgements

This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li,Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. We are very grateful toTom Small for helping us create the animation in this post.

8 years later: A world Go champion’s reflections on AlphaGo

World-famous Go champion Lee Sae Dol reflects on his match against our AI system AlphaGo eight years ago.Read More

Secure by Design: NVIDIA AIOps Partner Ecosystem Blends AI for Businesses

In today’s complex business environments, IT teams face a constant flow of challenges, from simple issues like employee account lockouts to critical security threats. These situations demand both quick fixes and strategic defenses, making the job of maintaining smooth and secure operations ever tougher.

That’s where AIOps comes in, blending artificial intelligence with IT operations to not only automate routine tasks, but also enhance security measures. This efficient approach allows teams to quickly deal with minor issues and, more importantly, to identify and respond to security threats faster and with greater accuracy than before.

By using machine learning, AIOps becomes a crucial tool in not just streamlining operations but also in strengthening security across the board. It’s proving to be a game-changer for businesses looking to integrate advanced AI into their teams, helping them stay a step ahead of potential security risks.

According to IDC, the IT operations management software market is expected to grow at a rate of 10.3% annually, reaching a projected revenue of $28.4 billion by 2027. This growth underscores the increasing reliance on AIOps for operational efficiency and as a critical component of modern cybersecurity strategies.

As the rapid growth of machine learning operations continues to transform the era of generative AI, a broad ecosystem of NVIDIA partners are offering AIOps solutions that leverage NVIDIA AI to improve IT operations.

NVIDIA is helping a broad ecosystem of AIOps partners with accelerated compute and AI software. This includes NVIDIA AI Enterprise, a cloud-native stack that can run anywhere and provides a basis for AIOps through software like NVIDIA NIM for accelerated inference of AI modes, NVIDIA Morpheus for AI-based cybersecurity and NVIDIA NeMo for custom generative AI. This software facilitates GenAI-based chatbot, summarization and search functionality.

AIOps providers using NVIDIA AI include:

Dynatrace Davis hypermodal AI advances AIOps by integrating causal, predictive and generative AI techniques with the addition of Davis CoPilot. This combination enhances observability and security across IT, development, security and business operations by offering precise and actionable, AI-driven answers and automation.
Elastic offers Elasticsearch Relevance Engine (ESRE) for semantic and vector search, which integrates with popular LLMs like GPT-4 to power AI Assistants in their Observability and Security solutions. The Observability AI Assistant is a next-generation AI Ops capability that helps IT teams understand complex systems, monitor health and automate remediation of operational issues.
New Relic is advancing AIOps by leveraging its machine learning, generative AI assistant frameworks and longstanding expertise in observability. Its machine learning and advanced logic helps IT teams reduce alerting noise, improve mean time to detect and mean time to repair, automate root cause analysis and generate retrospectives. Its GenAI assistant, New Relic AI, accelerates issue resolution by allowing users to identify, explain and resolve errors without switching contexts, and suggests and applies code fixes directly in a developer’s integrated development environment. It also extends incident visibility and prevention to non-technical teams by automatically producing high-level system health reports, analyzing and summarizing dashboards and answering plain-language questions about a user’s applications, infrastructure and services. New Relic also provides full-stack observability for AI-powered applications benefitting from NVIDIA GPUs.
PagerDuty has introduced a new feature in PagerDuty Copilot, integrating a generative AI assistant within Slack to offer insights from incident start to resolution, streamlining the incident lifecycle and reducing manual task loads for IT teams.
ServiceNow’s commitment to creating a proactive IT operations encompasses automating insights for rapid incident response, optimizing service management and detecting anomalies. Now, in collaboration with NVIDIA, it is pushing into generative AI to further innovate technology service and operations.
Splunk’s technology platform applies artificial intelligence and machine learning to automate the processes of identifying, diagnosing and resolving operational issues and threats, thereby enhancing IT efficiency and security posture. Splunk IT Service Intelligence serves as Splunk’s primary AIOps offering, providing embedded AI-driven incident prediction, detection and resolution all from one place.

Cloud service providers including Amazon Web Services (AWS), Google Cloud and Microsoft Azure enable organizations to automate and optimize their IT operations, leveraging the scale and flexibility of cloud resources.

AWS offers a suite of services conducive to AIOps, including Amazon CloudWatch for monitoring and observability; AWS CloudTrail for tracking user activity and API usage; Amazon SageMaker for creating repeatable and responsible machine learning workflows; and AWS Lambda for serverless computing, allowing for the automation of response actions based on triggers.
Google Cloud supports AIOps through services like Google Cloud Operations, which provides monitoring, logging and diagnostics across applications on the cloud and on-premises. Google Cloud’s AI and machine learning products include Vertex AI for model training and prediction and BigQuery for fast SQL queries using the processing power of Google’s infrastructure.
Microsoft Azure facilitates AIOps with Azure Monitor for comprehensive monitoring of applications, services and infrastructure. Azure Monitor’s built-in AIOps capabilities help predict capacity usage, enable autoscaling, identify application performance issues and detect anomalous behaviors in virtual machines, containers and other resources. Microsoft Azure Machine Learning (AzureML) offers a cloud-based MLOps environment for training, deploying and managing machine learning models responsibly, securely and at scale.

Platforms specializing in MLOps primarily focus on streamlining the lifecycle of machine learning models, from development to deployment and monitoring. While the core mission centers on making machine learning more accessible, efficient and scalable, their technologies and methodologies indirectly support AIOps by enhancing AI capabilities within IT operations:

Anyscale’s platform, based on Ray, allows for the easy scaling of AI and machine learning applications, including those used in AIOps for tasks like anomaly detection and automated remediation. By facilitating distributed computing, Anyscale helps AIOps systems process large volumes of operational data more efficiently, enabling real-time analytics and decision-making.
Dataiku can be used to create models that predict IT system failures or optimize resource allocation, with features that allow IT teams to quickly deploy and iterate on these models in production environments.
Dataloop’s platform delivers full data lifecycle management and a flexible way to plug in AI models for an end-to-end workflow, allowing users to develop AI applications using their data.
DataRobot is a full AI lifecycle platform that enables IT operations teams to rapidly build, deploy and govern AI solutions, improving operational efficiency and performance.
Domino Data Lab’s platform lets enterprises and their data scientists build, deploy and manage AI on a unified, end-to-end platform. Data, tools, compute, models and projects across all environments are centrally managed so teams can collaborate, monitor production models and standardize best practices for governed AI innovation. This approach is vital for AIOps as it balances the self-service needed by data science teams with complete reproducibility, granular cost tracking and proactive governance for IT operational needs.
Weights & Biases provides tools for experiment tracking, model optimization, and collaboration, crucial for developing and fine-tuning AI models used in AIOps. By offering detailed insights into model performance and facilitating collaboration across teams, Weights & Biases helps ensure that AI models deployed for IT operations are both effective and transparent.

Learn more about NVIDIA’s partner ecosystem and their work at NVIDIA GTC.

Flood forecasting at Google

The river forecast model

Input and training data

Improving on the current state-of-the-art

Looking into the future

NEW RESEARCH

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

Collaborators: Holoportation communication technology with Spencer Fowers and Kwame Darko

NEW RESEARCH

From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions

NEW RESEARCH

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

NEW RESEARCH

Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants

NEW RESEARCH

Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions

Hear More, See More, Chat More

Generative AI for the Win

From the Generative AI Theater

Ecosystem of Acceleration

AI That’s NIMble, AI That’s Quick

ScreenAI

Data generation

LLM-based data generation

Experiments and results

Conclusion

Acknowledgements

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.