On Noisy Evaluation in Federated Hyperparameter Tuning

Evaluating models in federated networks is challenging due to factors such as client subsampling, data heterogeneity, and privacy. These factors introduce noise that can affect hyperparameter tuning algorithms and lead to suboptimal model selection.

Hyperparameter tuning is critical to the success of cross-device federated learning applications. Unfortunately, federated networks face issues of scale, heterogeneity, and privacy, which introduce noise in the tuning process and make it difficult to faithfully evaluate the performance of various hyperparameters. Our work (MLSys’23) explores key sources of noise and surprisingly shows that even small amounts of noise can have a significant impact on tuning methods—reducing the performance of state-of-the-art approaches to that of naive baselines. To address noisy evaluation in such scenarios, we propose a simple and effective approach that leverages public proxy data to boost the evaluation signal. Our work establishes general challenges, baselines, and best practices for future work in federated hyperparameter tuning.

Federated Learning: An Overview

In federated learning (FL), user data remains on the device and only model updates are communicated. (Source: Wikipedia)

Cross-device federated learning (FL) is a machine learning setting that considers training a model over a large heterogeneous network of devices such as mobile phones or wearables. Three key factors differentiate FL from traditional centralized learning and distributed learning:

Scale. Cross-device refers to FL settings with a large number of clients, each with potentially limited local resources, e.g. training a language model across hundreds to millions of mobile phones. These devices face various resource constraints, such as limited upload speed, few local examples, or modest computational capability.

Heterogeneity. Traditional distributed ML assumes each worker/client has a random (identically distributed) sample of the training data. In contrast, in FL, client datasets may be non-identically distributed, with each user’s data generated by a distinct underlying distribution.

Privacy. FL offers a baseline level of privacy since raw user data remains local on each client. However, FL is still vulnerable to post-hoc attacks where the public output of the FL algorithm (e.g. a model or its hyperparameters) can be reverse-engineered and leak private user information. A common approach to mitigate such vulnerabilities is to use differential privacy, which aims to mask the contribution of each client. However, differential privacy introduces noise in the aggregate evaluation signal, which can make it difficult to effectively select models.

Federated Hyperparameter Tuning

Appropriately selecting hyperparameters (HPs) is critical to training quality models in FL. Hyperparameters are user-specified parameters that dictate the process of model training such as the learning rate, local batch size, and number of clients sampled at each round. The problem of tuning HPs is general to machine learning (not just FL). Given an HP search space and search budget, HP tuning methods aim to find a configuration in the search space that optimizes some measure of quality within a constrained budget.

Let’s first look at an end-to-end FL pipeline that considers both the processes of training and hyperparameter tuning. In cross-device FL, we split the clients into two pools for training and validation. Given a hyperparameter configuration \((\lambda_s, \lambda_c)\), we train a model using the training clients (explained in the section “FL Training”). We then evaluate this model on the validation clients, obtaining an error rate/accuracy metric. We can then use the error rate to adjust the hyperparameters and train a new model.

A standard pipeline for tuning hyperparameters in cross-device FL.

The diagram above shows two vectors of hyperparameters \((\lambda_s, \lambda_c)\). These correspond to the hyperparameters of two optimizers: one is server-side and the other is client-side. Next, we describe how these hyperparameters are used during FL training.

FL Training

A typical FL algorithm consists of several rounds of training where each client performs local training followed by aggregation of the client updates. In our work, we experiment with a general framework called FedOPT which was presented in Adaptive Federated Optimization (Reddi et al. 2021). We outline the per-round procedure of FedOPT below:

  1. The server broadcasts the model \(\theta\) to a sampled subset of \(K\) clients.
  2. Each client (in parallel) trains \(\theta\) on their local data \(X_k\) using ClientOPT and obtains an updated model \(\theta_k\).
  3. Each client sends \(\theta_k\) back to the server.
  4. The server averages all the received models: \(\theta' = \frac{1}{K} \sum_k p_k \theta_k\).
  5. To update \(\theta\), the server computes the difference \(\theta - \theta'\) and feeds it as a pseudo-gradient into ServerOPT (rather than computing a gradient w.r.t. some loss function).
The FedOPT framework and the five hyperparameters \((\lambda_s, \lambda_c)\) we consider tuning. (Source: edited from Wikipedia)

Steps 2 and 5 of FedOPT each require a gradient-based optimization algorithm (ClientOPT and ServerOPT, respectively), which specifies how to update \(\theta\) given some update vector. In our work, we focus on an instantiation of FedOPT called FedAdam, which uses Adam (Kingma and Ba 2014) as ServerOPT and SGD as ClientOPT. We focus on tuning five FedAdam hyperparameters: two for client training (SGD’s learning rate and batch size) and three for server aggregation (Adam’s learning rate, 1st-moment decay, and 2nd-moment decay).
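To make the round structure concrete, below is a rough Python sketch of one FedAdam round. The helpers sample_clients and local_sgd, the attribute client.data, and the uniform averaging weights are hypothetical placeholders rather than the exact implementation from the paper; the server optimizer is assumed to be torch.optim.Adam over the server model's parameters.

import copy
import torch

def fedadam_round(server_model, server_opt, clients, num_sampled, client_lr, batch_size):
    # Step 1: broadcast the current model theta to a sampled subset of K clients.
    sampled = sample_clients(clients, num_sampled)          # hypothetical sampler
    local_models = []
    for client in sampled:
        local = copy.deepcopy(server_model)
        # Step 2: ClientOPT = local SGD on the client's own data (hypothetical helper).
        local_sgd(local, client.data, lr=client_lr, batch_size=batch_size)
        local_models.append(local)                          # Step 3: send theta_k back

    # Step 4: average the received models (uniform weights p_k for simplicity).
    names = [n for n, _ in server_model.named_parameters()]
    avg = {n: torch.stack([m.state_dict()[n] for m in local_models]).mean(dim=0)
           for n in names}

    # Step 5: ServerOPT = Adam applied to the pseudo-gradient (theta - theta').
    server_opt.zero_grad()
    for name, param in server_model.named_parameters():
        param.grad = param.data - avg[name]
    server_opt.step()

In this sketch, client_lr and batch_size are the two client-side hyperparameters, while Adam’s learning rate and two moment decays (set when constructing server_opt) are the three server-side ones.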

FL Evaluation

Now, we discuss how FL settings introduce noise to model evaluation. Consider the example below. We have \(K=4\) configurations (grey, blue, red, green) and we want to figure out which configuration has the best average accuracy across \(N=5\) clients. More specifically, each “configuration” is a set of HP values (learning rate, batch size, etc.) that are fed into an FL training algorithm (more details in the next section). This produces a model we can evaluate. If we can evaluate every model on every client, then our evaluation is noiseless. In this case, we would be able to accurately determine that the green model performs the best. However, generating all the evaluations as shown below is not practical, as evaluation costs scale with both the number of configurations and clients.

HP tuning without noise. Every configuration is evaluated on every client, which allows us to find the best (green) configuration.

Below, we show an evaluation procedure that is more realistic in FL. As the primary challenge in cross-device FL is scale, we evaluate models using only a random subsample of clients. This is shown in the figure by red ‘X’s and shaded-out phones. We cover three additional sources of noise in FL which can negatively interact with subsampling and introduce even more noise into the evaluation procedure:

Data heterogeneity. FL clients may have non-identically distributed data, meaning that the evaluations on various models can differ between clients. This is shown by the different histograms next to each client. Data heterogeneity is intrinsic to FL and is critical for our observations on noisy evaluation; if all clients had identical datasets, there would be no need to sample more than one client.

Systems heterogeneity. In addition to data heterogeneity, clients may have heterogeneous system capabilities. For example, some clients have better network reception and computational hardware, which allows them to participate in training and evaluation more frequently. This biases performance towards these clients, leading to a poor overall model.

Differential privacy. Using the evaluation output (i.e. the top-performing model), a malicious party can infer whether or not a particular client participated in the FL procedure. At a high level, differential privacy aims to mask user contributions by adding noise to the aggregate evaluation metric. However, this additional noise can make it difficult to faithfully evaluate HP configurations.

As the figure above shows, once we account for client subsampling, data heterogeneity, and differential privacy, noisy evaluations can lead to suboptimal model selection. The combination of all these factors leads us to incorrectly choose the red model over the green one.

Experimental Results

The first goal of our work is to investigate the impact of four sources of noisy evaluation that we outlined in the section “FL Evaluation”. In more detail, these are our research questions:

  1. How does subsampling validation clients affect HP tuning performance?
  2. How do the following factors interact with/exacerbate issues of subsampling?
    • data heterogeneity (shuffling validation clients’ datasets)
    • systems heterogeneity (biased client subsampling)
    • privacy (adding Laplace noise to the aggregate evaluation)
  3. In noisy settings, how do SOTA methods compare to simple baselines?

Surprisingly, we show that state-of-the-art HP tuning methods can perform catastrophically poorly, even worse than simple baselines (e.g., random search). While we only show results for CIFAR10, results on three other datasets (FEMNIST, StackOverflow, and Reddit) can be found in our paper. CIFAR10 is partitioned such that each client has at most two out of the ten total labels.

Noise hurts random search

This section investigates questions 1 and 2 using random search (RS) as the hyperparameter tuning method. RS is a simple baseline that randomly samples several HP configurations, trains a model for each one, and returns the highest-performing model (i.e. the example in “FL Evaluation”, if the configurations were sampled independently from the same distribution). Generally, each hyperparameter value is sampled from a (log) uniform or normal distribution.
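As a minimal sketch (with hypothetical sample_config, train_fl, and evaluate_on_clients helpers standing in for the actual pipeline), random search with subsampled client evaluation might look like the following:

import random

def random_search(num_configs, validation_clients, subsample_frac):
    best_config, best_error = None, float("inf")
    for _ in range(num_configs):
        config = sample_config()                      # e.g. log-uniform learning rates
        model = train_fl(config)                      # federated training with FedAdam
        num_sampled = max(1, int(subsample_frac * len(validation_clients)))
        clients = random.sample(validation_clients, num_sampled)
        error = evaluate_on_clients(model, clients)   # noisy whenever subsample_frac < 1
        if error < best_error:
            best_config, best_error = config, error
    return best_config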

Random search with varying only client subsampling (left) and varying both client subsampling and data heterogeneity (right).

Client subsampling. We run RS while varying the client subsampling rate from a single client to the full validation client pool. “Best HPs” indicates the best HPs found across all trials of RS. As we subsample fewer clients (left), random search performs worse (higher error rate).

Data heterogeneity. We run RS on three separate validation partitions with varying degrees of data heterogeneity based on the label distributions on each client. Client subsampling generally harms performance but has a greater impact on performance when the data is heterogeneous (IID Fraction = 0 vs. 1).

Random search with varying systems heterogeneity (left) and privacy budget (right). Both factors interact negatively with client subsampling.

Systems heterogeneity. We run RS and bias the client sampling to reflect four degrees of systems heterogeneity. Based on the model that is currently being evaluated, we assign a higher probability of sampling clients who perform well on this model. Sampling bias leads to worse performance since the biased evaluations are overly optimistic and do not reflect performance over the entire validation pool.

Privacy. We run RS with 5 different evaluation privacy budgets \(\varepsilon\). We add noise sampled from \(\text{Lap}(M/(\varepsilon |S|))\) to the aggregate evaluation, where \(M\) is the number of evaluations (16), \(\varepsilon\) is the privacy budget (each curve), and \(|S|\) is the number of clients sampled for an evaluation (x-axis). A smaller privacy budget requires sampling a larger raw number of clients to achieve reasonable performance.
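Concretely, a single noisy, privacy-preserving evaluation might be implemented along the following lines; client_error is a hypothetical helper and the sensitivity analysis is simplified relative to the paper.

import random
import numpy as np

def private_evaluation(model, validation_clients, num_sampled, epsilon, num_evals=16):
    # Subsample |S| validation clients and average the model's error over them.
    sampled = random.sample(validation_clients, num_sampled)
    aggregate = np.mean([client_error(model, c) for c in sampled])

    # Add Laplace noise with scale M / (epsilon * |S|), where M is the total
    # number of evaluations the tuning run will perform.
    scale = num_evals / (epsilon * num_sampled)
    return aggregate + np.random.laplace(loc=0.0, scale=scale)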

Noise hurts complex methods more than RS

Seeing that noise adversely affects random search, we now focus on question 3: Do the same observations hold for more complex tuning methods? In the next experiment, we compare 4 representative HP tuning methods.

  • Random Search (RS) is a naive baseline.
  • Tree-Structured Parzen Estimator (TPE) is a selection-based method. These methods build a surrogate model that predicts the performance of various hyperparameters, rather than making predictions for the task at hand (e.g. image or language data).
  • Hyperband (HB) is an allocation-based method. These methods allocate more resources to the most promising configurations. Hyperband initially samples a large number of configurations but stops training most of them after the first few rounds.
  • Bayesian Optimization + Hyperband (BOHB) is a combined method that uses both the sampling strategy of TPE and the partial evaluations of HB.
Examples of (a) selection-based and (b) allocation-based HP tuning methods. (a) uses a surrogate model of the search space to sample the next configuration (numbered in order of exploration), while (b) randomly samples many configurations and adaptively allocates resources to the most promising ones. (Source: Hyperband (Li et al. 2018))
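To make the allocation idea concrete, below is a minimal sketch of successive halving, the early-stopping principle underlying Hyperband; the evaluate callback is a hypothetical stand-in that trains a configuration for a given number of FL rounds and returns its (noisy) validation error.

def successive_halving(configs, evaluate, initial_rounds=1, eta=2):
    # Train all configurations briefly, keep the best 1/eta, and give the
    # survivors eta times more rounds at each subsequent rung.
    rounds = initial_rounds
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: evaluate(c, rounds))   # noisy rankings
        configs = ranked[: max(1, len(configs) // eta)]               # early-stop the rest
        rounds *= eta
    return configs[0]

Every elimination decision in this sketch is made from an early, partial evaluation, which is exactly where the noise sources above can discard a good configuration.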

We report the error rate of each HP tuning method (y-axis) at a given budget of rounds (x-axis). Surprisingly, we find that the relative ranking of these methods can be reversed when the evaluation is noisy. With noise, the performance of all methods degrades, but the degradation is particularly extreme for HB and BOHB. Intuitively, this is because these two methods already inject noise into the HP tuning procedure via early stopping, which interacts poorly with additional sources of noise. Therefore, these results indicate a need for HP tuning methods that are specialized for FL, as many of the guiding principles for traditional hyperparameter tuning may not be effective at handling noisy evaluation in FL.

We compare 4 HP tuning methods in noiseless vs. noisy FL settings. In the noiseless setting (left), we always sample all the validation clients and do not consider privacy. In the noisy setting (right), we sample 1% of validation clients and have a generous privacy budget of \(\varepsilon=100\).

Proxy evaluation outperforms noisy evaluation

In practical FL settings, a practitioner may have access to public proxy data which can be used to train models and select hyperparameters. However, given two distinct datasets, it is unclear how well hyperparameters can transfer between them. First, we explore the effectiveness of hyperparameter transfer between four datasets. Below, we see that the CIFAR10-FEMNIST and StackOverflow-Reddit pairs (top left, bottom right) show the clearest transfer between the two datasets. One likely reason for this is that these task pairs use the same model architecture: CIFAR10 and FEMNIST are both image classification tasks while StackOverflow and Reddit are next-word prediction tasks.

We experimented with 4 datasets in our work (CIFAR10, FEMNIST, StackOverflow, and Reddit). For each pair of datasets, we randomly sample 128 configurations and plot each configuration at the coordinates corresponding to the error rate on the two datasets.

Given the appropriate proxy dataset, we show that a simple method called one-shot proxy random search can perform extremely well. The algorithm has two steps:

  1. Run a random search using the proxy data to both train and evaluate HPs. We assume the proxy data is both public and server-side, so we can always evaluate HPs without subsampling clients or adding privacy noise.
  2. The output configuration from step 1 is used to train a model on the training client data. Since we pass only a single configuration to this step, validation client data does not affect hyperparameter selection at all.
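Below is a rough sketch of these two steps, reusing the hypothetical sample_config helper from the earlier random-search sketch and adding hypothetical train_centralized, evaluate_centralized, and train_fl_with_config helpers:

def one_shot_proxy_random_search(num_configs, proxy_data, training_clients):
    # Step 1: tune HPs entirely on public, server-side proxy data.
    # The evaluation here is noiseless: no client subsampling and no privacy noise.
    best_config, best_error = None, float("inf")
    for _ in range(num_configs):
        config = sample_config()
        model = train_centralized(config, proxy_data)
        error = evaluate_centralized(model, proxy_data)
        if error < best_error:
            best_config, best_error = config, error

    # Step 2: train once on the private training clients with the selected config.
    # Validation clients play no role in hyperparameter selection.
    return train_fl_with_config(best_config, training_clients)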

In each experiment, we choose one of these datasets to be partitioned among the clients and use the other three datasets as server-side proxy datasets. Our results show that proxy data can be an effective solution. Even if the proxy dataset is not an ideal match for the private client data, it may be the only available solution under a strict privacy budget. This is shown in the FEMNIST plot, where the orange/red lines (text datasets) perform similarly to the \(\varepsilon=10\) curve.

We compare tuning HPs using noisy evaluations on the private dataset (with 1% client subsampling and varying privacy budget \(\varepsilon\)) versus noiseless evaluations on the proxy dataset. The proxy HP tuning methods appear as horizontal lines because they are one-shot.

Conclusion

In conclusion, our study suggests several best practices for federated HP tuning:

  • Use simple HP tuning methods.
  • Sample a sufficiently large number of validation clients.
  • Evaluate a representative set of clients.
  • If available, use proxy data to boost the evaluation signal.

Furthermore, we identify several directions for future work in federated HP tuning:

  1. Tailoring HP tuning methods for differential privacy and FL. Early stopping methods are inherently noisy/biased and the large number of evaluations they use is at odds with privacy. Another useful direction is to investigate HP methods specific to noisy evaluation.
  2. More detailed cost evaluation. In our work, we only considered the number of training rounds as our resource budget. However, practical FL settings consider a wide variety of costs, such as total communication, amount of local training, or total time to train a model.
  3. Combining proxy and client data for HP tuning. A key issue of using public proxy data for HP tuning is that the best proxy dataset is not known in advance. One direction to address this is to design methods that combine public and private evaluations to mitigate bias from proxy data and noise from private data. Another promising direction is to rely on the abundance of public data and design a method that can select the best proxy dataset.

Read More

Creature Feature: Safari Across 5 Animal-Focused AI Initiatives of 2023

Whether abundant, endangered or extinct, animal species are the focus of countless AI-powered conservation projects.

These initiatives — accelerated using NVIDIA GPUs, deep learning software and robotics technology — are alerting conservationists to poaching threats, powering more sustainable aquaculture and helping scientists monitor coral reef health.

Take a safari through the NVIDIA Blog’s top animal stories of 2023 below.

As a bonus, discover how animals — whether beautiful butterflies, flashy fish or massive mammoths — are inspiring a herd of digital artists.

Protecting Pangolins From Poachers

Conservation AI, a U.K.-based nonprofit, is preserving biodiversity with an edge AI platform that analyzes camera footage in real time to identify species of interest, rapidly alerting conservationists to threats such as wildfires or poachers.

Founded by researchers at Liverpool John Moores University, the nonprofit now has dozens of cameras deployed across the globe running an AI platform built using NVIDIA Jetson modules, the NVIDIA DeepStream software development kit and NVIDIA Triton Inference Server.

The AI software is being deployed in Uganda and South Africa to keep an eye on pangolins and rhinos at risk of being hunted by poachers.

Video courtesy of Chester Zoo, a U.K.-based conservation society. 

Bringing Colossal Insights to a Woolly Problem

Colossal Biosciences is tackling endangered species conservation and de-extinction using computational biology.

Using gene editing technology, AI models and the NVIDIA Parabricks software suite for genomic analysis, scientists at Colossal are working to bring back the woolly mammoth, the dodo bird and the Tasmanian tiger — and protect dwindling species such as the African forest elephant.

After combining incomplete DNA sequences from extinct species’ bone samples with genomic data from closely related creatures, the team uses Parabricks for sequence alignment and variant calling — enabling them to complete analysis 12x faster.

Enabling Efficient Fish Farming

GoSmart, a member of the NVIDIA Inception program for cutting-edge startups and the NVIDIA Metropolis vision AI partner ecosystem, is deploying AI for more efficient and sustainable fish farming.

The company’s compact edge AI system, powered by the NVIDIA Jetson platform, analyzes a pond or tank’s temperature and oxygen levels, as well as the average weight and population distribution of fish — information farmers can use for decisions around fish feeding and harvesting.

The team is also adding AI models that analyze fish behavior and indicators of disease and plans to integrate its solution with autonomous feeding systems.

AI Can See Your (Coral Reef) Halo

Researchers at the University of Hawaii at Mānoa are analyzing satellite imagery using NVIDIA GPU-powered AI to track halos — the rings of sand that surround coral reefs — as a way to assess ecosystem health.

The presence of halos indicates that a coral reef has a healthy population of marine life, including fish and invertebrates. A change in their shape suggests a degrading environment that needs attention from conservationists.

The researchers’ AI tool, which runs on an NVIDIA RTX A6000 GPU, can analyze hundreds of coral reef halos in around two minutes, a task that would take 10 hours for a human to complete.

Reef-Roving Robot Tracks Undersea Life

An autonomous underwater robot powered by the NVIDIA Jetson Orin NX module is roaming reef ecosystems to help scientists understand human impact on reefs and surrounding sea creatures.

Developed by researchers at MIT and the Woods Hole Oceanographic Institution’s Autonomous Robotics and Perception Laboratory, the CUREE robot collects environmental data that informs 3D models of reefs.

The team developed an AI model called DeepSeeColor that cleans up blurry underwater footage to enable more accurate fish detection by another neural network. They’re also working on detection models to identify audio samples from different creatures.

Fantastic Fauna: Animals Inspire AI-Powered Digital Art

Greek philosopher Plato said that art imitates life — and digital art is no exception, as exemplified by artists who this year used NVIDIA technology to develop stunning animal-inspired visuals.

Honoring marine life, BBC Studios’ Alessandro Mastronardi created a series of incredibly realistic shark videos and renders in Blender and NVIDIA Omniverse, a platform for connecting and building custom 3D tools and applications with Universal Scene Description (OpenUSD).

Taiwanese artist Steven Tung took a more whimsical approach in The Given Fish, an animation depicting stone fish created using Adobe Photoshop, Adobe Substance 3D Painter and Blender — all accelerated by an NVIDIA GeForce RTX 4090 GPU-powered system.

London-based Dominic Harris used GPU-accelerated AI to render a real-time collage of 13,000 imagined butterflies, which exhibit-goers could make flutter or change color. And Keerthan Sathya, based in Bangalore, used the NVIDIA Omniverse platform to render a mammoth-themed animation.

Read More

That’s a Wrap: GeForce NOW Celebrates Another Year of High-Performance Cloud Gaming

Before ringing in the new year, GeForce NOW is taking a look back at a 2023 full of top-notch gaming. Explore GeForce NOW’s year in review, which brought more hit games, improved service features and the launch of the Ultimate membership tier.

Plus, GFN Thursday is raising a toast to the GeForce NOW community by delivering more than 40 new games to stream from the cloud.

Wrapping It Up

It’s been an amazing year of cloud gaming. The launch of the Ultimate tier brought high-performance cloud gaming across North America and Europe, streaming from newly rolled out GeForce RTX 4080 SuperPODs.

For the first time in the cloud, members could stream up to 240 frames per second, or 4K 120 fps on the native PC and Mac apps, and experience support for ultrawide resolutions for the smoothest and most immersive gameplay — all thanks to the powerful NVIDIA Ada Lovelace GPU architecture.

To prove the power of the cloud, NVIDIA gave gamers the ultimate test of latency on KovaaK’s, a latency-sensitive, first-person-shooter aim trainer.

GeForce NOW Kovaak's Ultimate Challenge Leaderboard

Gamers competed for top scores on the leaderboard, and the results were staggering — showing a 1.8x improvement in aiming just by playing with an Ultimate membership.

NVIDIA also posed a challenge to Cyberpunk 2077 fans: to play the graphics-intensive game on an unknown system. Players were astonished to discover that they were playing with full ray tracing on a Chromebook with GeForce NOW. NVIDIA even brought the experience to The Game Awards, showing off the power of gaming on a Chromebook with GeForce NOW on a global stage.

With higher-performance streaming came more collaborations with top-quality publishers.

NVIDIA and Microsoft partnership for GeForce NOW
A new window to the cloud.

NVIDIA and Microsoft signed a 10-year partnership this year, bringing hit Xbox PC games and over 100 supported PC Game Pass titles to the cloud, with more to come. Members could stream some of the biggest Xbox PC titles, including the Wolfenstein and Forza Horizon franchises, Starfield, and the Ori and Age of Empires series, across devices at high performance for the first time.

Call of Duty on GeForce NOW
The cloud is calling.

With the Microsoft partnership came the first Activision game in the cloud, Call of Duty: Modern Warfare III. With NVIDIA DLSS 3 and Reflex technologies, Ultimate members can get the highest frame rates and lowest latencies for the smoothest gameplay.

Monster Hunter: World on GeForce NOW
Hear me roar.

Celebrated publisher Capcom also brought to the cloud some of its top games, including Monster Hunter Rise, Monster Hunter: World and Dragon’s Dogma: Dark Arisen.

90+ titles streaming with RTX ON.
Catch the most impressive ray-traced lighting in the cloud.

The year closed out with a celebration spotlighting 500 NVIDIA RTX-supported games and applications. Over 90 titles with RTX ON are featured on GeForce NOW, easily found on the app’s dedicated RTX ON row, including top games Cyberpunk 2077, Alan Wake II, Far Cry 6, Control and more.

GeForce NOW Stats 2023
Pretty impressive.

The GeForce NOW community put up some impressive numbers, streaming over 250 million hours in the cloud.

And it doesn’t stop there — check back in each week to see what’s in store for GeForce NOW throughout the new year.

In With the New

To celebrate the amazing GeForce NOW community, the cloud gaming service is adding more than 40 Xbox and PC Game Pass titles this week — sure to tide members into the new year.

The best way to experience these games and the over 100 PC Game Pass titles in the cloud is with the latest GeForce NOW membership bundle, which includes a free, three-month PC Game Pass subscription with the purchase of a six-month GeForce NOW Ultimate membership.

Catch the full list of 46 games:

  • AI: THE SOMNIUM FILES – nirvanA Initiative (Xbox, available on the Microsoft Store)
  • Amazing Cultivation Simulator (Xbox, available on the Microsoft Store)
  • The Anacrusis (Xbox, available on the Microsoft Store)
  • Age of Wonders 4 (Xbox, available on the Microsoft Store)
  • Before We Leave (Xbox, available on the Microsoft Store)
  • Century: Age of Ashes (Xbox, available on the Microsoft Store)
  • Chorus (Xbox, available on the Microsoft Store)
  • Control (Xbox, available on the Microsoft Store)
  • Darksiders III (Xbox, available on the Microsoft Store)
  • Destroy All Humans! (Xbox, available on the Microsoft Store)
  • Disgaea 4 Complete+ (Xbox, available on the Microsoft Store)
  • Edge of Eternity (Xbox, available on the Microsoft Store)
  • Europa Universalis IV (Xbox, available on the Microsoft Store)
  • Evil Genius 2: World Domination (Xbox, available on the Microsoft Store)
  • Fae Tactics (Xbox, available on the Microsoft Store)
  • Farming Simulator 17 (Xbox, available on the Microsoft Store)
  • The Forgotten City (Xbox, available on the Microsoft Store)
  • Human Fall Flat (Xbox, available on PC Game Pass)
  • Immortal Realms: Vampire Wars (Xbox, available on the Microsoft Store)
  • Lethal League Blaze (Xbox, available on the Microsoft Store)
  • Martha is Dead (Xbox, available on the Microsoft Store)
  • Matchpoint – Tennis Championships (Xbox, available on the Microsoft Store)
  • Maneater (Xbox, available on the Microsoft Store)
  • The Medium (Xbox, available on the Microsoft Store)
  • Metro Exodus (Xbox, available on the Microsoft Store)
  • Mortal Shell (Xbox, available on the Microsoft Store)
  • MotoGP 20 (Xbox, available on the Microsoft Store)
  • Moving Out (Xbox, available on the Microsoft Store)
  • MUSYNX (Xbox, available on the Microsoft Store)
  • Neon Abyss (Xbox, available on PC Game Pass)
  • Observer: System Redux (Xbox, available on the Microsoft Store)
  • Pathologic 2 (Xbox, available on the Microsoft Store)
  • The Pedestrian (Xbox, available on the Microsoft Store)
  • Raji: An Ancient Epic (Xbox, available on the Microsoft Store)
  • Recompile (Xbox, available on the Microsoft Store)
  • Remnant: From the Ashes (Xbox, available on PC Game Pass)
  • Remnant II (Xbox, available on PC Game Pass)
  • Richman 10 (Xbox, available on the Microsoft Store)
  • Sable (Xbox, available on the Microsoft Store)
  • SpellForce 3: Soul Harvest (Xbox, available on the Microsoft Store)
  • Surgeon Simulator 2 (Xbox, available on the Microsoft Store)
  • Sword and Fairy 7 (Xbox, available on PC Game Pass)
  • Tainted Grail: Conquest (Xbox, available on the Microsoft Store)
  • Tinykin (Xbox, available on the Microsoft Store)
  • Worms W.M.D (Xbox, available on the Microsoft Store)
  • Worms Rumble (Xbox, available on the Microsoft Store)

What are you planning to play this weekend? Let us know on Twitter or in the comments below.

Read More

Tune In to the Top 5 NVIDIA Videos of 2023

2023 was marked by the generative AI boom, representing a new era for how artificial intelligence can be used across industries.

The year’s top videos from the NVIDIA YouTube channel reflect this focus, with popular videos highlighting the technology powering large language models, new platforms for building generative AI applications and how accelerated computing and AI can advance climate science.

And don’t miss replays of NVIDIA founder and CEO Jensen Huang’s event appearances — his GTC keynote in March has garnered 22 million views, making it by far the most-viewed video on the channel.

Tune in to NVIDIA’s top five videos of the year:

Predicting Extreme Weather Risk — Weeks in Advance

Explore in colorful detail how running FourCastNet — an AI framework developed by researchers at NVIDIA, Caltech and Lawrence Berkeley Lab — on NVIDIA GPUs enables quicker, more accurate extreme weather predictions.

Accelerating Carbon Capture and Storage

Buckle up — learn how reservoir engineers are using NVIDIA Omniverse, NVIDIA Modulus and accelerated computing to optimize carbon capture, ensuring long-term storage and safer operations.

Visualizing Global-Scale Climate Data

Seeing is achieving with this stunning demo of the NVIDIA Earth-2 platform, which offers high-resolution climate visualizations for scientists, as well as breathtakingly detailed urban airflow information for architects and city planners.

A Tour of the NVIDIA DGX H100

Presenting the engine behind the large language model breakthrough — the NVIDIA DGX H100. Hear from Huang on why DGX is “the essential instrument of AI.”

Fine-Tuning Generative AI With NVIDIA AI Workbench 

Check out this demo — featuring a multitude of Toy Jensens — to learn how NVIDIA AI Workbench streamlines selecting foundation models, building project environments and fine-tuning models with domain-specific data.

Read More

5 Ways AI Created Smarter Spaces in 2023

With all the talk of how generative AI is going to change the world, it’s worth looking back on how AI’s already enabled leaps and bounds.

NVIDIA helped automate airport operations, vehicle manufacturing, industrial inspections and more with AI to create smarter spaces in 2023.

Airport AI Takes Off

Toronto Pearson International Airport in June deployed the Zensors AI platform, which uses security cameras to generate spatial data to help optimize operations. Zensors is a member of NVIDIA Metropolis, a partner program for improving operations with visual data and AI, and NVIDIA Inception, a free program that nurtures cutting-edge startups.

The Zensors platform uses anonymized data to count travelers in lines, identify congested areas and predict passenger wait times — and it can send alerts to help speed operations. Other startups have landed in this space to reduce flight delays.

“Zensors is making visual AI easy for all to use,” said Anuraag Jain, the company’s cofounder and head of product and technology.

Inspect Your Gadget

Taiwanese manufacturers like Foxconn Industrial Internet, Pegatron, Quanta and Wistron are embracing NVIDIA Metropolis for Factories to enable automated optical inspections.

Pegatron makes motherboards, smartphones, laptops, game consoles and much more. It uses Metropolis for Factories to support its printed circuit board factories, achieving 99.8% accuracy on its automated optical inspection systems.

How’s that for a smarter workspace?

PepsiCo’s AI Pop

Beverage giant PepsiCo has deployed vision AI from KoiReader Technologies, an NVIDIA Metropolis partner, for efficiency gains in reading warehouse labels.

The startup’s technology is being tapped to train and run the deep learning models behind PepsiCo’s AI label and barcode scanning system.

“If you find the right lever, you could dramatically improve our throughput,” said Greg Bellon, senior director of digital supply chain at PepsiCo.

Driving Digital Production

With NVIDIA Omniverse — a collaborative platform for developing Universal Scene Description applications to design, plan and operate manufacturing and assembly facilities — Mercedes-Benz is using digital twins for production.

Harnessing Omniverse, Mercedes-Benz can interact directly with its suppliers, reducing coordination processes by 50%.

“Using NVIDIA Omniverse and AI, Mercedes-Benz is building a connected, digital-first approach to optimize its manufacturing processes, ultimately reducing construction time and production costs,” said Rev Lebaredian, vice president of Omniverse and simulation technology at NVIDIA.

Juicing AI for Batteries

Smart spaces often begin in the virtual world.

Siemens showcased an immersive digital model for a look into future FREYR Battery factories, powered by Omniverse.

The industrial giant demoed a blueprint for how teams can harness comprehensive digital twins virtually using models of existing and future plants. The technologies aim to help FREYR meet surging demand for high-density, cost-effective battery cells.

That’s AI to get charged up about.

Learn about building smart spaces with NVIDIA Metropolis. Learn about connecting and developing OpenUSD applications with NVIDIA Omniverse.

Read More

Ear-resistible: 5 AI Podcast Episodes That Perked Up Listeners in 2023

NVIDIA’s AI Podcast had its best year yet — with a record-breaking 1.2 million plays in 2023 and each biweekly episode now drawing more than 30,000 listens.

Among tech’s top podcasts, the AI Podcast has racked up more than 200 episodes and nearly 5 million total plays since its debut in 2016.

Listeners across the globe tune in for smart interviews on generative AI and large language models, as well as more offbeat topics, like how AI is tackling challenges such as building a self-driving baby stroller or discovering alien signals.

Here are five episodes that drew tens of thousands of listeners in 2023:

Gen AI Enables Scientific Leaps

Caltech’s Anima Anandkumar discusses generative AI’s potential to make splashes in the scientific community. The technology can, for example, be fed DNA, RNA, viral and bacterial data to craft a model that understands the language of genomes, or predict extreme-weather events like hurricanes and heat waves.

Class in Session: AI for Learning

The future of online education and the revolutionary impact of AI on the learning experience were the central themes discussed by Anant Agarwal, founder of edX and chief platform officer at 2U. The MIT professor and edtech pioneer also highlighted the implementation of AI-powered features in the edX platform, including a ChatGPT plug-in.

AI Gets Coding

The world increasingly runs on code. Accelerating the work of those who create that code will boost their productivity — and that’s just what AI startup Codeium, a member of the NVIDIA Inception program for startups, aims to do. The company’s leaders Varun Mohan and Jeff Wang talk about AI’s transformational role in software development.

Mindful Machine-Making

Julia Stoyanovich, associate professor of computer science and engineering at NYU and director of the university’s Center for Responsible AI, discusses how to make the terms “AI” and “responsible AI” synonymous.

AI for Regeneration, Scar Prevention

Scientists at Matice Biosciences are applying AI to study the regeneration of tissues in animals known as super-regenerators, such as salamanders and planarians. Cofounder Jessica Whited, a regenerative biologist at Harvard University, discusses how the research could unlock new treatments to help humans heal from injuries without scarring.

Subscribe to the AI Podcast

Get the AI Podcast through Amazon Music, iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn.

Make the AI Podcast better by filling out this listener survey.

Read More

NVIDIA Holiday Card Glows Gold and Green on Cold Winter’s Eve

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. We’re also deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically accelerate content creation.

NVIDIA’s holiday card — enchanting viewers from the perspective of snuggled-up family members on a couch — warmly depicts a crackling fireplace and an NVIDIA robo-dog by the hearth, all framed by a string of sparkling lights.

In the scene, shown above, characters are decked out in NVIDIA-themed socks and under blankets with the pattern from a custom NVIDIA holiday sweater. Detail-oriented viewers can discover hidden treasures: a virtual toy model of NVIDIA founder and CEO Jensen Huang — aka Toy Jensen — NVIDIA iconography in the woodwork and an NVIDIA-branded mug.

Members of NVIDIA’s creative team who are featured in this week’s special In the NVIDIA Studio beat — Alessandro Baldasseroni, Michael Johnson and Rini Sugianto — collaborated to build this 3D scene. They combined 60 years of creative experience, AI-powered features and NVIDIA RTX GPU acceleration in their favorite creative apps to incredible effect.

Plus, the latest version of Reallusion iClone, a real-time 3D animation software, offers a crowd-creation system for populating large worlds in NVIDIA Omniverse, a platform that interconnects 3D workflows for live-sync creation.

Populate Virtual Worlds in NVIDIA Omniverse

Reallusion iClone helps artists bring lifelike movement and realistic facial expressions to 3D models.

iClone version 8.4 builds on these capabilities with a simulation system that provides real-time, customizable crowd animations using Motion Director, a cutting-edge motion-matching and trigger-animation technology.

With it, artists can effortlessly spawn lifelike characters complete with facial expressions, accessories and diverse animation styles. The characters can then be directed to intelligently navigate 3D spaces while avoiding collisions and obstacles.

iClone supports live sync with NVIDIA Omniverse Kit-based apps, allowing users to more seamlessly tap its vast libraries of characters and motions.

iClone version 8.4 is free to download. Learn more about the release details.

Averkin’s at It Again

Seasoned In the NVIDIA Studio artist Andrew Averkin can’t help but spread holiday cheer.

His 3D scene Keep Me Warm seamlessly transitions between the immaculately detailed parts of a holiday-themed room. The Christmas trees, bright lights and children’s toys all feature photorealistic detail sure to move viewers, and calming music adds to the scene’s coziness.

Averkin built Keep Me Warm in NVIDIA Omniverse, which is based on the Universal Scene Description framework, aka OpenUSD.

Such inspirational, winter-themed content is just what the NVIDIA Studio team is looking for in the #WinterArtChallenge. Don’t forget to share winter-themed art with the hashtag on Facebook, Instagram or X for a chance to be featured on NVIDIA Studio and NVIDIA Omniverse social media channels.

And check out Averkin’s Instagram for more engaging content.

Deck the Halls With Tons of Renders

“The goal was to create something that invoked warmth, joy and holiday spirit,” said Johnson on ideating for this year’s NVIDIA holiday card. “There’s nothing better than being with family, cuddled up on the couch, enjoying each other’s time while wearing something really cozy and relaxing.”

The NVIDIA artists created foreground characters starting with basic elements from the trio’s collective asset library.

Baldasseroni took the lead on modeling and tweaking the characters in ZBrush, working closely with Johnson on the right composition, and even provided preliminary posing to help guide the character feel for a relaxed family portrait.

 

Moving to Adobe Substance 3D Painter, Baldasseroni created and applied custom textures to the models. His NVIDIA RTX A6000 GPU accelerated light and ambient occlusion, baking optimized assets in mere seconds.

“I used GPU acceleration in Adobe Substance Painter and worked with preliminary lookdev renders in the NVIDIA Iray engine.” — Alessandro Baldasseroni

Sugianto took on animation work, opening Autodesk Maya where her NVIDIA RTX A6000 GPU provided several key advantages.

RTX-accelerated ray tracing and AI-powered denoising with the Autodesk Arnold renderer resulted in highly interactive and photorealistic modeling.

Autodesk Maya also supports third-party GPU-accelerated renderers, such as V-Ray, OctaneRender and Redshift, which gave Sugianto more options to animate the scene.

 

With the assets beautifully modeled, textured and animated, Johnson imported all files into the NVIDIA Omniverse USD Composer app to add physically accurate properties for the realistic fire, candle lighting and smoke.

“NVIDIA RTX GPU rendering in USD Composer is so fast at enabling quick iterations and different looks,” said Johnson.

Johnson used OpenUSD files in USD Composer, allowing Baldasseroni and Sugianto to review Johnson’s edits in real time with fully ray-traced details. This eliminated the need to download, upload and reformat files to share and consolidate feedback from other stakeholders, saving valuable time and resources.

 

Johnson then rendered out still images into Adobe Photoshop for final color grading. He further improved visual quality by upscaling the image using the AI-powered, RTX-accelerated Super Resolution feature — which is significantly faster than traditional methods. Throughout his workflow, Johnson could choose from more than 30 GPU-accelerated features, including blur gallery, object selection, liquify, smart sharpen and perspective.

 

He then uploaded files into Nuke, a visual-effects and video-editing software, for final GPU-accelerated compositing of all the scene’s elements.

NVIDIA artists Alessandro Baldasseroni, Michael Johnson and Rini Sugianto.

Check out Baldasseroni, Johnson and Sugianto on Instagram.

Follow NVIDIA Studio on Facebook, Instagram and X. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter. 

Read More

Amazon SageMaker model parallel library now accelerates PyTorch FSDP workloads by up to 20%

Large language model (LLM) training has surged in popularity over the last year with the release of several popular models such as Llama 2, Falcon, and Mistral. Customers are now pre-training and fine-tuning LLMs ranging from 1 billion to over 175 billion parameters to optimize model performance for applications across industries, from healthcare to finance and marketing.

Training performant models at this scale can be a challenge. Highly accurate LLMs can require terabytes of training data and thousands or even millions of hours of accelerator compute time to achieve target accuracy. To complete training and launch products in a timely manner, customers rely on parallelism techniques to distribute this enormous workload across up to thousands of accelerator devices. However, these parallelism techniques can be difficult to use: different techniques and libraries are only compatible with certain workloads or restricted to certain model architectures, training performance can be highly sensitive to obscure configurations, and the state of the art is quickly evolving. As a result, machine learning practitioners must spend weeks of preparation to scale their LLM workloads to large clusters of GPUs.

In this post, we highlight new features of the Amazon SageMaker model parallel (SMP) library that simplify the large model training process and help you train LLMs faster. In particular, we cover the SMP library’s new simplified user experience that builds on open source PyTorch Fully Sharded Data Parallel (FSDP) APIs, expanded tensor parallel functionality that enables training models with hundreds of billions of parameters, and performance optimizations that reduce model training time and cost by up to 20%.

To learn more about the SageMaker model parallel library, refer to SageMaker model parallelism library v2 documentation. You can also refer to our example notebooks to get started.

New features that simplify and accelerate large model training

This post discusses the latest features included in the v2.0 release of the SageMaker model parallel library. These features improve the usability of the library, expand functionality, and accelerate training. In the following sections, we summarize the new features and discuss how you can use the library to accelerate your large model training.

Aligning SMP with open source PyTorch

Since its launch in 2020, SMP has enabled high-performance, large-scale training on SageMaker compute instances. With this latest major version release of SMP, the library simplifies the user experience by aligning its APIs with open source PyTorch.

PyTorch offers Fully Sharded Data Parallelism (FSDP) as its main method for supporting large training workloads across many compute devices. As demonstrated in the following code snippet, SMP’s updated APIs for techniques such as sharded data parallelism mirror those of PyTorch. You can simply run import torch.sagemaker and use it in place of torch.

## training_script.py
import torch.sagemaker as tsm
tsm.init()

# Set up a PyTorch model
model = ...

# Wrap the PyTorch model using the PyTorch FSDP module
model = FSDP(
    model,
    ...
)

optimizer = ...
...

With these updates to SMP’s APIs, you can now realize the performance benefits of SageMaker and the SMP library without overhauling your existing PyTorch FSDP training scripts. This paradigm also allows you to use the same code base when training on premises as on SageMaker, simplifying the user experience for customers who train in multiple environments.

For more information on how to enable SMP with your existing PyTorch FSDP training scripts, refer to Get started with SMP.

Integrating tensor parallelism to enable training on massive clusters

This release of SMP also expands PyTorch FSDP’s capabilities to include tensor parallelism techniques. One problem with using sharded data parallelism alone is that you can encounter convergence problems as you scale up your cluster size. This is because sharding parameters, gradients, and the optimizer state across data parallel ranks also increases your global batch size; on large clusters, the global batch size can be pushed past the point beyond which the model no longer converges. You need to incorporate an additional parallelism technique that doesn’t require an increase in global batch size as you scale your cluster.

To mitigate this problem, SMP v2.0 introduces the ability to compose sharded data parallelism with tensor parallelism. Tensor parallelism allows the cluster size to increase without changing the global batch size or affecting model convergence. With this feature, you can safely increase training throughput by provisioning clusters with 256 nodes or more.
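As a back-of-the-envelope illustration (the numbers here are hypothetical, not benchmarks), consider a 256-node cluster with 8 accelerators per node and a fixed per-device batch size:

devices = 256 * 8                  # 2,048 accelerators in the cluster
per_device_batch = 4

# Sharded data parallelism alone: every device is a data-parallel rank.
global_batch_sdp = per_device_batch * devices             # 8,192 samples per step

# Composing tensor parallelism of degree 8 divides the number of
# data-parallel ranks by 8, keeping the global batch size in a stable range.
tensor_parallel_degree = 8
data_parallel_ranks = devices // tensor_parallel_degree   # 256
global_batch_tp = per_device_batch * data_parallel_ranks  # 1,024 samples per step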

Today, tensor parallelism with PyTorch FSDP is only available with SMP v2. SMP v2 allows you to enable this technique with a few lines of code change and unlock stable training even on large clusters. SMP v2 integrates with Transformer Engine for its implementation of tensor parallelism and makes it compatible with the PyTorch FSDP APIs. You can enable PyTorch FSDP and SMP tensor parallelism simultaneously without making any changes to your PyTorch model or PyTorch FSDP configuration. The following code snippets show how to set up the SMP configuration dictionary in JSON format and how to add the SMP initialization module torch.sagemaker.init() to your training script; the module picks up the configuration dictionary in the backend when you start the training job.

The SMP configuration is as follows:

{
    "tensor_parallel_degree": 8,
    "tensor_parallel_seed": 0
}

In your training script, use the following code:

import torch.sagemaker as tsm
tsm.init()

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(...)
model = tsm.transform(model)

To learn more about using tensor parallelism in SMP, refer to the tensor parallelism section of our documentation.

Use advanced features to accelerate model training by up to 20%

In addition to enabling distributed training on clusters with hundreds of instances, SMP also offers optimization techniques that can accelerate model training by up to 20%. In this section, we highlight a few of these optimizations. To learn more, refer to the core features section of our documentation.

Hybrid sharding

Sharded data parallelism is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across devices. This smaller memory footprint allows you to fit a larger model into your cluster or increase the batch size. However, sharded data parallelism also increases the communication requirements of your training job because the sharded model artifacts are frequently gathered from different devices during training. In this way, the degree of sharding is an important configuration that trades off memory consumption and communication overhead.

By default, PyTorch FSDP shards model artifacts across all of the accelerator devices in your cluster. Depending on your training job, this method of sharding could increase communication overhead and create a bottleneck. To help with this, the SMP library offers configurable hybrid sharded data parallelism on top of PyTorch FSDP. This feature allows you to set the degree of sharding that is optimal for your training workload. Simply specify the degree of sharding in a configuration JSON object and include it in your SMP training script.

The SMP configuration is as follows:

{ "hybrid_shard_degree": 16 }

To learn more about the advantages of hybrid sharded data parallelism, refer to Near-linear scaling of gigantic-model training on AWS. For more information on implementing hybrid sharding with your existing FSDP training script, see hybrid sharded data parallelism in our documentation.
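For intuition, open source PyTorch FSDP exposes a coarser version of the same idea through its HYBRID_SHARD strategy, which shards model state within each node and replicates it across nodes; the snippet below shows that native option purely as a point of comparison and is not the SMP API.

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = ...  # your PyTorch model

# Shard parameters, gradients, and optimizer state within a node and replicate
# across nodes, reducing cross-node AllGather traffic at the cost of extra memory.
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)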

Use the SMDDP collective communication operations optimized for AWS infrastructure

You can use the SMP library with the SageMaker distributed data parallelism (SMDDP) library to accelerate your distributed training workloads. SMDDP includes an optimized AllGather collective communication operation designed for best performance on SageMaker p4d and p4de accelerated instances. In distributed training, collective communication operations are used to synchronize information across GPU workers. AllGather is one of the core collective communication operations typically used in sharded data parallelism to materialize the layer parameters before forward and backward computation steps. For training jobs that are bottlenecked by communication, faster collective operations can reduce training time and cost with no side effects on convergence.

To use the SMDDP library, you only need to add two lines of code to your training script:

import torch.distributed as dist

# Initialize with SMDDP
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp") # Replacing "nccl"

# Initialize with SMP
import torch.sagemaker as tsm
tsm.init()

In addition to SMP, SMDDP supports open source PyTorch FSDP and DeepSpeed. To learn more about the SMDDP library, see Run distributed training with the SageMaker distributed data parallelism library.

Activation offloading

Typically, the forward pass of model training computes activations at each layer and keeps them in GPU memory until the backward pass for the corresponding layer finishes. These stored activations can consume significant GPU memory during training. Activation offloading is a technique that instead moves these tensors to CPU memory after the forward pass and later fetches them back to GPU when they are needed. This approach can substantially reduce GPU memory usage during training.

Although PyTorch supports activation offloading, its implementation is inefficient and can cause GPUs to be idle while activations are fetched back from CPU during a backward pass. This can cause significant performance degradation when using activation offloading.

SMP v2 offers an optimized activation offloading algorithm that can improve training performance. SMP’s implementation pre-fetches activations before they are needed on the GPU, reducing idle time.

Because SMP is built on top of PyTorch’s APIs, enabling optimized activation offloading requires just a few lines of code change. Simply add the associated configurations (sm_activation_offloading and activation_loading_horizon parameters) and include them in your training script.

The SMP configuration is as follows:

{
    "activation_loading_horizon": 2,
    "sm_activation_offloading": True
}

In the training script, use the following code:

import torch.sagemaker as tsm
tsm.init()

# Native PyTorch module for activation offloading
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    offload_wrapper,
)

model = FSDP(...)

# Activation offloading requires activation checkpointing.
apply_activation_checkpointing(
    model,
    check_fn=checkpoint_tformer_layers_policy,
)

model = offload_wrapper(model)

To learn more about the open source PyTorch checkpoint tools for activation offloading, see the checkpoint_wrapper.py script in the PyTorch GitHub repository and Activation Checkpointing in the PyTorch blog post Scaling Multimodal Foundation Models in TorchMultimodal with PyTorch Distributed. To learn more about SMP’s optimized implementation of activation offloading, see the activation offloading section of our documentation.

Beyond hybrid sharding, SMDDP, and activation offloading, SMP offers additional optimizations that can accelerate your large model training workload. This includes optimized activation checkpointing, delayed parameter initialization, and others. To learn more, refer to the core features section of our documentation.

Conclusion

As datasets, model sizes, and training clusters continue to grow, efficient distributed training is increasingly critical for timely and affordable model and product delivery. The latest release of the SageMaker model parallel library helps you achieve this by reducing code change and aligning with PyTorch FSDP APIs, enabling training on massive clusters via tensor parallelism and optimizations that can reduce training time by up to 20%.

To get started with SMP v2, refer to our documentation and our sample notebooks.


About the Authors

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Luis Quintela is the Software Developer Manager for the AWS SageMaker model parallel library. In his spare time, he can be found riding his Harley in the SF Bay Area.

Gautam Kumar is a Software Engineer with AWS AI Deep Learning. He is passionate about building tools and systems for AI. In his spare time, he enjoys biking and reading books.

Rahul Huilgol is a Senior Software Development Engineer in Distributed Deep Learning at Amazon Web Services.


2023: A year of groundbreaking advances in AI and computing

2023: A year of groundbreaking advances in AI and computing

This has been a year of incredible progress in the field of Artificial Intelligence (AI) research and its practical applications.

As ongoing research pushes AI even farther, we look back to our perspective published in January of this year, titled “Why we focus on AI (and to what end),” where we noted:

We are committed to leading and setting the standard in developing and shipping useful and beneficial applications, applying ethical principles grounded in human values, and evolving our approaches as we learn from research, experience, users, and the wider community.

We also believe that getting AI right — which to us involves innovating and delivering widely accessible benefits to people and society, while mitigating its risks — must be a collective effort involving us and others, including researchers, developers, users (individuals, businesses, and other organizations), governments, regulators, and citizens.

We are convinced that the AI-enabled innovations we are focused on developing and delivering boldly and responsibly are useful, compelling, and have the potential to assist and improve lives of people everywhere — this is what compels us.

In this Year-in-Review post we’ll go over some of Google Research’s and Google DeepMind’s efforts putting these paragraphs into practice safely throughout 2023.

Advances in products & technologies

This was the year generative AI captured the world’s attention, creating imagery, music, stories, and engaging conversation about everything imaginable, at a level of creativity and a speed almost implausible a few years ago.

In February, we first launched Bard, a tool that you can use to explore creative ideas and explain things simply. It can generate text, translate languages, write different kinds of creative content and more.

In May, we watched the results of months and years of our foundational and applied work announced on stage at Google I/O. Principally, this included PaLM 2, a large language model (LLM) that brought together compute-optimal scaling, an improved dataset mixture, and model architecture to excel at advanced reasoning tasks.

By fine-tuning and instruction-tuning PaLM 2 for different purposes, we were able to integrate it into numerous Google products and features, including:

  • An update to Bard, which enabled multilingual capabilities. Since its initial launch, Bard is now available in more than 40 languages and over 230 countries and territories, and with extensions, Bard can find and show relevant information from Google tools used every day — like Gmail, Google Maps, YouTube, and more.
  • Search Generative Experience (SGE), which uses LLMs to reimagine both how to organize information and how to help people navigate through it, creating a more fluid, conversational interaction model for our core Search product. This work extended the search engine experience from primarily focused on information retrieval into something much more — capable of retrieval, synthesis, creative generation and continuation of previous searches — while continuing to serve as a connection point between users and the web content they seek.
  • MusicLM, a text-to-music model powered by AudioLM and MuLAN, which can generate music from text, humming, images, or video, as well as musical accompaniments for singing.
  • Duet AI, our AI-powered collaborator that provides users with assistance when they use Google Workspace and Google Cloud. Duet AI in Google Workspace, for example, helps users write, create images, analyze spreadsheets, draft and summarize emails and chat messages, and summarize meetings. Duet AI in Google Cloud helps users code, deploy, scale, and monitor applications, as well as identify and accelerate resolution of cybersecurity threats.
  • And many other developments.

In June, following last year’s release of our text-to-image generation model Imagen, we released Imagen Editor, which lets users interactively edit generated images using region masks and natural language prompts, providing much more precise control over the model output.

Later in the year, we released Imagen 2, which improved outputs via a specialized image aesthetics model based on human preferences for qualities such as good lighting, framing, exposure, and sharpness.

In October, we launched a feature that helps people practice speaking and improve their language skills. The key technology that enabled this functionality was a novel deep learning model developed in collaboration with the Google Translate team, called Deep Aligner. This single new model has led to dramatic improvements in alignment quality across all tested language pairs, reducing average alignment error rate from 25% to 5% compared to alignment approaches based on Hidden Markov models (HMMs).

In November, in partnership with YouTube, we announced Lyria, our most advanced AI music generation model to date. We released two experiments designed to open a new playground for creativity, DreamTrack and music AI tools, in concert with YouTube’s Principles for partnering with the music industry on AI technology.

Then in December, we launched Gemini, our most capable and general AI model. Gemini was built to be multimodal from the ground up across text, audio, image and videos. Our initial family of Gemini models comes in three different sizes, Nano, Pro, and Ultra. Nano models are our smallest and most efficient models for powering on-device experiences in products like Pixel. The Pro model is highly-capable and best for scaling across a wide range of tasks. The Ultra model is our largest and most capable model for highly complex tasks.

In a technical report about Gemini models, we showed that Gemini Ultra’s performance exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks used in LLM research and development. With a score of 90.04%, Gemini Ultra was the first model to outperform human experts on MMLU, and achieved a state-of-the-art score of 59.4% on the new MMMU benchmark.

Building on AlphaCode, the first AI system to perform at the level of the median competitor in competitive programming, we introduced AlphaCode 2 powered by a specialized version of Gemini. When evaluated on the same platform as the original AlphaCode, we found that AlphaCode 2 solved 1.7x more problems and performed better than 85% of competition participants.

At the same time, Bard got its biggest upgrade with its use of the Gemini Pro model, making it far more capable at things like understanding, summarizing, reasoning, coding, and planning. In six out of eight benchmarks, Gemini Pro outperformed GPT-3.5, including in MMLU, one of the key standards for measuring large AI models, and GSM8K, which measures grade school math reasoning. Gemini Ultra will come to Bard early next year through Bard Advanced, a new cutting-edge AI experience.

Gemini Pro is also available on Vertex AI, Google Cloud’s end-to-end AI platform that empowers developers to build applications that can process information across text, code, images, and video. Gemini Pro was also made available in AI Studio in December.

To best illustrate some of Gemini’s capabilities, we produced a series of short videos explaining and demonstrating what Gemini can do.

ML/AI Research

In addition to our advances in products and technologies, we’ve also made a number of important advancements in the broader fields of machine learning and AI research.

At the heart of the most advanced ML models is the Transformer model architecture, developed by Google researchers in 2017. Originally developed for language, it has proven useful in domains as varied as computer vision, audio, genomics, protein folding, and more. This year, our work on scaling vision transformers demonstrated state-of-the-art results across a wide variety of vision tasks, and has also been useful in building more capable robots.

Expanding the versatility of models requires the ability to perform higher-level and multi-step reasoning. This year, we approached this target following several research tracks. For example, algorithmic prompting is a new method that teaches language models reasoning by demonstrating a sequence of algorithmic steps, which the model can then apply in new contexts. This approach improves accuracy on one middle-school mathematics benchmark from 25.9% to 61.1%.

By providing algorithmic prompts, we can teach a model the rules of arithmetic via in-context learning.
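As a rough, hypothetical illustration (not the exact prompts from the paper), an algorithmic prompt writes out every intermediate step, such as the digit sums and carries in addition, so the model can imitate the procedure on new problems:

# A hypothetical algorithmic prompt: each intermediate step of the addition
# algorithm (digit sums and carries) is spelled out explicitly.
ALGORITHMIC_PROMPT = """\
Problem: 128 + 367.
Add the ones digits: 8 + 7 = 15. Write 5, carry 1.
Add the tens digits plus the carry: 2 + 6 + 1 = 9. Write 9, carry 0.
Add the hundreds digits plus the carry: 1 + 3 + 0 = 4. Write 4.
Answer: 495.

Problem: 251 + 684.
"""

# The worked example is prepended to a new problem and sent to the model;
# `generate` stands in for whichever LLM client you use (hypothetical here).
# response = generate(ALGORITHMIC_PROMPT)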

In the domain of visual question answering, in a collaboration with UC Berkeley researchers, we showed how we could better answer complex visual questions (“Is the carriage to the right of the horse?”) by combining a visual model with a language model trained to answer visual questions by synthesizing a program to perform multi-step reasoning.
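A sketch of the kind of program the language model might synthesize for that question is shown below; the helper functions (find, horizontal_center) are hypothetical stand-ins for calls into a visual model, not an actual API:

# Hypothetical program synthesized for: "Is the carriage to the right of the horse?"
def is_carriage_right_of_horse(image):
    horse_box = find(image, "horse")        # visual model returns a bounding box
    carriage_box = find(image, "carriage")
    # Compare the horizontal centers of the two detected boxes.
    return horizontal_center(carriage_box) > horizontal_center(horse_box)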

We are now using a general model that understands many aspects of the software development life cycle to automatically generate code review comments, respond to code review comments, make performance-improving suggestions for pieces of code (by learning from past such changes in other contexts), fix code in response to compilation errors, and more.

In a multi-year research collaboration with the Google Maps team, we were able to scale inverse reinforcement learning and apply it to the world-scale problem of improving route suggestions for over 1 billion users. Our work culminated in a 16–24% relative improvement in global route match rate, helping to ensure that routes are better aligned with user preferences.

We also continue to work on techniques to improve the inference performance of machine learning models. In work on computationally-friendly approaches to pruning connections in neural networks, we were able to devise an approximation algorithm to the computationally intractable best-subset selection problem that is able to prune 70% of the edges from an image classification model and still retain almost all of the accuracy of the original.

In work on accelerating on-device diffusion models, we were also able to apply a variety of optimizations to attention mechanisms, convolutional kernels, and fusion of operations to make it practical to run high quality image generation models on-device; for example, enabling “a photorealistic and high-resolution image of a cute puppy with surrounding flowers” to be generated in just 12 seconds on a smartphone.

Advances in capable language and multimodal models have also benefited our robotics research efforts. We combined separately trained language, vision, and robotic control models into PaLM-E, an embodied multi-modal model for robotics, and Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalized instructions for robotic control.

RT-2 architecture and training: We co-fine-tune a pre-trained vision-language model on robotics and web data. The resulting model takes in robot camera images and directly predicts actions for a robot to perform.

Furthermore, we showed how language can also be used to control the gait of quadrupedal robots and explored the use of language to help formulate more explicit reward functions to bridge the gap between human language and robotic actions. Then, in Barkour we benchmarked the agility limits of quadrupedal robots.

Algorithms & optimization

Designing efficient, robust, and scalable algorithms remains a high priority. This year, our work included: applied and scalable algorithms, market algorithms, system efficiency and optimization, and privacy.

We introduced AlphaDev, an AI system that uses reinforcement learning to discover enhanced computer science algorithms. AlphaDev uncovered a faster algorithm for sorting, a method for ordering data, which led to improvements in the LLVM libc++ sorting library that were up to 70% faster for shorter sequences and about 1.7% faster for sequences exceeding 250,000 elements.

We developed a novel model to predict the properties of large graphs, enabling estimation of performance for large programs. We released a new dataset, TPUGraphs, to accelerate open research in this area, and showed how we can use modern ML to improve ML efficiency.

The TPUGraphs dataset has 44 million graphs for ML program optimization.

We developed a new load balancing algorithm for distributing queries to a server, called Prequal, which minimizes a combination of requests-in-flight and estimated latency. Deployments across several systems have significantly reduced CPU usage, latency, and RAM consumption. We also designed a new analysis framework for the classical caching problem with capacity reservations.

Heatmaps of normalized CPU usage transitioning to Prequal at 08:00.
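A simplified sketch in this spirit (not the deployed Prequal algorithm) probes a few replicas and routes each query to the one with the lowest combined score of in-flight requests and estimated latency; the replica attribute names here are illustrative:

import random

def pick_replica(replicas, num_probes=3, latency_weight=1.0):
    # Probe a small random subset of replicas rather than all of them.
    probed = random.sample(replicas, min(num_probes, len(replicas)))
    # Route to the probed replica minimizing a combination of
    # requests-in-flight and estimated latency.
    return min(
        probed,
        key=lambda r: r.in_flight + latency_weight * r.estimated_latency_ms,
    )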

We improved the state of the art in clustering and graph algorithms by developing new techniques for computing minimum-cut, approximating correlation clustering, and massively parallel graph clustering. Additionally, we introduced TeraHAC, a novel hierarchical clustering algorithm for trillion-edge graphs, designed a text clustering algorithm for better scalability while maintaining quality, and designed the most efficient algorithm for approximating the Chamfer Distance, the standard similarity function for multi-embedding models, offering >50× speedups over highly-optimized exact algorithms and scaling to billions of points.
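For reference, one common (one-directional) form of the Chamfer distance sums, for each point in set A, the distance to its nearest neighbor in set B. The naive exact computation below is the O(n·m) baseline that such approximation algorithms speed up; it is a sketch, not the paper’s method:

import numpy as np

def chamfer_distance(A, B):
    # A: (n, d) array of points, B: (m, d) array of points.
    # Pairwise Euclidean distances, shape (n, m).
    dists = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))
    # For each point in A, take the distance to its nearest neighbor in B.
    return dists.min(axis=1).sum()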

We continued optimizing Google’s large embedding models (LEMs), which power many of our core products and recommender systems. Some new techniques include Unified Embedding for battle-tested feature representations in web-scale ML systems and Sequential Attention, which uses attention mechanisms to discover high-quality sparse model architectures during training.


This year, we also continued our research in market algorithms to design computationally efficient marketplaces and causal inference. First, we remain committed to advancing the rapidly growing interest in ads automation for which our recent work explains the adoption of autobidding mechanisms and examines the effect of different auction formats on the incentives of advertisers. In the multi-channel setting, our findings shed light on how the choice between local and global optimizations affects the design of multi-channel auction systems and bidding systems.


Beyond auto-bidding systems, we also studied auction design in other complex settings, such as buy-many mechanisms, auctions for heterogeneous bidders, contract designs, and robust online bidding algorithms. Motivated by the application of generative AI in collaborative creation (e.g., a joint ad for multiple advertisers), we proposed a novel token auction model in which LLMs bid for influence in the collaborative AI creation. Finally, we showed how to mitigate personalization effects in experimental design, which, for example, may cause recommendations to drift over time.

The Chrome Privacy Sandbox, a multi-year collaboration between Google Research and Chrome, has publicly launched several APIs, including for Protected Audience, Topics, and Attribution Reporting. This is a major step in protecting user privacy while supporting the open and free web ecosystem. These efforts have been facilitated by fundamental research on re-identification risk, private streaming computation, optimization of privacy caps and budgets, hierarchical aggregation, and training models with label privacy.

Science and society

In the not too distant future, there is a very real possibility that AI applied to scientific problems can accelerate the rate of discovery in certain domains by 10× or 100×, or more, and lead to major advances in diverse areas including bioengineering, materials science, weather prediction, climate forecasting, neuroscience, genetic medicine, and healthcare.

Sustainability and climate change

In Project Green Light, we partnered with 13 cities around the world to help improve traffic flow at intersections and reduce stop-and-go emissions. Early numbers from these partnerships indicate a potential for up to 30% reduction in stops and up to 10% reduction in emissions.

In our contrails work, we analyzed large-scale weather data, historical satellite images, and past flights. We trained an AI model to predict where contrails form and reroute airplanes accordingly. In partnership with American Airlines and Breakthrough Energy, we used this system to demonstrate a 54% reduction in contrails.

Contrails detected over the United States using AI and GOES-16 satellite imagery.

We are also developing novel technology-driven approaches to help communities manage the effects of climate change. For example, we have expanded our flood forecasting coverage to 80 countries, which directly impacts more than 460 million people. We have initiated a number of research efforts to help mitigate the increasing danger of wildfires, including real-time tracking of wildfire boundaries using satellite imagery, and work that improves emergency evacuation plans for communities at risk from rapidly spreading wildfires. Our partnership with American Forests puts data from our Tree Canopy project to work in their Tree Equity Score platform, helping communities identify and address unequal access to trees.

Finally, we continued to develop better models for weather prediction at longer time horizons. Improving on MetNet and MetNet-2, in this year’s work on MetNet-3 we now outperform traditional numerical weather simulations for forecasts up to twenty-four hours ahead. In the area of medium-term, global weather forecasting, our work on GraphCast showed significantly better prediction accuracy for up to 10 days compared to HRES, the most accurate operational deterministic forecast produced by the European Centre for Medium-Range Weather Forecasts (ECMWF). In collaboration with ECMWF, we released WeatherBench-2, a benchmark for evaluating the accuracy of weather forecasts in a common framework.

A selection of GraphCast’s predictions rolling across 10 days showing specific humidity at 700 hectopascals (about 3 km above surface), surface temperature, and surface wind speed.

Health and the life sciences

The potential of AI to dramatically improve processes in healthcare is significant. Our initial Med-PaLM model was the first model capable of achieving a passing score on the U.S. medical licensing exam. Our more recent Med-PaLM 2 model improved by a further 19%, achieving an expert-level accuracy of 86.5%. These Med-PaLM models are language-based, enable clinicians to ask questions and have a dialogue about complex medical conditions, and are available to healthcare organizations as part of MedLM through Google Cloud.

In the same way our general language models are evolving to handle multiple modalities, we have recently shown research on a multimodal version of Med-PaLM capable of interpreting medical images, textual data, and other modalities, describing a path for how we can realize the exciting potential of AI models to help advance real-world clinical care.

Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same model weights.

We have also been working on how best to harness AI models in clinical workflows. We have shown that coupling deep learning with interpretability methods can yield new insights for clinicians. We have also shown that self-supervised learning, with careful consideration of privacy, safety, fairness and ethics, can reduce the amount of de-identified data needed to train clinically relevant medical imaging models by 3×–100×, reducing the barriers to adoption of models in real clinical settings. We also released an open source mobile data collection platform for people with chronic disease to provide tools to the community to build their own studies.

AI systems can also discover completely new signals and biomarkers in existing forms of medical data. In work on novel biomarkers discovered in retinal images, we demonstrated that a number of systemic biomarkers spanning several organ systems (e.g., kidney, blood, liver) can be predicted from external eye photos. In other work, we showed that combining retinal images and genomic information helps identify some underlying factors of aging.

In the genomics space, we worked with 119 scientists across 60 institutions to create a new map of the human genome, or pangenome. This more equitable pangenome better represents the genomic diversity of global populations. Building on our ground-breaking AlphaFold work, our work on AlphaMissense this year provides a catalog of predictions for 89% of all 71 million possible missense variants as either likely pathogenic or likely benign.

Examples of AlphaMissense predictions overlaid on AlphaFold predicted structures (red – predicted as pathogenic; blue – predicted as benign; grey – uncertain). Red dots represent known pathogenic missense variants, blue dots represent known benign variants. Left: HBB protein. Variants in this protein can cause sickle cell anaemia. Right: CFTR protein. Variants in this protein can cause cystic fibrosis.

We also shared an update on progress towards the next generation of AlphaFold. Our latest model can now generate predictions for nearly all molecules in the Protein Data Bank (PDB), frequently reaching atomic accuracy. This unlocks new understanding and significantly improves accuracy in multiple key biomolecule classes, including ligands (small molecules), proteins, nucleic acids (DNA and RNA), and those containing post-translational modifications (PTMs).

On the neuroscience front, we announced a new collaboration with Harvard, Princeton, the NIH, and others to map an entire mouse brain at synaptic resolution, beginning with a first phase that will focus on the hippocampal formation — the area of the brain responsible for memory formation, spatial navigation, and other important functions.

Quantum computing

Quantum computers have the potential to solve big, real-world problems across science and industry. But to realize that potential, they must be significantly larger than they are today, and they must reliably perform tasks that cannot be performed on classical computers.

This year, we took an important step towards the development of a large-scale, useful quantum computer. Our breakthrough is the first demonstration of quantum error correction, showing that it’s possible to reduce errors while also increasing the number of qubits. To enable real-world applications, these qubit building blocks must perform more reliably, lowering the error rate from the ~1 in 10^3 typically seen today to ~1 in 10^8.

Responsible AI research

Design for Responsibility

Generative AI is having a transformative impact in a wide range of fields including healthcare, education, security, energy, transportation, manufacturing, and entertainment. Given these advances, the importance of designing technologies consistent with our AI Principles remains a top priority. We also recently published case studies of emerging practices in society-centered AI. And in our annual AI Principles Progress Update, we offer details on how our Responsible AI research is integrated into products and risk management processes.

Proactive design for Responsible AI begins with identifying and documenting potential harms. For example, we recently introduced a three-layered context-based framework for comprehensively evaluating the social and ethical risks of AI systems. During model design, harms can be mitigated with the use of responsible datasets.

We are partnering with Howard University to build high quality African-American English (AAE) datasets to improve our products and make them work well for more people. Our research on globally inclusive cultural representation and our publication of the Monk Skin Tone scale furthers our commitments to equitable representation of all people. The insights we gain and techniques we develop not only help us improve our own models, they also power large-scale studies of representation in popular media to inform and inspire more inclusive content creation around the world.

Monk Skin Tone (MST) Scale. See more at skintone.google.

With advances in generative image models, fair and inclusive representation of people remains a top priority. In the development pipeline, we are working to amplify underrepresented voices and to better integrate social context knowledge. We proactively address potential harms and bias using classifiers and filters, careful dataset analysis, and in-model mitigations such as fine-tuning, reasoning, few-shot prompting, data augmentation, and controlled decoding. Our research also showed that generative AI enables higher quality safety classifiers to be developed with far less data. And we released a powerful way to better tune models with less data, giving developers more control over responsibility challenges in generative AI.

We have developed new state-of-the-art explainability methods to identify the role of training data on model behaviors. By combining training data attribution methods with agile classifiers, we found that we can identify mislabelled training examples. This makes it possible to reduce the noise in training data, leading to significant improvements in model accuracy.

We initiated several efforts to improve safety and transparency about online content. For example, we introduced SynthID, a tool for watermarking and identifying AI-generated images. SynthID is imperceptible to the human eye, doesn’t compromise image quality, and allows the watermark to remain detectable, even after modifications like adding filters, changing colors, and saving with various lossy compression schemes.

We also launched About This Image to help people assess the credibility of images, showing information like an image’s history, how it’s used on other pages, and available metadata about an image. And we explored safety methods that have been developed in other fields, learning from established situations where there is low-risk tolerance.

SynthID generates an imperceptible digital watermark for AI-generated images.

Privacy remains an essential aspect of our commitment to Responsible AI. We continued improving our state-of-the-art privacy preserving learning algorithm DP-FTRL, developed the DP-Alternating Minimization algorithm (DP-AM) to enable personalized recommendations with rigorous privacy protection, and defined a new general paradigm to reduce the privacy costs for many aggregation and learning tasks. We also proposed a scheme for auditing differentially private machine learning systems.

On the applications front we demonstrated that DP-SGD offers a practical solution in the large model fine-tuning regime and showed that images generated by DP diffusion models are useful for a range of downstream tasks. We proposed a new algorithm for DP training of large embedding models that provides efficient training on TPUs without compromising accuracy.
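As background, the standard (textbook) DP-SGD recipe clips each example’s gradient to a fixed norm and adds Gaussian noise to the summed clipped gradients before averaging. The per-example loop below is an illustrative, unoptimized sketch of that generic recipe, not the implementation used in this work:

import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        # Clip this example's gradient to L2 norm clip_norm.
        total_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        scale = min(1.0, (clip_norm / (total_norm + 1e-12)).item())
        for s, p in zip(summed, params):
            s += p.grad * scale

    n = len(batch_x)
    with torch.no_grad():
        for s, p in zip(summed, params):
            # Add Gaussian noise calibrated to the clipping bound, then average.
            noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
            p -= lr * (s + noise) / n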

We also teamed up with a broad group of academic and industrial researchers to organize the first Machine Unlearning Challenge, which addresses the scenario in which training images must be forgotten to protect the privacy or rights of individuals. We shared a mechanism for extractable memorization, and participatory systems that give users more control over their sensitive data.

We continued to expand the world’s largest corpus of atypical speech recordings to >1M utterances in Project Euphonia, which enabled us to train a Universal Speech Model that improves recognition of atypical speech by 37% on real-world benchmarks.

We also built an audiobook recommendation system for students with reading disabilities such as dyslexia.

Adversarial testing

Our work in adversarial testing engaged community voices from historically marginalized communities. We partnered with groups such as the Equitable AI Research Round Table (EARR) to ensure we represent the diverse communities who use our models and engage with external users to identify potential harms in generative model outputs.

We established a dedicated Google AI Red Team focused on testing AI models and products for security, privacy, and abuse risks. We showed that attacks such as “poisoning” or adversarial examples can be applied to production models and surface additional risks such as memorization in both image and text generative models. We also demonstrated that defending against such attacks can be challenging, as merely applying defenses can cause other security and privacy leakages. We also introduced model evaluation for extreme risks, such as offensive cyber capabilities or strong manipulation skills.

Democratizing AI through tools and education

As we advance the state-of-the-art in ML and AI, we also want to ensure people can understand and apply AI to specific problems. We released MakerSuite (now Google AI Studio), a web-based tool that enables AI developers to quickly iterate and build lightweight AI-powered apps. To help AI engineers better understand and debug AI, we released LIT 1.0, a state-of-the-art, open-source debugger for machine learning models.

Colab, our tool that helps developers and students access powerful computing resources right in their web browser, reached over 10 million users. We’ve just added AI-powered code assistance to all users at no cost — making Colab an even more helpful and integrated experience in data and ML workflows.

One of the most used features is “Explain error” — whenever the user encounters an execution error in Colab, the code assistance model provides an explanation along with a potential fix.

To ensure AI produces accurate knowledge when put to use, we also recently introduced FunSearch, a new approach that generates verifiably true knowledge in mathematical sciences using evolutionary methods and large language models.

For AI engineers and product designers, we’re updating the People + AI Guidebook with generative AI best practices, and we continue to design AI Explorables, including an exploration of how and why models sometimes make incorrect predictions confidently.

Community engagement

We continue to advance the fields of AI and computer science by publishing much of our work and participating in and organizing conferences. We have published more than 500 papers so far this year, and maintain a strong presence at conferences like ICML (see the Google Research and Google DeepMind posts), ICLR (Google Research, Google DeepMind), NeurIPS (Google Research, Google DeepMind), ICCV, CVPR, ACL, CHI, and Interspeech. We are also working to support researchers around the world, participating in events like the Deep Learning Indaba, Khipu, supporting PhD Fellowships in Latin America, and more. We also worked with partners from 33 academic labs to pool data from 22 different robot types and create the Open X-Embodiment dataset and RT-X model to better advance responsible AI development.

Google has spearheaded an industry-wide effort to develop AI safety benchmarks under the MLCommons standards organization with participation from several major players in the generative AI space including OpenAI, Anthropic, Microsoft, Meta, Hugging Face, and more. Along with others in the industry we also co-founded the Frontier Model Forum (FMF), which is focused on ensuring safe and responsible development of frontier AI models. With our FMF partners and other philanthropic organizations, we launched a $10 million AI Safety Fund to advance research into the ongoing development of the tools for society to effectively test and evaluate the most capable AI models.

In close partnership with Google.org, we worked with the United Nations to build the UN Data Commons for the Sustainable Development Goals, a tool that tracks metrics across the 17 Sustainable Development Goals, and supported projects from NGOs, academic institutions, and social enterprises on using AI to accelerate progress on the SDGs.

The items highlighted in this post are a small fraction of the research work we have done throughout the last year. Find out more at the Google Research and Google DeepMind blogs, and our list of publications.

Future vision

As multimodal models become even more capable, they will empower people to make incredible progress in areas from science to education to entirely new areas of knowledge.

Progress continues apace, and as the year advances and our products and research advance as well, people will find more and more interesting creative uses for AI.

Ending this Year-in-Review where we began, as we say in Why We Focus on AI (and to what end):

If pursued boldly and responsibly, we believe that AI can be a foundational technology that transforms the lives of people everywhere — this is what excites us!


This Year-in-Review is cross-posted on both the Google Research Blog and the Google DeepMind Blog.
